K-Means Clustering

  • K-Means is a popular clustering algorithm that divides data into K clusters using centroid-based grouping.

    K-Means is an unsupervised learning algorithm used for clustering — grouping similar data points into clusters based on similarity.

    Centroid Concept

    • Each cluster has a centroid (mean position of points in that cluster)

    • The algorithm assigns each data point to the nearest centroid

    • After assignment, centroids are updated iteratively until they stabilize

    Steps

    1. Initialize K centroids randomly

    2. Assign each data point to nearest centroid

    3. Recalculate centroids as mean of points in cluster

    4. Repeat steps 2-3 until centroids do not change
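    The four steps above can be sketched as a minimal NumPy implementation (a teaching sketch, not a production replacement for scikit-learn's `KMeans`; the function name `kmeans` and the tiny dataset are illustrative):

    ```python
    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        """Minimal K-Means: random init, assign, update, repeat until stable."""
        rng = np.random.default_rng(seed)
        # Step 1: initialize K centroids by sampling K distinct data points
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Step 2: assign each point to its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: recompute each centroid as the mean of its assigned points
            new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
            # Step 4: stop once the centroids no longer move
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
    labels, centroids = kmeans(X, k=2)
    ```

    Note this sketch omits details a real implementation handles, such as empty clusters and multiple restarts.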


    K Value Selection

    • K = Number of clusters

    • Choosing the right K is critical for meaningful clusters

    Common Methods:

    • Elbow Method

    • Silhouette Score
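    The second method can be sketched with scikit-learn's `silhouette_score`, which rates how well-separated the clusters are (closer to 1 is better); the dataset below mirrors the example at the end of this lesson:

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

    # Silhouette needs at least 2 clusters and at most n_samples - 1
    scores = {}
    for k in range(2, 5):
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
        print(f"K={k}: silhouette score = {scores[k]:.3f}")
    ```

    The K with the highest silhouette score is chosen; for this data that is K = 2.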


    Elbow Method

    • Plot Inertia (within-cluster sum of squares) vs number of clusters (K)

    • Inertia = Sum of squared distances of points from their cluster centroid

    Inertia = \sum_{i=1}^{K} \sum_{x \in C_i} ||x - \mu_i||^2

    • Look for “elbow point” where inertia stops decreasing sharply → optimal K


    Inertia

    • Measures compactness of clusters

    • Lower inertia → points closer to centroids → tighter clusters

    • Too low inertia → might overfit (too many clusters)
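    The inertia formula above can be verified by hand: summing squared distances of points to their cluster centroid reproduces scikit-learn's `inertia_` attribute (the dataset is the same toy example used later in this lesson):

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

    # Inertia by hand: sum over clusters of squared distances to the centroid
    manual = sum(
        np.sum((X[kmeans.labels_ == i] - c) ** 2)
        for i, c in enumerate(kmeans.cluster_centers_)
    )
    print(manual, kmeans.inertia_)  # the two values agree
    ```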


    Advantages & Limitations

    Advantages

    1. Simple and easy to implement

    2. Fast and efficient on large datasets

    3. Works well for spherical clusters

    Limitations

    1. Need to specify K beforehand

    2. Sensitive to initial centroid placement

    3. Poor performance for non-spherical clusters

    4. Sensitive to outliers
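    Limitation 2 (sensitivity to initial centroid placement) is commonly mitigated in scikit-learn by the `k-means++` seeding strategy combined with several restarts (`n_init`), keeping the run with the lowest inertia; a brief sketch on the same toy data:

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

    # 'k-means++' spreads out the initial centroids; n_init=10 runs the
    # algorithm 10 times and keeps the solution with the lowest inertia
    km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42).fit(X)
    print(km.inertia_)
    ```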

    Example: K-Means Clustering

K-Means Clustering Example in Python with Elbow Method

This Python example demonstrates how to use the K-Means clustering algorithm to group data points into clusters. The code first applies the Elbow Method to determine the optimal number of clusters (K) by analyzing inertia values. After selecting K=2, it trains the K-Means model, predicts cluster labels, and visualizes the clusters along with their centroids using Matplotlib.

# Step 1: Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Step 2: Create Dataset
X = np.array([
    [1, 2],
    [1, 4],
    [1, 0],
    [10, 2],
    [10, 4],
    [10, 0]
])

# Step 3: Elbow Method to find optimal K
inertia = []
K_range = range(1, 6)
for k in K_range:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

# Plot Inertia vs K
plt.plot(K_range, inertia, 'bo-')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()

# Step 4: Apply K-Means with K=2
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# Step 5: Plot Clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='X')
plt.title("K-Means Clustering")
plt.show()
  • Output: the elbow plot (inertia vs K) and the scatter plot of the two clusters with their centroids marked as red X's.