K-means is an unsupervised machine learning algorithm used for partitioning a dataset into K distinct, non-overlapping subgroups (clusters). It uses Clustering metrics for evaluation.
Not to be confused with knn. KNN is a supervised algorithm for classification/regression finding nearest neighbors, K-means is an unsupervised algorithm for clustering finding cluster centroids.

How It Works

  1. Choose the number K of clusters

  2. Initialize K centroids randomly

  3. Repeat until convergence:

    • Assign each data point to the nearest centroid
    • Update the centroids by calculating the mean of all points assigned to that centroid
      The algorithm has converged when the assignments no longer change
  4. Initialization:

  5. Assignment Step:

  6. Update Step:

Where:

  • is the centroid for cluster j
  • is the cluster assignment for data point i
  • is the i-th data point

Distance Metrics

Typically uses Euclidean distance:

Choosing K

  1. Elbow Method: Plot the within-cluster sum of squares (WCSS) against K and look for the “elbow”
  2. Silhouette Analysis: Measure how similar an object is to its own cluster compared to other clusters
  3. Gap Statistic: Compare the total within intra-cluster variation for different values of k with their expected values under null reference distribution of the data

Advantages

  1. Simple
  2. Scales well to large datasets
  3. Guarantees convergence
  4. Can warm-start with known centroids if available

Disadvantages

  1. Needs to specify K beforehand
  2. Sensitive to initial centroid placement
  3. Sensitive to outliers

Variations

K-means++

K-means++ helps the convergence by improving initialization. It selects initial cluster centroids using sampling based on an empirical probability distribution of the points’ contribution to the overall inertia.

Mini-batch K-means (K-means with buckets)

Instead of using the entire dataset in each iteration, Mini-Batch K-means uses small, random batches of data. The process repeats until convergence or a maximum number of iterations is reached.