K-means is an unsupervised machine learning algorithm used for partitioning a dataset into K distinct, non-overlapping subgroups (clusters). Clustering metrics (such as inertia and the silhouette score) are used for evaluation.
Not to be confused with KNN (k-nearest neighbors): KNN is a supervised algorithm for classification/regression that finds the nearest neighbors of a query point, while K-means is an unsupervised algorithm for clustering that finds cluster centroids.
How It Works
Choose the number K of clusters
Initialize K centroids randomly
Repeat until convergence:
Assign each data point to the nearest centroid
Update each centroid to the mean of all points assigned to it
The algorithm has converged when the assignments no longer change
Initialization: $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^n$
Assignment Step: $c^{(i)} := \arg\min_j \lVert x^{(i)} - \mu_j \rVert^2$
Update Step: $\mu_j := \dfrac{\sum_{i=1}^{m} \mathbf{1}\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} \mathbf{1}\{c^{(i)} = j\}}$
Where:
$\mu_j$ is the centroid for cluster $j$
$c^{(i)}$ is the cluster assignment for data point $i$
$x^{(i)}$ is the $i$-th data point
$m$ is the number of data points and $\mathbf{1}\{\cdot\}$ is the indicator function
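As a minimal sketch of the assignment and update steps above in NumPy (the toy data and variable names are illustrative, not from any library):

```python
import numpy as np

# Toy data: 6 points in R^2 and K = 2 centroids (illustrative values)
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.2]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# Assignment step: c(i) = argmin_j ||x(i) - mu_j||^2
sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
assignments = sq_dists.argmin(axis=1)

# Update step: mu_j = mean of the points assigned to cluster j
centroids = np.array([X[assignments == j].mean(axis=0) for j in range(2)])

print(assignments)  # [0 0 0 1 1 1]
print(centroids)    # the means of the two groups
```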
Distance Metrics
Typically uses Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
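For instance, computed directly in NumPy (`np.linalg.norm(x - y)` is equivalent):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# d(x, y) = sqrt(sum_i (x_i - y_i)^2)
d = np.sqrt(((x - y) ** 2).sum())
print(d)  # 5.0 (a 3-4-5 triangle in the first two coordinates)
```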
Choosing K
Elbow Method: Plot the within-cluster sum of squares (WCSS) against K and look for the “elbow” where the curve starts to flatten (sketched below)
Silhouette Analysis: Measure how similar each point is to its own cluster compared to other clusters (also sketched below)
Gap Statistic: Compare the total intra-cluster variation for different values of K with its expected value under a null reference distribution of the data
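A minimal sketch of the elbow and silhouette computations, assuming scikit-learn is available (the synthetic X is a stand-in for real data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # replace with your own data

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss = km.inertia_                     # within-cluster sum of squares
    sil = silhouette_score(X, km.labels_)  # mean silhouette over all points
    print(f"k={k}  WCSS={wcss:.1f}  silhouette={sil:.3f}")
```

Plot WCSS against k and pick the elbow; higher silhouette values indicate better-separated clusters.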
Advantages
Simple to implement and interpret
Scales well to large datasets
Guaranteed to converge (though possibly to a local optimum rather than the global one)
Can warm-start with known centroids if available (sketched below)
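A minimal warm-start sketch, assuming scikit-learn; known_centroids is a hypothetical array carried over from a previous run or from domain knowledge (passing an ndarray as init skips random initialization):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # replace with your data

# Hypothetical centroids from a previous fit or prior knowledge
known_centroids = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])

# n_init=1 since there is no random initialization to restart
km = KMeans(n_clusters=3, init=known_centroids, n_init=1).fit(X)
print(km.cluster_centers_)
```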
Disadvantages
The number of clusters K must be specified beforehand
Sensitive to initial centroid placement
Sensitive to outliers
Variations
K-means++
K-means++ speeds up convergence by improving initialization. It selects initial cluster centroids using sampling based on an empirical probability distribution of each point’s contribution to the overall inertia (so-called D² sampling), which tends to spread the initial centroids apart.
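A standalone sketch of that D² sampling in NumPy (the class-based code example below contains an equivalent method; the function name here is illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(X.shape[0])]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to d^2,
        # i.e. to each point's contribution to the current inertia
        centroids.append(X[rng.choice(X.shape[0], p=d2 / d2.sum())])
    return np.array(centroids)
```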
Mini-batch K-means
Instead of using the entire dataset in each iteration, Mini-Batch K-means updates the centroids using small, random batches of data. This reduces computation and memory per iteration, usually at the cost of slightly worse clusters. The process repeats until convergence or until a maximum number of iterations is reached.
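A minimal sketch of one common formulation (per-centroid learning rates that decay with assignment counts, following Sculley’s mini-batch algorithm); all names are illustrative:

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=100, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(X.shape[0], k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_iters):
        batch = X[rng.choice(X.shape[0], batch_size, replace=False)]
        # Assign each batch point to its nearest centroid
        d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Each assigned point pulls its centroid toward it with a
        # per-centroid learning rate 1/count that decays over time
        for x, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centroids[j] = (1.0 - eta) * centroids[j] + eta * x
    return centroids
```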
Code example
```python
import numpy as np


class KMeans:
    def __init__(self, k=3, max_iters=100):
        self.k = k
        self.max_iters = max_iters

    def kmeans_plus_plus(self, X):
        # K-means++ initialization: sample each new centroid with
        # probability proportional to its squared distance from the
        # nearest centroid chosen so far (D^2 sampling)
        centroids = [X[np.random.randint(X.shape[0])]]
        for _ in range(1, self.k):
            distances = np.min(
                [np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0
            )
            probabilities = distances / distances.sum()
            idx = np.random.choice(X.shape[0], p=probabilities)
            centroids.append(X[idx])
        return np.array(centroids)

    def fit(self, X):
        # K-means++ initialization; for plain random initialization use:
        # self.centroids = X[np.random.choice(X.shape[0], self.k, replace=False)]
        self.centroids = self.kmeans_plus_plus(X)
        labels = np.zeros(X.shape[0], dtype=int)
        for _ in range(self.max_iters):
            # Assignment step: each point goes to its nearest centroid
            distances = np.sqrt(((X - self.centroids[:, np.newaxis]) ** 2).sum(axis=2))
            labels = np.argmin(distances, axis=0)
            # Update step: move each centroid to the mean of its points;
            # keep the old centroid if a cluster ends up empty
            new_centroids = np.array([
                X[labels == i].mean(axis=0) if np.any(labels == i) else self.centroids[i]
                for i in range(self.k)
            ])
            # Converged when the centroids stop moving
            if np.allclose(self.centroids, new_centroids):
                break
            self.centroids = new_centroids
        return labels

    def predict(self, X):
        distances = np.sqrt(((X - self.centroids[:, np.newaxis]) ** 2).sum(axis=2))
        return np.argmin(distances, axis=0)
```
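A quick usage sketch for the class above, on synthetic 2-D data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated Gaussian blobs
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(100, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(100, 2)),
])

model = KMeans(k=3)
labels = model.fit(X)
print(model.centroids)       # should land near (0,0), (5,5), (0,5)
print(model.predict(X[:5]))  # cluster ids for the first five points
```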