Clustering metrics - Data science Well of Knowledge

Clustering metrics are quantitative measures used to evaluate the performance and quality of clustering algorithms. ### Internal Metrics/Intrinsic (No Ground Truth Required) These metrics evaluate clustering performance based only on the data and the clustering result, without requiring true labels: 1. **Silhouette Coefficient**: Measures how similar an object is to its own cluster compared to other clusters. Values range from -1 to 1: +1 - good separation and cohesion, 0 - sample is close to the decision boundary between clusters, -1 - bad separation (too close to the points in a neighboring cluster). $\text{Silhouette}(i) = \frac{b(i) - a(i)}{\max{a(i), b(i)}}$ Where: - $a(i)$ is the mean distance between point $i$ and all other points in the same cluster - $b(i)$ is the mean distance between point $i$ and all points in the nearest neighboring cluster The Silhouette score for a clustering is the mean Silhouette coefficient over all samples: $\text{Silhouette Score} = \frac{1}{n} \sum_{i=1}^{n} \text{Silhouette}(i)$ 2. **Davies-Bouldin Index**: Measures the average similarity between clusters, where similarity is defined as the ratio between within-cluster distances and between-cluster distances. Lower values indicate better clustering (0 is minimum). $\text{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right)$ Where: - $k$ is the number of clusters - $\sigma_i$ is the average distance of all points in cluster $i$ to their cluster centroid - $d(c_i, c_j)$ is the distance between centroids of clusters $i$ and $j$ 3. **Calinski-Harabasz Index** (Variance Ratio Criterion): Ratio of between-cluster variance to within-cluster variance. Higher values indicate better-defined clusters. $\text{CH} = \frac{\text{Tr}(B_k) / (k-1)}{\text{Tr}(W_k) / (n-k)}$ Where: - $B_k$ is the between-cluster dispersion matrix - $W_k$ is the within-cluster dispersion matrix - $\text{Tr}$ is the trace of a matrix - $n$ is the total number of samples - $k$ is the number of clusters 4. **Dunn Index**: Ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. Higher values indicate better clustering. $\text{Dunn} = \frac{\min_{i \neq j} d(c_i, c_j)}{\max_{1 \leq k \leq K} \text{diam}(c_k)}$ Where: - $d(c_i, c_j)$ is the distance between clusters $i$ and $j$ - $\text{diam}(c_k)$ is the diameter of cluster $k$ (maximum distance between any two points in the cluster) 5. **Inertia** (Within-cluster Sum of Squares): Sum of squared distances of samples to their closest cluster center. Lower values indicate more compact clusters. $\text{Inertia} = \sum_{i=1}^{n} \min_{c_j \in C} ||x_i - c_j||^2$ Where: - $x_i$ is the data point - $c_j$ is the center of cluster $j$ - $C$ is the set of all cluster centers ### External Metrics (Ground Truth Required) These metrics compare clustering results against known true labels: 1. **Adjusted Rand Index (ARI)**: Measures the similarity between two clusterings, adjusted for chance. Values range from -1 to 1, with 1 indicating perfect agreement, 0 indicating random assignment, and negative values indicating assignments worse than random. $\text{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}{\frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right] - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}$ Where: - $n_{ij}$ is the number of objects that are in both class $i$ and cluster $j$ - $a_i$ is the number of objects in class $i$ - $b_j$ is the number of objects in cluster $j$ - $n$ is the total number of objects 2. **Normalized Mutual Information (NMI)**: Measures the mutual information between the clustering assignment and the ground truth, normalized by the average entropy of both. Values range from 0 to 1, with 1 indicating perfect agreement. $\text{NMI}(U, V) = \frac{2 \times I(U, V)}{H(U) + H(V)}$ Where: - $I(U, V)$ is the mutual information between clusterings $U$ and $V$ - $H(U)$ and $H(V)$ are the entropies of $U$ and $V$ respectively 3. **Fowlkes-Mallows Score**: Geometric mean of precision and recall. Values range from 0 to 1, with 1 indicating perfect agreement. $\text{FMI} = \sqrt{\frac{TP}{TP + FP} \times \frac{TP}{TP + FN}}$ Where TP, FP, and FN are derived from counting pairs of points that are: - True Positive (TP): in the same cluster in both clusterings - False Positive (FP): in the same cluster in the predicted clustering but not in the ground truth - False Negative (FN): in different clusters in the predicted clustering but in the same cluster in the ground truth 4. **Homogeneity, Completeness, and V-measure**: - **Homogeneity**: Each cluster contains only members of a single class (values from 0 to 1) - **Completeness**: All members of a given class are assigned to the same cluster (values from 0 to 1) - **V-measure**: Harmonic mean of homogeneity and completeness (values from 0 to 1) $\text{V-measure} = 2 \times \frac{\text{homogeneity} \times \text{completeness}}{\text{homogeneity} + \text{completeness}}$ 5. **Contingency Matrix**: A table showing the distribution of data points across predicted clusters and true classes. Not a metric itself, but the foundation for many external metrics. ### Determining Optimal Number of Clusters - **Elbow Method**: Plot inertia against number of clusters and look for the "elbow" point - **Silhouette Method**: Choose the number of clusters that maximizes the Silhouette score - **Gap Statistic**: Compare within-cluster dispersion to that expected under a null reference distribution ## Links - [Scikit-learn documentation on clustering metrics](https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation) - [Wikipedia: Cluster Analysis](https://en.wikipedia.org/wiki/Cluster_analysis)