Statistical Summary
Purpose
This reference provides statistical tools and interpretation guidelines to evaluate clustering models using both internal metrics and descriptive summaries. It supports unsupervised model inspection, profiling, and stakeholder-ready reporting.
1. Internal Validation Metrics (No Ground Truth Needed)
Silhouette Score
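Measures how close each point is to its own cluster (cohesion) compared with the nearest other cluster (separation). Values range from -1 to 1:
- Closer to 1 = better-defined clusters
- Near 0 = overlapping boundaries
- < 0 = likely misassignment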
from sklearn.metrics import silhouette_score
silhouette_score(X_scaled, cluster_labels)
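The overall score is an average, so it can hide a weak cluster or a handful of badly placed points. A minimal sketch of a per-sample breakdown follows, using `silhouette_samples` from scikit-learn. The synthetic data and the names `X_demo`, `labels_demo`, and `suspect_idx` are illustrative assumptions standing in for `X_scaled` and `cluster_labels`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for X_scaled / cluster_labels (assumption, not the source data).
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# One coefficient per sample; the overall score is the mean of these values.
sample_scores = silhouette_samples(X_demo, labels_demo)
overall_score = silhouette_score(X_demo, labels_demo)

# Samples with a negative coefficient are closer to another cluster than to their own.
suspect_idx = np.flatnonzero(sample_scores < 0)

# Mean coefficient per cluster shows which groups are weakly separated.
per_cluster = {
    int(c): float(sample_scores[labels_demo == c].mean())
    for c in np.unique(labels_demo)
}
```

Reviewing `per_cluster` alongside the overall score helps separate "one poor cluster" from "uniformly mediocre clustering".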
Calinski-Harabasz Index
Ratio of between-cluster to within-cluster dispersion:
- Higher is better
from sklearn.metrics import calinski_harabasz_score
calinski_harabasz_score(X_scaled, cluster_labels)
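To make the ratio concrete, the sketch below recomputes the index by hand as (between-cluster dispersion / (k - 1)) divided by (within-cluster dispersion / (n - k)) and compares it with scikit-learn's result. The helper `calinski_harabasz_manual` and the synthetic data are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic stand-in for X_scaled / cluster_labels (assumption).
X_demo, _ = make_blobs(n_samples=200, centers=4, random_state=1)
labels_demo = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X_demo)


def calinski_harabasz_manual(X, labels):
    """Hypothetical helper: CH index from its dispersion definition."""
    n_samples = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between = 0.0  # size-weighted squared distance of centroids to the global mean
    within = 0.0   # squared distance of samples to their own centroid
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n_samples - k))


manual_ch = calinski_harabasz_manual(X_demo, labels_demo)
sklearn_ch = calinski_harabasz_score(X_demo, labels_demo)
```

Because the index is unbounded and scales with sample size, it is most useful for comparing runs on the same dataset rather than across datasets.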
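Davies-Bouldin Index
Average similarity between each cluster and its most similar neighbor, where similarity is the ratio of within-cluster scatter to the distance between centroids: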
- Lower is better
from sklearn.metrics import davies_bouldin_score
davies_bouldin_score(X_scaled, cluster_labels)
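The three scores above are often read side by side while varying the number of clusters. The following is a sketch of such a sweep, assuming KMeans as the clustering model and synthetic data in place of `X_scaled`; the range of `k` and the variable names are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for X_scaled (assumption).
X_demo, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=42)

rows = []
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    rows.append(
        {
            "k": k,
            "silhouette": silhouette_score(X_demo, labels),                # higher is better
            "calinski_harabasz": calinski_harabasz_score(X_demo, labels),  # higher is better
            "davies_bouldin": davies_bouldin_score(X_demo, labels),        # lower is better
        }
    )

# Candidate k according to each metric; disagreement is a cue to inspect further.
best_by_silhouette = max(rows, key=lambda r: r["silhouette"])["k"]
best_by_ch = max(rows, key=lambda r: r["calinski_harabasz"])["k"]
best_by_db = min(rows, key=lambda r: r["davies_bouldin"])["k"]
```

When the three candidates disagree, the metrics are favoring different cluster geometries, and descriptive profiling is usually the better tie-breaker.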
Within-Cluster Sum of Squares (WCSS)
Sum of squared distances from each sample to its assigned centroid. Used for the elbow method with KMeans.
inertia = model.inertia_ # For KMeans
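A minimal sketch of the elbow method follows: fit KMeans for a range of `k`, collect `inertia_`, and look at how much each extra cluster reduces WCSS. The synthetic data, the range of `k`, and the `relative_drop` heuristic are assumptions for illustration, not a prescribed procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for X_scaled (assumption).
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

k_values = list(range(1, 9))
wcss = []
for k in k_values:
    model_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    wcss.append(model_k.inertia_)

# Fractional reduction in WCSS when moving from k-1 to k clusters.
# The "elbow" is roughly where this reduction becomes small.
relative_drop = -np.diff(wcss) / np.array(wcss[:-1])
for k, drop in zip(k_values[1:], relative_drop):
    print(f"k={k}: WCSS reduced by {drop:.1%}")
```

WCSS keeps shrinking as `k` grows, so the absolute value is not meaningful on its own; only the shape of the curve is.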
Average Distance to Centroid
Custom distance-based dispersion metric.
import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min
# Samples go first, centroids second: one distance per sample to its nearest centroid
_, dists = pairwise_distances_argmin_min(X_scaled, model.cluster_centers_)
dists.mean()
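A single average can mask one very diffuse cluster. The sketch below breaks the distance to the assigned centroid down per cluster, assuming a fitted KMeans model; `X_demo` and `model_demo` are synthetic stand-ins for `X_scaled` and `model`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for X_scaled and a fitted model (assumption).
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=7)
model_demo = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X_demo)

# Euclidean distance from each sample to the centroid of its assigned cluster.
assigned_centroids = model_demo.cluster_centers_[model_demo.labels_]
dist_to_centroid = np.linalg.norm(X_demo - assigned_centroids, axis=1)

overall_mean_dist = float(dist_to_centroid.mean())
per_cluster_mean_dist = {
    c: float(dist_to_centroid[model_demo.labels_ == c].mean())
    for c in range(model_demo.n_clusters)
}
```

A cluster whose mean distance is far above the others is a candidate for splitting or for closer inspection of outliers.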
2. Descriptive Cluster Statistics (Per Group)
Summarize cluster structure using standard statistics.
import pandas as pd
X['cluster'] = cluster_labels
X.groupby('cluster').agg(['count', 'mean', 'std', 'median', 'min', 'max'])
Metric | Purpose |
---|---|
Mean / Std | Centrality and spread |
Min / Max | Range analysis |
Count | Cluster size distribution |
Median | Robust central tendency |
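The statistics in the table can be produced in one pass without modifying the original feature frame. This is a sketch on a tiny hand-made dataset; the column names `income` and `age` and the labels are invented for illustration.

```python
import pandas as pd

# Tiny illustrative dataset (assumption): two features, two clusters.
features = pd.DataFrame(
    {
        "income": [30, 32, 31, 80, 85, 90],
        "age": [25, 27, 26, 50, 52, 55],
    }
)
labels = [0, 0, 0, 1, 1, 1]

# assign() returns a copy, so `features` keeps its original columns.
profile = (
    features.assign(cluster=labels)
    .groupby("cluster")
    .agg(["count", "mean", "std", "median", "min", "max"])
)

# Flatten the (feature, statistic) column MultiIndex for easier export.
profile.columns = [f"{feature}_{stat}" for feature, stat in profile.columns]

print(profile[["income_count", "income_mean", "income_std"]])
```

Flattened column names such as `income_mean` tend to survive CSV export and reporting tools better than a two-level header.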
3. Suggested Summary Table Columns
- Cluster label
- Sample count
- Top 3 distinguishing features (Z-score or absolute mean diff)
- Feature summaries (mean, std)
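One way to assemble such a table is sketched below. It ranks features by the Z-score of each cluster mean relative to the global mean and standard deviation. The function `summarize_clusters`, its parameters, and the demo data are hypothetical; using absolute mean difference instead would only change the `z` line.

```python
import numpy as np
import pandas as pd


def summarize_clusters(features, labels, top_n=3):
    """Hypothetical helper: one row per cluster with size and top features."""
    labels = pd.Series(np.asarray(labels), index=features.index, name="cluster")
    global_mean = features.mean()
    # Constant features have zero spread; mark them NaN so they rank last.
    global_std = features.std(ddof=0).replace(0, np.nan)

    rows = []
    for cluster_id, group in features.groupby(labels):
        z = (group.mean() - global_mean) / global_std
        top = z.abs().sort_values(ascending=False).head(top_n).index
        rows.append(
            {
                "cluster": cluster_id,
                "n_samples": len(group),
                "top_features": ", ".join(f"{name} ({z[name]:+.2f})" for name in top),
            }
        )
    return pd.DataFrame(rows).set_index("cluster")


# Illustrative data (assumption): feature "a" separates the two clusters.
demo = pd.DataFrame(
    {"a": [0, 0, 10, 10], "b": [1, 2, 1, 2], "c": [5, 5, 5, 6], "d": [3, 3, 3, 3]}
)
summary = summarize_clusters(demo, [0, 0, 1, 1])
print(summary)
```

Per-feature means and standard deviations can be joined onto this frame from a grouped aggregation when the full summary is needed.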
4. Reproducibility & Delta Tracking
- Save centroids or medoids for future comparisons
- Compare feature distributions over time or across model versions
- Track the silhouette score or Calinski-Harabasz index across re-runs
np.save('cluster_centroids_v1.npy', model.cluster_centers_)
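Cluster labels are arbitrary, so centroids from a later run have to be matched to the saved ones before any comparison. The sketch below pairs them by minimum total distance using SciPy's `linear_sum_assignment`. The helper `match_centroids`, the toy centroid values, and the temporary file location are assumptions, and the approach presumes both runs have the same number of clusters.

```python
import os
import tempfile

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def match_centroids(old_centroids, new_centroids):
    """Hypothetical helper: pair old and new centroids by minimum total distance."""
    cost = cdist(old_centroids, new_centroids)  # pairwise Euclidean distances
    old_idx, new_idx = linear_sum_assignment(cost)
    return new_idx, cost[old_idx, new_idx]


# Toy centroids (assumption): the new run found the same clusters in a different order.
old = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
new = np.array([[10.2, 0.1], [0.1, -0.1], [5.0, 5.3]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cluster_centroids_v1.npy")
    np.save(path, old)        # persist the reference run
    reference = np.load(path)  # reload it at comparison time

# mapping[i] is the new-run cluster that corresponds to old cluster i.
mapping, shifts = match_centroids(reference, new)
for old_id, (new_id, shift) in enumerate(zip(mapping, shifts)):
    print(f"old cluster {old_id} -> new cluster {new_id}, centroid shift {shift:.3f}")
```

Large shifts after matching suggest genuine drift in cluster structure rather than a simple relabeling.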
5. Bonus: Cluster Quality Tiers (Silhouette Guidelines)
Score Range | Interpretation |
---|---|
0.70–1.00 | Strong structure |
0.50–0.70 | Reasonable structure |
0.25–0.50 | Weak structure |
< 0.25 | Overlapping or noise-driven |
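For automated reports, the tiers can be encoded as a small lookup. This is a sketch: the function name `silhouette_tier` is hypothetical, and because the ranges in the table share their endpoints, it assumes a boundary value (for example exactly 0.50) belongs to the higher tier.

```python
def silhouette_tier(score):
    """Hypothetical helper: map a silhouette score to the guideline tiers."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("silhouette scores lie in [-1, 1]")
    # Assumption: shared boundaries (0.70, 0.50, 0.25) fall into the higher tier.
    if score >= 0.70:
        return "Strong structure"
    if score >= 0.50:
        return "Reasonable structure"
    if score >= 0.25:
        return "Weak structure"
    return "Overlapping or noise-driven"


for value in (0.82, 0.50, 0.31, -0.05):
    print(f"{value:+.2f}: {silhouette_tier(value)}")
```

These thresholds are rules of thumb; the achievable score depends on dimensionality and the distance metric, so the tier is best reported next to the raw value.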
Reporting Tip
Pair statistical summaries with visual diagnostics (UMAP, silhouette plots, radar charts) for maximum interpretability.
Use this with: Clustering Visual Guide, Checklist, and Decision Card.