Statistical Summary


🎯 Purpose

This reference provides statistical tools and interpretation guidelines to evaluate clustering models using both internal metrics and descriptive summaries. It supports unsupervised model inspection, profiling, and stakeholder-ready reporting.


📏 1. Internal Validation Metrics (No Ground Truth Needed)

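✅ Silhouette Score

Measures how close each sample is to its own cluster (cohesion) compared with the nearest other cluster (separation). Scores range from -1 to 1:

  • Close to 1 = compact, well-separated clusters
  • Around 0 = overlapping clusters; samples sit near a boundary
  • Below 0 = samples are likely assigned to the wrong cluster

from sklearn.metrics import silhouette_score
# X_scaled: scaled feature matrix; cluster_labels: one label per row
silhouette_score(X_scaled, cluster_labels)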
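
A single overall score can hide one poorly formed cluster. The following is a minimal sketch, assuming synthetic data from `make_blobs` and a KMeans model (both stand-ins, not part of the original reference), that breaks the silhouette down per cluster with `silhouette_samples` and counts negative values as candidate misassignments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three Gaussian blobs.
X_raw, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
X_scaled = StandardScaler().fit_transform(X_raw)

cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Overall score is the mean of the per-sample silhouette values.
overall = silhouette_score(X_scaled, cluster_labels)
per_sample = silhouette_samples(X_scaled, cluster_labels)

print(f"overall silhouette: {overall:.3f}")
for k in np.unique(cluster_labels):
    vals = per_sample[cluster_labels == k]
    # Negative values flag samples that sit closer to another cluster.
    print(f"cluster {k}: n={vals.size}, mean={vals.mean():.3f}, negative={(vals < 0).sum()}")
```

A cluster whose mean sits well below the overall score, or that holds most of the negative values, is usually the first one worth inspecting.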

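✅ Calinski-Harabasz Index

Ratio of between-cluster dispersion to within-cluster dispersion, each normalized by its degrees of freedom:

  • Higher is better (denser, better-separated clusters)
  • The score has no fixed upper bound, so compare it only across runs on the same data

from sklearn.metrics import calinski_harabasz_score
calinski_harabasz_score(X_scaled, cluster_labels)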
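
To make the ratio concrete, here is a sketch that recomputes the index by hand as (B / (k - 1)) / (W / (n - k)), where B is the between-cluster sum of squares, W the within-cluster sum of squares, k the number of clusters, and n the number of samples, and then compares it with scikit-learn's value. The data and the names `X_demo` and `labels_demo` are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic stand-in data (assumption).
X_demo, _ = make_blobs(n_samples=200, centers=4, random_state=1)
labels_demo = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X_demo)

n_samples = X_demo.shape[0]
cluster_ids = np.unique(labels_demo)
k = len(cluster_ids)
overall_mean = X_demo.mean(axis=0)

between = 0.0  # B: size-weighted squared distance of centroids to the overall mean
within = 0.0   # W: squared distance of samples to their own centroid
for c in cluster_ids:
    members = X_demo[labels_demo == c]
    centroid = members.mean(axis=0)
    between += len(members) * np.sum((centroid - overall_mean) ** 2)
    within += np.sum((members - centroid) ** 2)

ch_manual = (between / (k - 1)) / (within / (n_samples - k))
ch_sklearn = calinski_harabasz_score(X_demo, labels_demo)
print(f"manual={ch_manual:.2f}  sklearn={ch_sklearn:.2f}")
```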

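✅ Davies-Bouldin Index

Average similarity between each cluster and its most similar neighbor, where similarity is the ratio of within-cluster scatter to the distance between centroids:

  • Lower is better; 0 is the minimum

from sklearn.metrics import davies_bouldin_score
davies_bouldin_score(X_scaled, cluster_labels)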
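
The three scores above are most useful side by side when choosing the number of clusters. The sketch below, which assumes synthetic `make_blobs` data and KMeans as the clustering model, sweeps a few candidate values of k and collects all three metrics per run.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in data (assumption).
X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=42)

results = []
for k in range(2, 8):  # all three metrics need at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    results.append({
        "k": k,
        "silhouette": silhouette_score(X_demo, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(X_demo, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X_demo, labels),        # lower is better
    })

for row in results:
    print(
        f"k={row['k']}  sil={row['silhouette']:.3f}  "
        f"CH={row['calinski_harabasz']:.1f}  DB={row['davies_bouldin']:.3f}"
    )
```

The metrics do not always agree on the best k; when they disagree, treat that as a prompt to inspect the candidate solutions rather than as an error.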

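✅ Within-Cluster Sum of Squares (WCSS)

Sum of squared distances from each sample to its assigned centroid. Plot it against the number of clusters to apply the elbow method with KMeans.

inertia = model.inertia_  # a fitted scikit-learn KMeans exposes WCSS as inertia_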
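
As a sketch of the elbow method, assuming synthetic data and a k range of 1 to 8 (both illustrative choices), the loop below records `inertia_` for each k and then recomputes WCSS by hand for the last fit to show what the attribute measures.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption).
X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = list(range(1, 9))
wcss = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    wcss.append(model.inertia_)

# Recompute WCSS for the last fitted model from its labels and centroids.
assigned = model.cluster_centers_[model.labels_]
wcss_manual = float(np.sum((X_demo - assigned) ** 2))

# Relative drop in WCSS when moving from k-1 to k; the "elbow" is where
# these drops flatten out.
for i in range(1, len(ks)):
    drop = (wcss[i - 1] - wcss[i]) / wcss[i - 1]
    print(f"k={ks[i]}: WCSS={wcss[i]:.1f}  relative drop={drop:.1%}")
```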

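✅ Average Distance to Centroid

Mean Euclidean distance from each sample to its nearest centroid; a dispersion measure expressed in the same units as the scaled features.

import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min
# Argument order matters: samples first, then centroids, so that
# dists holds one nearest-centroid distance per sample.
_, dists = pairwise_distances_argmin_min(X_scaled, model.cluster_centers_)
dists.mean()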
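
The same idea can be broken down per cluster. This sketch, assuming a fitted KMeans model on synthetic data, measures each sample's distance to the centroid of the cluster it was assigned to and averages within clusters; `avg_dist_per_cluster` is an illustrative name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption).
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=7)
model = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X_demo)

# Look up the assigned centroid for every sample, then take row-wise norms.
assigned_centroids = model.cluster_centers_[model.labels_]
dists = np.linalg.norm(X_demo - assigned_centroids, axis=1)

avg_dist_overall = float(dists.mean())
avg_dist_per_cluster = {
    int(c): float(dists[model.labels_ == c].mean())
    for c in np.unique(model.labels_)
}

print(f"overall: {avg_dist_overall:.3f}")
for c, d in avg_dist_per_cluster.items():
    print(f"cluster {c}: {d:.3f}")
```

A cluster with a much larger average distance than the others is more diffuse and may be a candidate for splitting.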

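📊 2. Descriptive Cluster Statistics (Per Group)

Summarize each cluster's feature distributions with standard statistics.

import pandas as pd
# X is the feature DataFrame; assign() returns a copy, so the label
# column does not leak into the features used for clustering.
profile = X.assign(cluster=cluster_labels)
profile.groupby('cluster').agg(['count', 'mean', 'std', 'median', 'min', 'max'])

| Metric | Purpose |
|---|---|
| Mean / Std | Centrality and spread |
| Min / Max | Range analysis |
| Count | Cluster size distribution |
| Median | Robust central tendency |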

📋 3. Suggested Summary Table Columns

  • Cluster label
  • Sample count
  • Top 3 distinguishing features (ranked by the z-score of the cluster mean against the overall mean, or by absolute mean difference)
  • Feature summaries (mean, std)
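
One way to assemble these columns is sketched below. The feature names, the synthetic data, and the helper `top_features` are hypothetical; the ranking uses the z-score of each cluster mean against the overall mean.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table: two groups that differ on three of four features.
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    "age": np.concatenate([rng.normal(30, 3, n), rng.normal(55, 3, n)]),
    "income": np.concatenate([rng.normal(40, 5, n), rng.normal(90, 5, n)]),
    "visits": np.concatenate([rng.normal(12, 2, n), rng.normal(3, 2, n)]),
    "tenure": rng.normal(5, 1, 2 * n),  # same distribution in both groups
})
cluster_labels = np.repeat([0, 1], n)  # stand-in for model output

grouped = df.groupby(cluster_labels)
cluster_means = grouped.mean()

# How far each cluster mean sits from the overall mean, in overall std units.
z_scores = (cluster_means - df.mean()) / df.std()

def top_features(row, n_top=3):
    """Return the n_top feature names with the largest absolute z-score."""
    return ", ".join(row.abs().nlargest(n_top).index)

summary = pd.DataFrame({
    "n_samples": grouped.size(),
    "top_3_features": z_scores.apply(top_features, axis=1),
})
summary = summary.join(cluster_means.add_suffix("_mean"))
summary = summary.join(grouped.std().add_suffix("_std"))
summary.index.name = "cluster"
print(summary)
```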

🔁 4. Reproducibility & Delta Tracking

  • Save centroids or medoids for future comparisons
  • Compare feature distributions across time periods or model versions
  • Track the silhouette score or Calinski-Harabasz index across re-runs

import numpy as np
np.save('cluster_centroids_v1.npy', model.cluster_centers_)  # version the filename per run
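
Cluster indices are arbitrary and can change between runs, so saved centroids need to be matched before they are compared. The sketch below is one possible approach, not a prescribed method: it pairs old and new centroids with SciPy's `linear_sum_assignment` on a Euclidean distance matrix and reports the shift per matched pair. The perturbed second dataset and the temporary file path are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Version 1 data, and a slightly perturbed copy standing in for a later run.
X_v1, _ = make_blobs(n_samples=300, centers=3, random_state=3)
X_v2 = X_v1 + np.random.default_rng(3).normal(scale=0.05, size=X_v1.shape)

model_v1 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_v1)
model_v2 = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_v2)

# Save and reload the v1 centroids, as a later session would.
path = os.path.join(tempfile.mkdtemp(), "cluster_centroids_v1.npy")
np.save(path, model_v1.cluster_centers_)
old_centroids = np.load(path)

# Pair old and new centroids so that the total distance is minimal.
cost = cdist(old_centroids, model_v2.cluster_centers_)
old_idx, new_idx = linear_sum_assignment(cost)
shifts = cost[old_idx, new_idx]

for o, m, s in zip(old_idx, new_idx, shifts):
    print(f"old cluster {o} -> new cluster {m}: centroid shift {s:.3f}")
```

Large shifts for one matched pair, with small shifts elsewhere, suggest that a specific segment has drifted rather than the whole solution.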

🧪 5. Bonus: Cluster Quality Tiers (Silhouette Guidelines)

| Score Range | Interpretation |
|---|---|
| 0.70–1.00 | Strong structure |
| 0.50–0.70 | Reasonable structure |
| 0.25–0.50 | Weak structure |
| < 0.25 | Overlapping or noise-driven |
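
For automated reports, the tiers can be encoded as a small lookup. This is a sketch with a hypothetical helper, `silhouette_tier`; it assumes that a score falling exactly on a shared boundary (0.70, 0.50, 0.25) belongs to the higher tier, which the table itself leaves open.

```python
def silhouette_tier(score: float) -> str:
    """Map a silhouette score to the guideline tiers listed above."""
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"silhouette score must be in [-1, 1], got {score}")
    # Assumption: boundary values are assigned to the higher tier.
    if score >= 0.70:
        return "Strong structure"
    if score >= 0.50:
        return "Reasonable structure"
    if score >= 0.25:
        return "Weak structure"
    return "Overlapping or noise-driven"


for s in (0.82, 0.55, 0.31, 0.10):
    print(f"{s:.2f}: {silhouette_tier(s)}")
```

These thresholds are rules of thumb; the score also depends on dimensionality and the distance metric, so it is best read alongside the other diagnostics in this reference.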

📦 Reporting Tip

Pair statistical summaries with visual diagnostics (UMAP, silhouette plots, radar charts) for maximum interpretability.

Use this with: Clustering Visual Guide, Checklist, and Decision Card.