KMeans


🎯 Purpose

This QuickRef explains how to use KMeans clustering, a foundational unsupervised learning algorithm that groups data into k clusters based on similarity.


📦 1. When to Use

| Condition | Use KMeans? |
|---|---|
| You want to segment data by similarity | ✅ Yes |
| Clusters are spherical, evenly sized | ✅ Yes |
| Features are numeric and scaled | ✅ Yes |
| Irregularly shaped clusters exist | ❌ No, use DBSCAN or HDBSCAN |
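
To make the last row concrete, here is a minimal sketch on synthetic two-moons data. The dataset, the `eps` and `min_samples` values, and the use of the adjusted Rand index (ARI) as a yardstick are assumptions chosen for illustration, not recommendations from this QuickRef.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Synthetic data: two interleaved half-circles (non-spherical clusters).
X_moons, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

# KMeans splits the plane with a straight boundary, which cuts across both moons.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# DBSCAN follows density, so it can trace curved shapes.
# eps and min_samples are illustrative values tuned by hand for this toy data.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

# ARI compares each labeling with the known moon membership (1.0 = perfect match).
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"KMeans ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI: {ari_dbscan:.2f}")
```

On real data there are usually no ground-truth labels, so this kind of comparison is only possible on toy examples.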

🧮 2. Core Logic

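  • Initializes k centroids (at random, or with k-means++ seeding, which is scikit-learn's default)
  • Assigns each point to its closest centroid
  • Recomputes each centroid as the mean of its assigned points, repeating both steps until convergence
  • Minimizes within-cluster sum of squares (WCSS)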
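
The steps above can be sketched in plain NumPy. This is a simplified illustration with random initialization, not scikit-learn's implementation; the function name `lloyd_kmeans`, the tolerance, and the toy data are all assumptions.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=300, tol=1e-4, seed=0):
    """Illustrative Lloyd iteration with random initialization."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        # (squared Euclidean distance, shape (n_samples, k)).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:  # centroids stopped moving: converged
            break
    # Final assignment and the objective being minimized (WCSS).
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    wcss = dists[np.arange(len(X)), labels].sum()
    return labels, centroids, wcss

# Toy data: two well-separated Gaussian blobs of 50 points each.
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
])
labels, centroids, wcss = lloyd_kmeans(X_demo, k=2)
print(centroids, wcss)
```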

🛠️ 3. Fitting in sklearn

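from sklearn.cluster import KMeans

# X: numeric array of shape (n_samples, n_features), scaled beforehand
model = KMeans(n_clusters=3, random_state=42)
model.fit(X)
labels = model.labels_               # cluster index assigned to each row of X
centroids = model.cluster_centers_   # array of shape (n_clusters, n_features)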

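✔️ Scale features before fitting (StandardScaler or MinMaxScaler): KMeans relies on Euclidean distance, so features with larger numeric ranges would otherwise dominate the clustering.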
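
One way to keep scaling and clustering together is a pipeline. The sketch below assumes hypothetical data with two features on very different scales; the variable names and the generated values are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: feature 0 in the tens, feature 1 in the tens of thousands.
rng = np.random.default_rng(0)
X_raw = np.column_stack([
    rng.normal(40, 10, size=200),
    rng.normal(60_000, 15_000, size=200),
])

# The scaler is fitted first, then KMeans runs on the standardized features.
pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels_scaled = pipe.fit_predict(X_raw)

# Centroids live in scaled space; map them back to original units for reporting.
scaler = pipe.named_steps["standardscaler"]
kmeans = pipe.named_steps["kmeans"]
centroids_original = scaler.inverse_transform(kmeans.cluster_centers_)
print(centroids_original)
```

Without the scaler, the second feature would contribute almost all of the distance and the first would be effectively ignored.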


🔧 4. Key Hyperparameters

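| Param | Description |
|---|---|
| n_clusters | Number of clusters (k) |
| init | Initialization method ('k-means++' recommended; it is the default) |
| n_init | Number of runs with different centroid seeds; the lowest-inertia run is kept (default = 'auto' since scikit-learn 1.4: 1 run for 'k-means++', 10 for 'random'; 10 in earlier versions) |
| max_iter | Max number of iterations per run (default = 300) |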
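
A short sketch with every parameter from the table spelled out. The blob data is synthetic and the values are illustrative; passing `n_init` explicitly is one way to avoid depending on the version-specific default.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four blobs, for illustration only.
X_blobs, _ = make_blobs(n_samples=300, centers=4, random_state=0)

km = KMeans(
    n_clusters=4,        # k
    init="k-means++",    # spread-out initial centroids
    n_init=10,           # run 10 times, keep the lowest-inertia result
    max_iter=300,        # cap on iterations per run
    random_state=42,     # reproducible seeding
)
km.fit(X_blobs)

print("inertia (WCSS):", km.inertia_)
print("iterations used by the best run:", km.n_iter_)
```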

📊 5. Evaluating Clusters

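| Metric | How to Read It |
|---|---|
| Elbow Method | Plot inertia vs k and look for the point where the curve flattens |
| Silhouette Score | Measures cohesion and separation; ranges from -1 to 1 (closer to 1 = better) |
| Davies-Bouldin | Lower = better clustering separation (minimum is 0) |

from sklearn.metrics import davies_bouldin_score, silhouette_score
score = silhouette_score(X, model.labels_)          # in [-1, 1]; closer to 1 is better
db_index = davies_bouldin_score(X, model.labels_)   # >= 0; lower is better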
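
To compare candidate values of k, the three metrics can be collected in one loop. This is a sketch on synthetic blobs with hand-placed centers (an assumption made so that the expected answer is known); the range of k is also arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data: four well-separated blobs at hand-picked centers.
X_eval, _ = make_blobs(
    n_samples=400,
    centers=[(-5, -5), (-5, 5), (5, -5), (5, 5)],
    cluster_std=0.6,
    random_state=0,
)

results = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_eval)
    results.append({
        "k": k,
        "inertia": km.inertia_,                                   # for the elbow plot
        "silhouette": silhouette_score(X_eval, km.labels_),       # higher is better
        "davies_bouldin": davies_bouldin_score(X_eval, km.labels_),  # lower is better
    })

for row in results:
    print(f"k={row['k']}  inertia={row['inertia']:.1f}  "
          f"silhouette={row['silhouette']:.3f}  DB={row['davies_bouldin']:.3f}")

# Pick the k with the highest silhouette score.
best_k = max(results, key=lambda r: r["silhouette"])["k"]
print("best k by silhouette:", best_k)
```

The metrics can disagree on messier data, so it helps to look at all three rather than trusting a single number.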

⚠️ 6. Limitations

  • Assumes spherical clusters of equal variance
  • Sensitive to initialization and outliers
  • Requires k to be chosen in advance (the algorithm does not infer it from the data)

✅ Checklist

  • [ ] Data scaled appropriately
  • [ ] k chosen via Elbow or Silhouette method
  • [ ] Initialization method set to k-means++
  • [ ] Labels interpreted and assigned post-fit
  • [ ] Visualizations used to confirm structure (e.g. PCA plot)
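
For the last two checklist items, a minimal sketch of preparing a PCA view: project the scaled data and the fitted centroids to two dimensions, then count cluster sizes. The six-feature synthetic dataset is an assumption; plotting is left out, and the resulting 2D arrays can be passed to any plotting library.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 6-dimensional data, for illustration only.
X_hd, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=0)
X_scaled = StandardScaler().fit_transform(X_hd)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# Fit PCA on the same scaled data the model saw, then project the centroids
# with the same transform so points and centroids share one 2D space.
pca = PCA(n_components=2, random_state=0)
points_2d = pca.fit_transform(X_scaled)
centroids_2d = pca.transform(km.cluster_centers_)

# Cluster sizes are a quick first step toward interpreting the labels.
cluster_sizes = np.bincount(km.labels_, minlength=3)
print("cluster sizes:", cluster_sizes)
print("variance explained by 2 components:", pca.explained_variance_ratio_.sum())
```

A scatter of `points_2d` colored by `km.labels_`, with `centroids_2d` overlaid, gives the PCA plot mentioned in the checklist.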

💡 Tip

“KMeans is simple, scalable, and fast, but it’s only as smart as your choice of k.”