# KMeans
## 🎯 Purpose

This QuickRef explains how to use KMeans clustering, a foundational unsupervised learning algorithm that groups data into k clusters based on similarity.
## 📦 1. When to Use

| Condition | Use KMeans? |
|---|---|
| You want to segment data by similarity | ✅ Yes |
| Clusters are spherical and evenly sized | ✅ Yes |
| Features are numeric and scaled | ✅ Yes |
| Irregularly shaped clusters exist | ❌ Use DBSCAN or HDBSCAN |
## 🧮 2. Core Logic

- Randomly initializes k centroids
- Iteratively assigns points to the closest centroid
- Updates centroids until convergence
- Minimizes within-cluster sum of squares (WCSS)
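The loop above can be sketched in plain NumPy (a minimal illustration of the algorithm, not the sklearn implementation; the function name `kmeans` is ours):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=42):
    """Minimal KMeans sketch: random init, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of its assigned points;
        # this step is what shrinks the within-cluster sum of squares
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids
```

This sketch omits the empty-cluster handling and k-means++ seeding that production implementations include.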
## 🛠️ 3. Fitting in sklearn

```python
from sklearn.cluster import KMeans

# X: numeric, scaled feature matrix of shape (n_samples, n_features)
model = KMeans(n_clusters=3, random_state=42)
model.fit(X)

labels = model.labels_              # cluster index per sample
centroids = model.cluster_centers_  # array of shape (n_clusters, n_features)
```
> ✔️ Always scale data before fitting (`StandardScaler` or `MinMaxScaler`).
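For example, standardizing before fitting (a sketch; `make_blobs` stands in here for your own feature matrix):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Sample data; replace with your own numeric feature matrix
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

# Standardize each feature to zero mean and unit variance so that
# no single feature dominates the Euclidean distance
X_scaled = StandardScaler().fit_transform(X)

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)
```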
## 🔧 4. Key Hyperparameters

| Param | Description |
|---|---|
| `n_clusters` | Number of clusters (k) |
| `init` | Initialization method (`'k-means++'` recommended) |
| `n_init` | Number of runs with different seeds; the best model is kept (default = 10) |
| `max_iter` | Max number of iterations per run (default = 300) |
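All four parameters set explicitly (the values shown are just the defaults and recommendations from the table above):

```python
from sklearn.cluster import KMeans

model = KMeans(
    n_clusters=3,       # k: number of clusters to form
    init='k-means++',   # smarter seeding than purely random centroids
    n_init=10,          # run 10 times, keep the lowest-inertia result
    max_iter=300,       # cap on iterations per run
    random_state=42,    # reproducible initialization
)
```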
## 📊 5. Evaluating Clusters

| Metric | Use When... |
|---|---|
| Elbow Method | Plot inertia vs. k and pick the point where the curve bends |
| Silhouette Score | Measures cohesion and separation (closer to 1 = better) |
| Davies-Bouldin | Lower = better cluster separation |
```python
from sklearn.metrics import silhouette_score

# Ranges from -1 to 1; higher means tighter, better-separated clusters
score = silhouette_score(X, model.labels_)
```
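A sketch of sweeping candidate k values for the Elbow and Silhouette methods (`make_blobs` stands in for your data; in practice you would plot `inertias` and look for the bend):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Sample data with 4 true clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_                       # WCSS, for the elbow plot
    silhouettes[k] = silhouette_score(X, km.labels_)

# Silhouette gives a direct numeric criterion: pick the k that maximizes it
best_k = max(silhouettes, key=silhouettes.get)
```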
## ⚠️ 6. Limitations
- Assumes spherical clusters of equal variance
- Sensitive to initialization and outliers
- Requires predefined k (not data-driven)
## ✅ Checklist

- [ ] Data scaled appropriately
- [ ] k chosen via Elbow or Silhouette method
- [ ] Initialization method set to `k-means++`
- [ ] Labels interpreted and assigned post-fit
- [ ] Visualizations used to confirm structure (e.g. PCA plot)
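One way to do that last check: project the features to 2D with PCA and color the points by cluster label (a sketch; the output filename `kmeans_pca.png` is our choice):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; we save to a file instead of showing
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Sample 5-dimensional data; replace with your own feature matrix
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Project to the first two principal components for plotting
X_2d = PCA(n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis', s=15)
plt.title('KMeans clusters in PCA space')
plt.savefig('kmeans_pca.png')
```

Well-separated, compact blobs of color suggest the clustering reflects real structure; heavy overlap suggests revisiting k or the feature scaling.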
## 💡 Tip

> "KMeans is simple, scalable, and fast, but it's only as smart as your choice of k."