# DBSCAN

## 🎯 Purpose
This QuickRef explains how to use DBSCAN (Density-Based Spatial Clustering of Applications with Noise) — an unsupervised algorithm that groups data based on point density.
## 📦 1. When to Use

| Condition | Use DBSCAN? |
|---|---|
| Clusters are non-spherical or of uneven size | ✅ Yes |
| You want to find arbitrary-shaped clusters | ✅ Yes |
| You want to detect noise/outliers | ✅ Yes |
| Dataset is large and high-dimensional | ❌ Can be slow (use HDBSCAN) |
## 🧮 2. Core Logic

- Groups together points that are closely packed (a sketch of the density test follows this list)
- Labels points in low-density regions as outliers
- Requires no predefined number of clusters
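
As a rough illustration of that density test, the snippet below (plain NumPy, with a hypothetical five-point toy dataset and arbitrary `eps`/`min_samples` values) checks whether a single point qualifies as a core point:

```python
import numpy as np

# Hypothetical toy data: four tightly packed points and one far-away point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])

eps, min_samples = 0.5, 4  # illustrative values only

# Count how many points fall within eps of point 0
# (sklearn's convention counts the point itself as one of its neighbors)
dists = np.linalg.norm(X - X[0], axis=1)
n_neighbors = int(np.sum(dists <= eps))

# A point with at least min_samples neighbors within eps is a core point;
# clusters grow by chaining core points, and unreachable points become noise
print("core point" if n_neighbors >= min_samples else "not a core point")
```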
## 🛠️ 3. Fitting in sklearn

```python
from sklearn.cluster import DBSCAN

model = DBSCAN(eps=0.5, min_samples=5)
model.fit(X)
labels = model.labels_  # -1 = noise points
```

✔️ Scale features before use (`StandardScaler` or `MinMaxScaler`).
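
Putting it together, here is a minimal end-to-end sketch; the `make_moons` data, `eps=0.3`, and `min_samples=5` are illustrative choices, not prescriptions from this QuickRef:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Example data: two interleaving half-moons (arbitrary-shaped clusters)
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

# Scale features so that eps means the same thing in every dimension
X_scaled = StandardScaler().fit_transform(X)

model = DBSCAN(eps=0.3, min_samples=5)
labels = model.fit_predict(X_scaled)

# Label -1 marks noise; every other label is a cluster id
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```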
## 🔧 4. Key Hyperparameters

| Param | Description |
|---|---|
| `eps` | Maximum distance between two points for one to count as a neighbor of the other |
| `min_samples` | Minimum number of neighbors required to form a dense region (core point) |
| `metric` | Distance function (default = Euclidean) |
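
The checklist below mentions the k-distance plot; a common heuristic for estimating `eps`, sketched here assuming the `X_scaled` array from the fitting example above, is to plot each point's distance to its k-th nearest neighbor and read `eps` off the "elbow" of the sorted curve:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 5  # often set to min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)

# Sort the k-th nearest-neighbor distances; the bend ("elbow") in this
# curve is a common starting value for eps
k_distances = np.sort(distances[:, -1])
plt.plot(k_distances)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}-th nearest neighbor")
plt.show()
```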
## 📊 5. Evaluating Clusters

| Metric | Use When... |
|---|---|
| Silhouette Score | There are ≥ 2 clusters (noise points are usually excluded first) |
| Number of noise points (`label == -1`) | Helps assess purity vs. over-segmentation |
| Visual inspection | PCA, t-SNE, or UMAP projections for visual validation |
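
A short sketch of how these checks might look in code, assuming `X_scaled` and `labels` from the fitting example above; noise points are excluded before scoring:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Silhouette is only defined for >= 2 clusters; noise points are dropped here
mask = labels != -1
if len(set(labels[mask])) >= 2:
    print("silhouette (noise excluded):",
          silhouette_score(X_scaled[mask], labels[mask]))

print("noise points:", int(np.sum(labels == -1)))
```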
## ⚠️ 6. Limitations

- Can be sensitive to the choice of `eps`
- Performance drops in high dimensions
- May fail if density varies too much between clusters (see the sketch after this list)
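
One way to probe the varying-density limitation is a quick experiment like the sketch below: two synthetic blobs with very different spreads (all parameters are made up for illustration), clustered with a few candidate `eps` values to see how the cluster and noise counts shift:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two blobs with very different spreads: a single global eps struggles
# to suit both the tight and the loose cluster at the same time
X_var, _ = make_blobs(n_samples=[200, 200], centers=[[0, 0], [6, 0]],
                      cluster_std=[0.2, 1.5], random_state=0)

for eps in (0.3, 0.6, 1.0):
    labels_var = DBSCAN(eps=eps, min_samples=5).fit_predict(X_var)
    n_clusters = len(set(labels_var)) - (1 if -1 in labels_var else 0)
    n_noise = int(np.sum(labels_var == -1))
    print(f"eps={eps}: clusters={n_clusters}, noise={n_noise}")
```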
## ✅ Checklist

- [ ] Features scaled to a uniform range
- [ ] `eps` and `min_samples` tuned or estimated with a k-distance plot
- [ ] Outliers reviewed (`label == -1`)
- [ ] Dimensionality reduction used for visualization if needed
- [ ] Cluster evaluation interpreted contextually (not just scores)
## 💡 Tip
“If KMeans needs structure, DBSCAN thrives in chaos — and outliers are part of the story.”