DBSCAN
🎯 Purpose
This QuickRef explains how to use DBSCAN (Density-Based Spatial Clustering of Applications with Noise) — an unsupervised algorithm that groups data based on point density.
📦 1. When to Use¶
| Condition | Use DBSCAN? |
|---|---|
| Clusters are non-spherical or uneven sizes | ✅ Yes |
| You want to find arbitrary-shaped clusters | ✅ Yes |
| You want to detect noise/outliers | ✅ Yes |
| Dataset is large and high-dimensional | ❌ Can be slow (use HDBSCAN) |
🧮 2. Core Logic¶
- Groups together points that are closely packed
- Labels points in low-density regions as outliers
- Requires no predefined number of clusters
🛠️ 3. Fitting in sklearn¶
from sklearn.cluster import DBSCAN
model = DBSCAN(eps=0.5, min_samples=5)
model.fit(X)
labels = model.labels_ # -1 = noise points
✔️ Scale features before use (StandardScaler or MinMaxScaler)
🔧 4. Key Hyperparameters¶
| Param | Description |
|---|---|
eps |
Max distance for neighbors to be considered part of the same cluster |
min_samples |
Min number of neighbors required to form a dense region |
metric |
Distance function (default = Euclidean) |
📊 5. Evaluating Clusters¶
| Metric | Use When... |
|---|---|
| Silhouette Score | Works if labels are ≥ 2 clusters |
Number of Noise Points (label = -1) |
Helps assess purity vs over-segmentation |
| Visual Inspection | PCA, t-SNE, UMAP visual validation |
⚠️ 6. Limitations¶
- Can be sensitive to
epsselection - Performance drops in high dimensions
- May fail if density varies too much between clusters
✅ Checklist¶
- [ ] Features scaled to uniform range
- [ ]
epsandmin_samplestuned or estimated with k-distance plot - [ ] Outliers reviewed (
label == -1) - [ ] Dimensionality reduction used for visualization if needed
- [ ] Cluster evaluation interpreted contextually (not just scores)
💡 Tip¶
“If KMeans needs structure, DBSCAN thrives in chaos — and outliers are part of the story.”