Skip to content

DBSCAN


🎯 Purpose

This QuickRef explains how to use DBSCAN (Density-Based Spatial Clustering of Applications with Noise) — an unsupervised algorithm that groups data based on point density.


📦 1. When to Use

Condition Use DBSCAN?
Clusters are non-spherical or uneven sizes ✅ Yes
You want to find arbitrary-shaped clusters ✅ Yes
You want to detect noise/outliers ✅ Yes
Dataset is large and high-dimensional ❌ Can be slow (use HDBSCAN)

🧮 2. Core Logic

  • Groups together points that are closely packed
  • Labels points in low-density regions as outliers
  • Requires no predefined number of clusters

🛠️ 3. Fitting in sklearn

from sklearn.cluster import DBSCAN
model = DBSCAN(eps=0.5, min_samples=5)
model.fit(X)
labels = model.labels_  # -1 = noise points

✔️ Scale features before use (StandardScaler or MinMaxScaler)


🔧 4. Key Hyperparameters

Param Description
eps Max distance for neighbors to be considered part of the same cluster
min_samples Min number of neighbors required to form a dense region
metric Distance function (default = Euclidean)

📊 5. Evaluating Clusters

Metric Use When...
Silhouette Score Works if labels are ≥ 2 clusters
Number of Noise Points (label = -1) Helps assess purity vs over-segmentation
Visual Inspection PCA, t-SNE, UMAP visual validation

⚠️ 6. Limitations

  • Can be sensitive to eps selection
  • Performance drops in high dimensions
  • May fail if density varies too much between clusters

✅ Checklist

  • [ ] Features scaled to uniform range
  • [ ] eps and min_samples tuned or estimated with k-distance plot
  • [ ] Outliers reviewed (label == -1)
  • [ ] Dimensionality reduction used for visualization if needed
  • [ ] Cluster evaluation interpreted contextually (not just scores)

💡 Tip

“If KMeans needs structure, DBSCAN thrives in chaos — and outliers are part of the story.”