HDBSCAN
🎯 Purpose¶
This QuickRef explains how to use HDBSCAN, a powerful clustering algorithm that extends DBSCAN to handle variable-density clusters and scale to larger datasets.
📦 1. When to Use¶
| Condition | Use HDBSCAN? |
|---|---|
| You want density-based clustering | ✅ Yes |
| DBSCAN fails due to varying densities | ✅ Yes |
| You want soft clustering (membership strength) | ✅ Yes |
| You want speed + scalability | ✅ Yes |
| Your data is small/simple | ❌ Use DBSCAN or KMeans |
🧮 2. Core Logic¶
- Builds a hierarchy of density-based clusters
- Condenses it into the most stable clusters
- Assigns probability of membership to each point (soft clustering)
- Automatically determines the number of clusters (no `k` needed; see the sketch below)
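A minimal sketch of the last two points, assuming the third-party `hdbscan` package and synthetic blobs of two very different densities (all values here are illustrative):

```python
import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

# Two groups with very different spreads (densities)
X_dense, _ = make_blobs(n_samples=300, centers=[[0, 0]], cluster_std=0.3, random_state=42)
X_sparse, _ = make_blobs(n_samples=300, centers=[[5, 5]], cluster_std=1.5, random_state=42)
X = np.vstack([X_dense, X_sparse])

model = hdbscan.HDBSCAN(min_cluster_size=15).fit(X)

# The cluster count is inferred, not supplied; -1 marks noise
n_clusters = model.labels_.max() + 1
print("clusters found:", n_clusters)
print("mean membership strength:", model.probabilities_.mean())
```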
🛠️ 3. Fitting in Python¶

```python
import hdbscan

model = hdbscan.HDBSCAN(min_cluster_size=10)
model.fit(X)

labels = model.labels_          # -1 = noise
probs = model.probabilities_    # soft cluster membership
```

⚠️ Requires feature scaling before fitting
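One way to keep the scaling step from being forgotten is a pipeline; a sketch assuming the `hdbscan` package's scikit-learn-compatible API (`X` is a placeholder feature matrix):

```python
import hdbscan
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling and clustering bundled in one object
pipe = make_pipeline(StandardScaler(), hdbscan.HDBSCAN(min_cluster_size=10))
labels = pipe.fit_predict(X)    # -1 = noise

# The fitted clusterer is the last pipeline step
probs = pipe[-1].probabilities_
```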
🔧 4. Key Parameters¶

| Param | Description |
|---|---|
| `min_cluster_size` | Minimum size for a dense cluster |
| `min_samples` | Influences how conservative the clustering is |
| `metric` | Distance function (e.g. `'euclidean'`) |
| `cluster_selection_method` | `'eom'` (default) or `'leaf'` |
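To get a feel for how these parameters interact, a small hedged sweep (the candidate values are illustrative, and `X_scaled` is assumed to be a scaled feature matrix as in the fitting section):

```python
import hdbscan

# Compare selection method and min_samples; watch cluster count and noise share
for method in ("eom", "leaf"):
    for min_samples in (None, 5, 20):        # None defaults to min_cluster_size
        model = hdbscan.HDBSCAN(
            min_cluster_size=10,
            min_samples=min_samples,
            metric="euclidean",
            cluster_selection_method=method,
        ).fit(X_scaled)
        n_clusters = model.labels_.max() + 1
        noise_share = (model.labels_ == -1).mean()
        print(f"{method:>4} min_samples={min_samples}: "
              f"{n_clusters} clusters, {noise_share:.0%} noise")
```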
📊 5. Evaluation + Visualization¶
| Tool | Purpose |
|---|---|
| Soft cluster probabilities | Helps visualize fuzzy memberships |
| t-SNE / UMAP | Project to 2-D to inspect the recovered cluster structure |
| Outlier scores (`outlier_scores_`) | Flag points that sit outside any dense region |
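A possible plotting sketch, assuming the `umap-learn` package and the fitted `model` and `X_scaled` from the earlier snippets (`outlier_scores_` is the `hdbscan` package's GLOSH score):

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

# 2-D embedding purely for plotting; clustering happened in the original space
embedding = umap.UMAP(random_state=42).fit_transform(X_scaled)

# Colour by label; alpha encodes soft membership, so weak points fade out
plt.scatter(embedding[:, 0], embedding[:, 1], c=model.labels_, cmap="tab10",
            alpha=np.clip(model.probabilities_, 0.05, 1.0), s=10)
plt.title("HDBSCAN clusters (alpha = membership strength)")
plt.show()

# GLOSH outlier scores: higher = more outlier-like
threshold = np.quantile(model.outlier_scores_, 0.95)
print("flagged outliers:", int((model.outlier_scores_ > threshold).sum()))
```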
⚠️ 6. Limitations¶
- Less intuitive than DBSCAN or KMeans
- Parameter tuning requires exploration
- No native `sklearn` support; requires the third-party `hdbscan` package (although scikit-learn ≥ 1.3 now ships `sklearn.cluster.HDBSCAN`)
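If the third-party dependency is a concern, a sketch of the native estimator, assuming scikit-learn ≥ 1.3:

```python
from sklearn.cluster import HDBSCAN  # native since scikit-learn 1.3

model = HDBSCAN(min_cluster_size=10).fit(X_scaled)

labels = model.labels_          # -1 = noise, same convention
probs = model.probabilities_    # soft membership, as in the hdbscan package
```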
✅ Checklist¶
- [ ] Data scaled before fitting
- [ ] `min_cluster_size` + `min_samples` tuned (see the sweep sketch below)
- [ ] Noise points reviewed (`label = -1`)
- [ ] Membership probabilities visualized or used
- [ ] Dimensionality reduction (e.g. UMAP) used to assist interpretation
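For the tuning item, one hedged approach is the `hdbscan` package's DBCV-style relative validity score; it requires `gen_min_span_tree=True`, and the candidate values below are illustrative:

```python
import hdbscan

best = None
for mcs in (5, 10, 25, 50):
    model = hdbscan.HDBSCAN(min_cluster_size=mcs, gen_min_span_tree=True).fit(X_scaled)
    score = model.relative_validity_     # DBCV-style score; higher is better
    print(f"min_cluster_size={mcs}: relative validity {score:.3f}")
    if best is None or score > best[0]:
        best = (score, mcs)
print("best min_cluster_size:", best[1])
```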
💡 Tip¶

"HDBSCAN doesn't ask for `k`. It lets your data tell you where the clusters live, and how sure it is."