HDBSCAN


🎯 Purpose

This QuickRef explains how to use HDBSCAN, a density-based clustering algorithm that extends DBSCAN to find clusters of varying density and to scale to larger datasets.


📦 1. When to Use

| Condition | Use HDBSCAN? |
| --- | --- |
| You want density-based clustering | ✅ Yes |
| DBSCAN fails due to varying densities | ✅ Yes |
| You want soft clustering (membership strength) | ✅ Yes |
| You want speed + scalability | ✅ Yes |
| Your data is small/simple | ❌ Use DBSCAN or KMeans |

🧮 2. Core Logic

  • Builds a hierarchy of density-based clusters
  • Condenses it into the most stable clusters
  • Assigns probability of membership to each point (soft clustering)
  • Automatically determines number of clusters (no k needed)

๐Ÿ› ๏ธ 3. Fitting in Python

import hdbscan
model = hdbscan.HDBSCAN(min_cluster_size=10)
model.fit(X)
labels = model.labels_  # -1 = noise
probs = model.probabilities_  # Soft cluster membership

โœ”๏ธ Requires feature scaling before fitting


🔧 4. Key Parameters

| Param | Description |
| --- | --- |
| min_cluster_size | Minimum number of points for a group to count as a cluster |
| min_samples | Larger values make the clustering more conservative (more points labeled noise); defaults to min_cluster_size |
| metric | Distance function (e.g. 'euclidean') |
| cluster_selection_method | 'eom' (default, favors stable clusters) or 'leaf' (favors fine-grained clusters) |

📊 5. Evaluation + Visualization

| Tool | Purpose |
| --- | --- |
| Soft cluster probabilities | Visualize fuzzy memberships (e.g. as point transparency) |
| t-SNE / UMAP | Project the data to 2-D to inspect cluster structure |
| Outlier scores | Flag points that fit poorly in any cluster (low-stability points) |

โš ๏ธ 6. Limitations

  • Less intuitive than DBSCAN or KMeans
  • Parameter tuning requires exploration
  • No native sklearn support (third-party package)

✅ Checklist

  • [ ] Data scaled before fitting
  • [ ] min_cluster_size + min_samples tuned
  • [ ] Noise points reviewed (label = -1)
  • [ ] Membership probabilities visualized or used
  • [ ] Dimensionality reduction (e.g. UMAP) used to assist interpretation

💡 Tip

“HDBSCAN doesn’t ask for k. It lets your data tell you where the clusters live, and how sure it is.”