
EDA Guidebook


🎯 Purpose

This guide outlines how to explore and prepare unlabeled data for clustering. It focuses on understanding structure, distribution, feature behavior, and conditions that influence model choices such as K-Means, DBSCAN, HDBSCAN, GMM, and hierarchical clustering.


🧠 1. Confirm Unsupervised Clustering Use Case

  • [ ] ✅ No target/label column (unsupervised setting)
  • [ ] ✅ Goal: segmentation, anomaly detection, or structure discovery
  • [ ] ✅ Interpretability vs. flexibility prioritized (e.g., centroid profiles vs. irregular cluster shapes)

📊 2. Shape & Structure of the Dataset

🔹 Dimensionality Check

X.shape  # rows, features

โœ”๏ธ Consider dimensionality reduction (PCA, UMAP) when d > 10

🔹 Missingness Summary

df.isnull().sum()

โœ”๏ธ Ensure consistent imputation or deletion strategy


📦 3. Feature Distribution & Skew

🔹 Histograms & KDEs

import seaborn as sns

sns.histplot(data=df, x='feature', kde=True)

โœ”๏ธ Flag skewed variables for transformation โœ”๏ธ Right-skewed features may benefit from log scaling

🔹 Boxplots

sns.boxplot(data=df.select_dtypes(include='number'))

โœ”๏ธ Identify univariate outliers


๐Ÿ” 4. Scaling & Normalization Prep

🔹 Scaling Check

df.describe()  # compare mean, std, max across features

โœ”๏ธ Normalize for KMeans, DBSCAN, GMM if using distance-based models โœ”๏ธ Tree-based distance methods (rare) may skip this step

🔹 Suggested Transformers

from sklearn.preprocessing import StandardScaler, MinMaxScaler
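A minimal sketch of the standardization step (the two-feature array `X` is invented to show the scale mismatch; MinMaxScaler is a drop-in alternative when a bounded [0, 1] range is preferred):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales (illustrative)
X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean ~0 and unit variance,
# so neither feature dominates Euclidean distances
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```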

🧮 5. Correlation + Redundancy

🔹 Correlation Heatmap

import seaborn as sns

# numeric_only avoids errors on mixed-dtype frames (pandas >= 2.0)
sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm', annot=False)

โœ”๏ธ Identify strongly correlated pairs โœ”๏ธ Consider PCA or feature pruning if redundancy > 0.85


๐Ÿ” 6. Outlier Detection (Pre-Clustering)

🔹 Z-Score or IQR-Based Filtering

import numpy as np
from scipy.stats import zscore

numeric = df.select_dtypes(include='number')
outlier_mask = (np.abs(zscore(numeric)) > 3).any(axis=1)  # rows with any |z| > 3

โœ”๏ธ Remove/flag high outliers to avoid skewing cluster centers

🔹 Visual Outlier Inspection

  • Boxplot
  • PCA scatter with density shading

๐ŸŒ 7. Dimensionality Reduction (for Visual Inspection)

🔹 PCA or UMAP for 2D Shape Review

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

โœ”๏ธ Plot with color coding to visually inspect for clusters โœ”๏ธ Compare structure across scaled vs unscaled data


📋 Analyst EDA Checklist for Clustering

  • [ ] Dataset has no target label
  • [ ] Numeric and categorical features separated and reviewed
  • [ ] Distributions checked for skew, outliers, and scale variance
  • [ ] Features scaled using StandardScaler or MinMaxScaler
  • [ ] Correlation heatmap used to identify redundancy
  • [ ] PCA or UMAP previewed to inspect shape
  • [ ] Feature ranges aligned (avoid dominance by one feature)
  • [ ] Outlier strategy documented

💡 Final Tip

"Clustering finds what you feed it — scale, shape, and outliers are just as important as data volume."

Use this before: K-Means, DBSCAN, GMM, HDBSCAN, Spectral, or Hierarchical clustering.