EDA Guidebook


🎯 Purpose

This guidebook outlines how to prepare and explore data for binary or multiclass classification. It focuses on class structure, class balance, feature relevance, and the modeling assumptions relevant to tree-based, linear, or probabilistic classifiers.


🧠 1. Confirm Problem Structure

  • [ ] βœ… Target variable is categorical
  • [ ] βœ… Task is binary or multiclass
  • [ ] βœ… Labels are clean, interpretable, and non-null
  • [ ] βœ… Goal is classification (not regression or clustering)
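
A quick way to confirm these points in code (a minimal sketch, assuming a pandas DataFrame df with the label in a 'target' column, as in the snippets below):

df['target'].dtype           # should be object/category/integer codes, not a continuous float
df['target'].isna().sum()    # labels should be non-null (expect 0 here)
df['target'].nunique()       # 2 = binary, >2 = multiclass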

πŸ§ͺ 2. Class Distribution Assessment

πŸ”Ή Frequency Plot

import seaborn as sns
sns.countplot(x='target', data=df)   # bar of row counts per class
  • Check for class imbalance
  • Consider resampling if the imbalance is severe (roughly 90/10 or worse)

πŸ”Ή Percent Breakdown

df['target'].value_counts(normalize=True)   # class proportions (fractions summing to 1)
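
To turn the percentage breakdown into an explicit imbalance flag, the same output can be checked against the 90/10 rule of thumb above (a minimal sketch):

class_share = df['target'].value_counts(normalize=True)
if class_share.max() > 0.90:
    print('Severe imbalance -- consider resampling before modeling')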

πŸ“Š 3. Feature Distribution by Class

πŸ”Ή Continuous Features (Numeric)

sns.boxplot(x='target', y='feature', data=df)                        # spread of 'feature' per class
sns.kdeplot(data=df, x='feature', hue='target', common_norm=False)   # per-class density curves
  • Use boxplots or KDEs to compare feature separation by class
  • Assess if numeric predictors differ meaningfully between classes

πŸ”Ή Categorical Features

# Share of each class within every category of 'feature' (rows sum to 1)
pd.crosstab(df['feature'], df['target'], normalize='index').plot(kind='bar', stacked=True)
  • Use stacked bar plots or heatmaps to compare class proportions
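
For the heatmap variant, the same row-normalized crosstab can be passed to seaborn (a minimal sketch using the same placeholder 'feature' and 'target' columns):

import pandas as pd
import seaborn as sns

props = pd.crosstab(df['feature'], df['target'], normalize='index')   # rows sum to 1
sns.heatmap(props, annot=True, fmt='.2f', cmap='Blues')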

πŸ” 4. Feature Importance Exploration

πŸ”Ή Correlation Matrix (for numeric-only)

sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm', annot=False)   # pairwise correlations among numeric features

πŸ”Ή Chi-Squared or Mutual Info (categorical vs target)

from sklearn.feature_selection import mutual_info_classif
  • Use for early screening of features when the label is discrete (see the sketch below)
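
A minimal usage sketch, assuming the feature columns are already numerically encoded (X and y are hypothetical names, not defined elsewhere in this guide):

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

X = df.drop(columns='target')   # assumes all remaining columns are numeric
y = df['target']

mi = mutual_info_classif(X, y, discrete_features='auto', random_state=0)
pd.Series(mi, index=X.columns).sort_values(ascending=False)   # higher = more informative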

πŸ“¦ 5. Missingness and Preprocessing Flags

  • [ ] Features with >30% missing reviewed (see the snippet after this list)
  • [ ] Categorical variables encoded or grouped
  • [ ] Cardinality checked (very high cardinality may need grouping)
  • [ ] Potential data leakage fields flagged and excluded
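
A short sketch for the missingness and cardinality items above (assumes df is the modeling DataFrame):

missing_share = df.isna().mean().sort_values(ascending=False)
missing_share[missing_share > 0.30]                       # features over the 30% threshold

cat_cols = df.select_dtypes(include=['object', 'category']).columns
df[cat_cols].nunique().sort_values(ascending=False)       # very high counts may need grouping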

πŸ“ˆ 6. Linearity or Separability (for linear classifiers)

  • [ ] Basic scatterplots for top features
  • [ ] Pair plots grouped by class
  • [ ] Optional: PCA/UMAP to check if data clusters by class
from sklearn.decomposition import PCA
  • Use for visualizing 2D class separability
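
A minimal 2D projection sketch, assuming numeric features in a hypothetical X and labels in y; features are standardized first because PCA is scale-sensitive:

import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

pcs = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
sns.scatterplot(x=pcs[:, 0], y=pcs[:, 1], hue=y)   # visible clusters suggest separable classes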

🧩 7. Feature Interactions (Optional)

  • [ ] Feature-feature scatterplots by class
  • [ ] Cross-feature binned grouping (e.g., feature A ranges vs feature B categories)
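
One way to do the binned cross-feature grouping (a sketch with hypothetical columns 'featureA' (continuous) and 'featureB' (categorical), and a 0/1 binary target):

import pandas as pd

# Positive-class rate in each (binned featureA, featureB) cell -- assumes a 0/1 target
df['featureA_bin'] = pd.cut(df['featureA'], bins=4)
df.groupby(['featureA_bin', 'featureB'], observed=True)['target'].mean().unstack()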

πŸ“‹ Analyst EDA Checklist for Classifiers

  • [ ] Target label is clean and categorical
  • [ ] Class imbalance noted (and strategy defined if needed)
  • [ ] Feature distributions assessed by class
  • [ ] Numeric features inspected for separation
  • [ ] Categorical variables analyzed for group skew
  • [ ] Key predictors and weak features identified
  • [ ] PCA/UMAP used to assess broad structure (optional)
  • [ ] Missing data and leakage reviewed

πŸ’‘ Final Tip

β€œGood classification starts with clean, interpretable classes and predictive features. Let structureβ€”not algorithmsβ€”guide model design.”

Use this checklist before training logistic regression, tree-based classifiers, or ensemble models.