# EDA Guidebook

## Purpose
This guidebook outlines how to prepare and explore data for binary or multiclass classification. It focuses on evaluating class structure, feature relevance, balance, and the modeling assumptions relevant to tree-based, linear, or probabilistic classifiers.
## 1. Confirm Problem Structure
- [ ] Target variable is categorical
- [ ] Task is binary or multiclass
- [ ] Labels are clean, interpretable, and non-null
- [ ] Goal is classification (not regression or clustering)
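These checks can be run up front. A minimal sketch, assuming a DataFrame `df` with a `target` column; the toy frame below is hypothetical, so substitute your own data:

```python
import pandas as pd

# Hypothetical toy frame; substitute your own df and target column.
df = pd.DataFrame({"target": ["yes", "no", "yes", "no", "yes"],
                   "feature": [1.0, 2.5, 0.3, 4.1, 2.2]})

assert df["target"].isna().sum() == 0      # labels are non-null
n_classes = df["target"].nunique()
assert n_classes >= 2                      # binary (2) or multiclass (>2)
class_labels = sorted(df["target"].unique())
```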
## 2. Class Distribution Assessment
### Frequency Plot

```python
import seaborn as sns

sns.countplot(x='target', data=df)
```
- Check for class imbalance
- Consider resampling or class weighting if the imbalance is severe (e.g., worse than 90/10)
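If the split is that lopsided, random oversampling of the minority class is one simple option. A minimal sketch with `sklearn.utils.resample` on a hypothetical 90/10 frame (class weighting or SMOTE are alternatives):

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical 90/10 frame; substitute your own df.
df = pd.DataFrame({"target": ["a"] * 90 + ["b"] * 10, "x": range(100)})

counts = df["target"].value_counts()
if counts.max() / counts.min() >= 9:       # roughly the 90/10 threshold
    majority = df[df["target"] == counts.idxmax()]
    minority = df[df["target"] == counts.idxmin()]
    upsampled = resample(minority, replace=True,
                         n_samples=len(majority), random_state=0)
    df = pd.concat([majority, upsampled], ignore_index=True)
```

Oversample only the training split; duplicating minority rows before splitting leaks copies into the evaluation set.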
### Percent Breakdown

```python
df['target'].value_counts(normalize=True)
```
## 3. Feature Distribution by Class

### Continuous (Numeric) Features

```python
import seaborn as sns

sns.boxplot(x='target', y='feature', data=df)
sns.kdeplot(data=df, x='feature', hue='target', common_norm=False)
```
- Use boxplots or KDEs to compare feature separation by class
- Assess if numeric predictors differ meaningfully between classes
### Categorical Features

```python
import pandas as pd

pd.crosstab(df['feature'], df['target'], normalize='index').plot(kind='bar', stacked=True)
```
- Use stacked bar plots or heatmaps to compare class proportions
## 4. Feature Importance Exploration

### Correlation Matrix (numeric features only)

```python
import seaborn as sns

sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm', annot=False)
```
### Chi-Squared or Mutual Information (categorical features vs. target)

```python
from sklearn.feature_selection import mutual_info_classif
```

- Use for early screening of features when the label is discrete
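A minimal screening sketch on a hypothetical toy frame, where `x1` determines the label and `x2` is noise; higher scores mean the feature carries more information about the target:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Hypothetical toy data: x1 determines the label, x2 is random noise.
df = pd.DataFrame({"x1": [0, 0, 0, 1, 1, 1] * 10,
                   "x2": [0, 1, 1, 0, 0, 1] * 10,
                   "target": [0, 0, 0, 1, 1, 1] * 10})

scores = mutual_info_classif(df[["x1", "x2"]], df["target"],
                             discrete_features=True, random_state=0)
ranking = dict(zip(["x1", "x2"], scores))   # x1 should score far higher
```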
## 5. Missingness and Preprocessing Flags
- [ ] Features with >30% missing values reviewed
- [ ] Categorical variables encoded or grouped
- [ ] Cardinality checked (very high cardinality may need grouping)
- [ ] Potential data leakage fields flagged and excluded
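The missingness and cardinality flags can be computed directly; a minimal sketch on a hypothetical frame (leakage review still needs domain judgment):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'age' is mostly missing, 'user_id' is high-cardinality.
df = pd.DataFrame({"age": [25, np.nan, 40, np.nan, np.nan],
                   "user_id": ["a", "b", "c", "d", "e"],
                   "target": [0, 1, 0, 1, 0]})

missing_pct = df.isna().mean()              # fraction missing per column
high_missing = missing_pct[missing_pct > 0.30].index.tolist()

cardinality = df.select_dtypes("object").nunique()
high_card = cardinality[cardinality > 0.8 * len(df)].index.tolist()
```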
## 6. Linearity or Separability (for linear classifiers)
- [ ] Basic scatterplots for top features
- [ ] Pair plots grouped by class
- [ ] Optional: PCA/UMAP to check if data clusters by class
```python
from sklearn.decomposition import PCA
```

- Use for visualizing 2D class separability
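A minimal separability sketch on hypothetical synthetic data: two Gaussian classes offset in feature space, projected to two dimensions for plotting:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: two classes offset along every feature axis.
X = np.vstack([rng.normal(0, 1, (50, 5)),
               rng.normal(4, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)

coords = PCA(n_components=2).fit_transform(X)
# Scatter coords[:, 0] vs coords[:, 1] coloured by y (e.g. with
# matplotlib) to eyeball whether the classes form separate clusters.
```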
## 7. Feature Interactions (Optional)
- [ ] Feature-feature scatterplots by class
- [ ] Cross-feature binned grouping (e.g., feature A ranges vs feature B categories)
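The cross-feature binned grouping can be sketched with `pd.cut` plus a group-by; the frame and column names below are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical frame: numeric feature A, categorical feature B, binary target.
df = pd.DataFrame({"feature_a": rng.uniform(0, 100, 200),
                   "feature_b": rng.choice(["x", "y"], 200),
                   "target": rng.choice([0, 1], 200)})

df["a_bin"] = pd.cut(df["feature_a"], bins=[0, 25, 50, 75, 100])
# Positive-class rate within each (A-range, B-category) cell:
rates = df.groupby(["a_bin", "feature_b"], observed=True)["target"].mean()
```

Cells whose rate differs sharply from the overall positive rate suggest an interaction worth engineering into the model.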
## Analyst EDA Checklist for Classifiers
- [ ] Target label is clean and categorical
- [ ] Class imbalance noted (and strategy defined if needed)
- [ ] Feature distributions assessed by class
- [ ] Numeric features inspected for separation
- [ ] Categorical variables analyzed for group skew
- [ ] Key predictors and weak features identified
- [ ] PCA/UMAP used to assess broad structure (optional)
- [ ] Missing data and leakage reviewed
## Final Tip

> "Good classification starts with clean, interpretable classes and predictive features. Let structure, not algorithms, guide model design."

Use this guide before training logistic regression, tree-based classifiers, or ensemble models.