# EDA Guidebook
## Purpose
This guide outlines the essential steps for exploratory data analysis (EDA) before fitting a logistic regression model. Logistic regression models the relationship between one or more independent variables and a binary or categorical outcome.
## 1. Understand the Outcome Variable

### Goal
Confirm that your dependent variable (DV) is binary (0/1 or two distinct classes).
### EDA Tasks

- Use `.value_counts()` to confirm class balance
- Create a bar plot of class frequencies
### Example

```python
import pandas as pd
import seaborn as sns

df['outcome'].value_counts()
sns.countplot(x='outcome', data=df)
```
### Notes
- Severe imbalance (e.g., 95%/5%) may require oversampling or synthetic data techniques like SMOTE.
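If you need to rebalance before modeling, here is a minimal oversampling sketch assuming the `imbalanced-learn` package and numeric-only features (`X`, `y`, and the column names are illustrative):

```python
from imblearn.over_sampling import SMOTE  # assumes imbalanced-learn is installed

X = df.drop(columns='outcome')  # SMOTE expects numeric features
y = df['outcome']
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print(y_resampled.value_counts())  # classes should now be balanced
```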
## 2. Explore Categorical Predictors

### Goal
Examine how the distribution of categories relates to the outcome.
### EDA Tasks
- Create stacked bar plots of outcome by category
- Use chi-squared tests to check for association (see the sketch below)
### Example

```python
pd.crosstab(df['category'], df['outcome'], normalize='index').plot(kind='bar', stacked=True)
```
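The chi-squared test mentioned above can be run with SciPy; a minimal sketch, assuming `category` and `outcome` are columns of `df`:

```python
from scipy.stats import chi2_contingency

table = pd.crosstab(df['category'], df['outcome'])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f'chi2={chi2:.2f}, p={p_value:.4f}')  # a small p-value suggests an association
```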
## 3. Explore Numerical Predictors

### Goal
Assess whether numerical predictors differ between outcome classes.
### EDA Tasks
- Use histograms or KDE plots split by outcome
- Compare distributions with boxplots
- Use logistic-fit scatterplots for intuition (see the sketch below)
### Example

```python
sns.boxplot(x='outcome', y='age', data=df)
sns.histplot(data=df, x='income', hue='outcome', kde=True)
```
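For the logistic-fit scatterplot, seaborn's `regplot` can overlay a fitted logistic curve (this requires `statsmodels` to be installed; column names are illustrative):

```python
# Jitter the 0/1 outcome slightly so overlapping points are visible
sns.regplot(x='age', y='outcome', data=df, logistic=True, y_jitter=0.03)
```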
## 4. Linearity of the Logit (For Continuous Variables)

### Goal
Check that continuous predictors are linearly related to the log odds of the outcome.
### EDA Tasks
- Create binned line plots of predictor vs log-odds
- Use the Box-Tidwell test (optional; see the sketch under Notes)
### Example

```python
import numpy as np

df['age_bins'] = pd.qcut(df['age'], q=10)
p = df.groupby('age_bins')['outcome'].mean()
np.log(p / (1 - p)).plot(marker='o')  # empirical log-odds per bin; bins with p of exactly 0 or 1 yield infinities
```
### Notes

If the relationship is non-linear, consider:

- Log or square root transforms
- Splines or polynomial terms
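A minimal Box-Tidwell-style sketch with `statsmodels`, assuming a strictly positive predictor `age` (column names are illustrative). A significant coefficient on the `x * ln(x)` term suggests the logit is not linear in that predictor:

```python
import numpy as np
import statsmodels.api as sm

X = df[['age']].copy()
X['age_log_age'] = df['age'] * np.log(df['age'])  # Box-Tidwell term; requires age > 0
X = sm.add_constant(X)
fit = sm.Logit(df['outcome'], X).fit(disp=0)
print(fit.pvalues['age_log_age'])  # small p-value flags non-linearity in the logit
```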
## 5. Multicollinearity Check

### Goal
Avoid using highly correlated predictors that distort coefficient interpretation.
### EDA Tasks
- Create a correlation heatmap
- Calculate the Variance Inflation Factor (VIF; see the sketch below)
### Example

```python
sns.heatmap(df.corr(numeric_only=True), annot=True)
```
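The VIF calculation can be done with `statsmodels`; a minimal sketch with illustrative predictor columns:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[['age', 'income']])  # hypothetical numeric predictors
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # values above ~5-10 are commonly flagged as problematic
```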
## 6. Optional: Early Visualization of Predicted Probabilities (Advanced)

If preliminary model predictions are available (e.g., from a baseline model or pre-computed scores), you can inspect the distribution of predicted probabilities to assess class separation early.

### Note

Predicted probabilities are only available after fitting a model and calling `.predict_proba(X)`. They are not typically part of early EDA unless you are given pre-computed model scores or have a baseline model already trained.

### Example

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Assume `model` is an already-fitted classifier and `y_true` holds the true labels
proba = model.predict_proba(X)[:, 1]  # probability of class 1

sns.histplot(x=proba, hue=y_true, bins=30, kde=True, stat='density', common_norm=False)
plt.axvline(x=0.5, color='red', linestyle='--', label='Default threshold 0.5')
plt.xlabel('Predicted Probability')
plt.title('Distribution of Predicted Probabilities (Optional Early EDA)')
plt.legend()
plt.show()
```

Well-separated classes show probability mass concentrated near 0 and 1; heavy overlap around 0.5 signals weak separation.
## 7. Feature Binning or Transformation (Optional)

### Goal
Enhance model performance, improve interpretability, and correct skewness or nonlinear effects that might impact logistic regression assumptions.
### Key Techniques and When to Use Them

#### 1. Feature Binning
- Convert continuous variables into discrete categories.
- Helps capture non-linear relationships between predictors and the log-odds of the outcome.
Use when:

- The relationship between predictor and target is not strictly linear
- Extreme outliers are impacting interpretation
- You want to simplify decision thresholds

```python
import pandas as pd

# Quantile-based binning (equal number of samples per bin)
df['binned_feature'] = pd.qcut(df['feature'], q=4)

# Fixed-width binning
df['binned_feature'] = pd.cut(df['feature'], bins=5)
```
#### 2. Feature Transformation
- Reduce skewness, stabilize variance, and approximate linearity with the logit.
Common methods:
```python
import numpy as np

# Log transformation (handles skew)
df['log_feature'] = np.log1p(df['feature'])  # log(1 + x) handles zeros safely
```
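The summary table below also lists power transforms; a minimal sketch using scikit-learn's `PowerTransformer` (an assumption; any power-transform utility works similarly):

```python
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson handles zero and negative values; Box-Cox requires positive input
pt = PowerTransformer(method='yeo-johnson')
df['pt_feature'] = pt.fit_transform(df[['feature']])
```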
### Visualizing the Effect of Binning or Transformation

```python
import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df['feature'], kde=True, ax=axes[0])
axes[0].set_title('Before Transformation')
sns.histplot(df['log_feature'], kde=True, ax=axes[1])
axes[1].set_title('After Log Transformation')
plt.show()
```
## Summary Table

| Task | Tool / Method | Use When… |
|---|---|---|
| Check outcome balance | `value_counts()`, bar plot | Always |
| Explore categorical predictors | Cross-tab, bar plots | Always (for categorical features) |
| Explore numeric distributions | Boxplot, histogram, KDE | Always (for numeric features) |
| Linearity of logit check | Binned log-odds plot | Advanced: numeric predictors |
| Outlier detection | Boxplot, IQR filtering | Strong skews or extreme values |
| Multicollinearity check | Heatmap, VIF | Always with multiple predictors |
| Feature binning / transformation | `qcut`, `cut`, log, power transforms | Nonlinear or highly skewed features |
## Related Notes
- [[Links]]