Advanced Visual EDA
Purpose
This guide provides best practices for conducting visual exploratory data analysis (EDA) before fitting a logistic regression model. The goal is to understand variable relationships, class balance, separation, and possible violations of assumptions.
1. Class Balance Check
Count Plot or Bar Chart
Shows the distribution of the binary target variable (e.g., 0 vs 1).
import seaborn as sns  # df is assumed to be a pandas DataFrame with a binary "target" column
sns.countplot(x="target", data=df)
Why:
- Identify class imbalance
- Choose evaluation metrics accordingly (e.g., precision vs. recall focus)
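The count plot can be backed by exact numbers. The following is a minimal sketch, assuming a pandas DataFrame with a binary `target` column; the toy data is hypothetical and only there to make the example runnable.

```python
import pandas as pd

# Hypothetical toy data; "target" mirrors the column name used in this guide.
df = pd.DataFrame({"target": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]})

# Absolute counts and proportions per class, ordered by class label.
counts = df["target"].value_counts().sort_index()
shares = df["target"].value_counts(normalize=True).sort_index()

# Ratio of the majority class to the minority class.
imbalance_ratio = counts.max() / counts.min()

print(pd.DataFrame({"count": counts, "share": shares}))
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

A ratio far from 1 suggests that plain accuracy may be misleading and that precision, recall, or class weighting deserve attention.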
2. Continuous Predictor vs Binary Outcome
Disc Plot (a.k.a. Strip/Swarm Plot)
Shows how a continuous variable relates to the binary target.
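# orient="h" makes the binary target the categorical axis; without it, each distinct feature value is treated as its own category
sns.stripplot(x="feature", y="target", data=df, orient="h", jitter=True, hue="target")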
Why:
- Visualize class separation
- Detect non-linear patterns
Box Plot / Violin Plot
Summarizes feature distributions per class.
sns.boxplot(x="target", y="feature", data=df)
Why:
- Compare median, spread, and outliers
- Spot potential predictors with clear group separation
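The same comparison can be made numerically. The sketch below, on hypothetical toy data, prints the per-class quartiles that a box plot draws and adds a standardized mean difference (Cohen's d with a pooled standard deviation). The effect-size measure is a suggestion added here, not something the plots themselves compute.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using the guide's "feature"/"target" column names.
df = pd.DataFrame({
    "feature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "target":  [0,   0,   0,   0,   1,   1,   1,   1],
})

# Count, mean, quartiles, and extremes per class: the numbers behind a box plot.
summary = df.groupby("target")["feature"].describe()

# Standardized mean difference (Cohen's d with a pooled sample standard deviation).
g0 = df.loc[df["target"] == 0, "feature"]
g1 = df.loc[df["target"] == 1, "feature"]
pooled_sd = np.sqrt(
    ((len(g0) - 1) * g0.var() + (len(g1) - 1) * g1.var())
    / (len(g0) + len(g1) - 2)
)
cohens_d = (g1.mean() - g0.mean()) / pooled_sd

print(summary)
print(f"Cohen's d: {cohens_d:.2f}")
```

Ranking features by the absolute value of `cohens_d` is one rough way to decide which box plots to inspect first when there are many predictors.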
Disc Plot vs Confusion Matrix
A disc plot provides a feature-level preview of separation before classification. A confusion matrix summarizes final model performance. Use both to:
- Anticipate separability before training
- Check whether visual trends translate into predictive accuracy
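One rough way to connect the two views before any model exists is to pick a cutoff by eye from the disc plot and cross-tabulate the resulting rule against the true labels. This is only a sketch on hypothetical data: the `CUTOFF` value is an assumption, and a single-feature threshold is a stand-in for a fitted classifier, not a replacement for one.

```python
import pandas as pd

# Hypothetical toy data using the guide's "feature"/"target" column names.
df = pd.DataFrame({
    "feature": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0],
    "target":  [0,   0,   0,   1,   0,   1,   1,   1],
})

CUTOFF = 2.25  # assumed cutoff, read off the plot by eye

# Predict class 1 whenever the feature exceeds the cutoff.
predicted = (df["feature"] > CUTOFF).astype(int).rename("predicted")

# Rows are actual classes, columns are predicted classes.
cm = pd.crosstab(df["target"].rename("actual"), predicted)
accuracy = (predicted == df["target"]).mean()

print(cm)
print(f"accuracy: {accuracy:.2f}")
```

If a single cutoff already yields a clean table, the feature is likely to be a strong predictor; heavy off-diagonal counts match the overlap visible in the plot.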
3. Feature Correlation (to Detect Multicollinearity)
Heatmap of Correlation Matrix
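# numeric_only=True skips non-numeric columns (pandas 2.0+ raises on them otherwise); vmin/vmax pin the color scale to [-1, 1]
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", vmin=-1, vmax=1)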
Why:
- Detect highly correlated predictors
- Guide feature selection or regularization (Ridge, Lasso)
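With many predictors a heatmap gets hard to read, so it can help to list only the pairs above a cutoff. A minimal sketch on hypothetical data follows; the 0.8 threshold is an assumed value, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy predictors; x2 is nearly a multiple of x1 on purpose.
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
})

THRESHOLD = 0.8  # assumed cutoff; pick one that suits your problem

corr = df.corr().abs()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (always 1.0) is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to a Series indexed by (feature_a, feature_b) and filter.
pairs = upper.stack()
high_pairs = pairs[pairs > THRESHOLD].sort_values(ascending=False)

print(high_pairs)
```

Here only the `("x1", "x2")` pair survives the filter. Pairwise correlation does not catch collinearity that involves three or more predictors at once, so treat an empty list as reassuring rather than conclusive.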
4. Relationships Among Features
Pair Plot (for small feature sets)
Explore pairwise relationships between continuous predictors, with points colored by class.
sns.pairplot(df, hue="target")
Why:
- Understand interaction patterns
- See class clustering/separation
5. Logistic Curve Fit (Optional Visual Check)
Fit a logistic curve between a single continuous predictor and the binary target.
# logistic=True requires statsmodels to be installed, and the target must be coded 0/1
sns.regplot(x="feature", y="target", data=df, logistic=True)
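Why:
- Check whether the fitted S-curve tracks the observed outcomes; the curve itself assumes a linear logit, so systematic gaps hint at a violation
- Detect saturation or threshold effects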
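Because the fitted curve builds the linear-logit assumption in, a more direct check is to bin the feature and compute the empirical log-odds per bin. The sketch below uses hypothetical synthetic data whose true log-odds are linear in `feature`; the number of bins and the 0.5 continuity correction are assumed choices.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: the true log-odds are linear in "feature".
rng = np.random.default_rng(0)
feature = rng.uniform(-3, 3, size=2000)
prob = 1 / (1 + np.exp(-(0.5 + 1.2 * feature)))
df = pd.DataFrame({"feature": feature, "target": rng.binomial(1, prob)})

# Split the feature into equal-count bins (5 is an arbitrary choice).
df["bin"] = pd.qcut(df["feature"], q=5)

grouped = df.groupby("bin", observed=True)
binned = pd.DataFrame({
    "feature_mean": grouped["feature"].mean(),
    "n": grouped["target"].size(),
    "events": grouped["target"].sum(),
})

# Empirical logit; the 0.5 correction keeps bins with all 0s or all 1s finite.
binned["empirical_logit"] = np.log(
    (binned["events"] + 0.5) / (binned["n"] - binned["events"] + 0.5)
)

print(binned)
```

Plotting `empirical_logit` against `feature_mean` should give a roughly straight line when the assumption holds. Clear curvature or a plateau suggests a transformation, a spline, or a binned version of the feature.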
Summary Table
Visualization | Purpose |
---|---|
Count Plot | Check class imbalance |
Disc/Strip Plot | See class separation by feature |
Box/Violin Plot | Compare distributions between classes |
Correlation Heatmap | Detect multicollinearity |
Pair Plot | Explore predictor relationships and class overlap |
Logistic Curve | Visualize logit relationship to a feature |
Disc Plot + Confusion Matrix | Compare predicted vs visual separability |
Next Step
Once you complete visual EDA:
- Scale numeric features if needed
- Encode categorical variables
- Begin model building and check assumptions (linearity of the logit, multicollinearity)
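The first two steps can be sketched with pandas alone. The column names and data below are hypothetical, and standardization is only one of several possible scaling choices.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age":    [20.0, 30.0, 40.0, 50.0],
    "city":   ["A", "B", "A", "C"],
    "target": [0, 1, 0, 1],
})

numeric_cols = ["age"]
categorical_cols = ["city"]

X = df.drop(columns="target")
y = df["target"]

# Standardize numeric columns to mean 0 and sample standard deviation 1.
X[numeric_cols] = (X[numeric_cols] - X[numeric_cols].mean()) / X[numeric_cols].std()

# One-hot encode categoricals; drop_first removes one level per variable
# so the dummy columns are not perfectly collinear with the intercept.
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True, dtype=int)

print(X)
```

In a real workflow the scaling statistics would normally be computed on the training split only and then applied to the test split, to avoid leaking information.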
Related Notes
- [[Links]]