
EDA Guidebook


🎯 Purpose

This guide outlines structured exploratory data analysis (EDA) steps to prepare a dataset for linear regression. It focuses on verifying assumptions, inspecting variable relationships, detecting skewness and outliers, and diagnosing multicollinearity prior to model fitting.


🧠 1. Confirm Problem Structure

  • [ ] ✅ Target variable is continuous
  • [ ] ✅ Input features are mostly numeric (or encoded)
  • [ ] ✅ No excessive missing values in the target
  • [ ] ✅ Modeling goal clarified: prediction, inference, or interpretation
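These structural checks can be scripted. A minimal sketch, assuming a pandas DataFrame `df` whose target column is named `"y"` (both names are illustrative):

```python
import pandas as pd

# Toy data standing in for your real dataset
df = pd.DataFrame({
    "y":  [1.0, 2.5, None, 4.0],
    "x1": [1, 2, 3, 4],
    "x2": ["a", "b", "a", "b"],
})

# Target should be numeric/continuous
assert pd.api.types.is_numeric_dtype(df["y"])

# Share of missing values in the target
target_missing = df["y"].isna().mean()
print(f"Missing in target: {target_missing:.0%}")

# Share of input features that are already numeric
numeric_share = df.drop(columns="y").select_dtypes("number").shape[1] / (df.shape[1] - 1)
print(f"Numeric feature share: {numeric_share:.0%}")
```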

📊 2. Target Variable Assessment

🔹 Distribution Check

import seaborn as sns
sns.histplot(y, kde=True)
  • ⚠️ Strong skew? Consider a log or Box-Cox transformation
  • 🔍 Check for outliers or a multi-modal shape

🔹 Normality Tests (optional)

from scipy.stats import shapiro, normaltest
stat, p = shapiro(y)  # small p (< 0.05) suggests non-normality
  • Normality supports inference (p-values, confidence intervals) but is not required for prediction; strictly, the assumption applies to the residuals, not to y itself

📈 3. Feature-Target Relationship Checks

🔹 Correlation Matrix (Numeric)

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
  • Identify predictors with strong linear associations to y
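Beyond the heatmap, it helps to rank predictors by absolute correlation with the target. A sketch on toy data, assuming the target column is named `"y"`:

```python
import pandas as pd

df = pd.DataFrame({
    "y":  [1, 2, 3, 4, 5],
    "x1": [2, 4, 6, 8, 10],  # perfectly correlated with y
    "x2": [5, 3, 4, 1, 2],   # negatively correlated with y
})

# Correlation of every feature with the target, strongest first
corr_with_y = df.corr(numeric_only=True)["y"].drop("y")
print(corr_with_y.abs().sort_values(ascending=False))
```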

🔹 Scatterplots

sns.scatterplot(x=X['feature'], y=y)
  • Look for linear trend, outliers, variance changes

🔹 Linearity of Relationship

  • Add a LOESS smoother to each feature-vs-y scatterplot (even pre-model); systematic curvature suggests polynomial terms or a transformation
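One way to get that LOESS trend numerically is statsmodels' `lowess` smoother. A sketch on synthetic curved data (the quadratic relationship is invented for illustration):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 0.5 * x**2 + rng.normal(scale=2.0, size=x.size)  # curved, not linear

# lowess returns (x, smoothed y) pairs sorted by x; overlay them on a
# scatterplot of (x, y). If the smoothed curve bends away from a straight
# line, consider a quadratic term or a transformation before fitting OLS.
smoothed = lowess(y, x, frac=0.3)
print(smoothed[:3])
```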

📦 4. Skewness & Transformation Candidates

🔹 Assess Skew

from scipy.stats import skew
skew(X['feature'])
  • |skew| > 1.0 → consider a transformation (log, sqrt, Box-Cox)

🔹 Visualize Distribution

sns.histplot(X['feature'], kde=True)
  • Use log scale or apply transformation for right-skewed features
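Putting both checks together, a sketch that screens every numeric feature and transforms the flagged ones (the feature names and data are invented):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

rng = np.random.default_rng(1)
X = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=500),  # right-skewed
    "age":    rng.normal(loc=40, scale=10, size=500),     # roughly symmetric
})

# Flag features with |skew| > 1.0 as transformation candidates
skews = X.apply(skew)
candidates = skews[skews.abs() > 1.0].index.tolist()
print(skews.round(2))
print("Transform candidates:", candidates)

# log1p handles zeros safely (log(1 + x))
X["income_log"] = np.log1p(X["income"])
```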

🧹 5. Outlier Detection

🔹 Boxplots

sns.boxplot(data=X, orient='h')
  • Visually flag outliers per feature

🔹 Z-Scores (Numerical Outliers)

from scipy.stats import zscore
z = zscore(X.select_dtypes(include='number'))
  • |z| > 3 → potential outlier (assuming roughly normal features)
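Flagging the offending rows from those z-scores, sketched on toy data with one planted outlier:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

rng = np.random.default_rng(7)
X = pd.DataFrame({
    "a": np.append(rng.normal(2.0, 0.5, size=29), 100.0),  # one gross outlier
    "b": rng.uniform(9.0, 11.0, size=30),                  # bounded, well-behaved
})

# Column-wise z-scores (ddof=0 by default), then flag any |z| > 3
z = X.select_dtypes(include="number").apply(zscore)
flagged = z.abs() > 3
print(X.loc[flagged.any(axis=1)])  # rows containing at least one extreme value
```

Note the threshold needs enough rows to bite: in a sample of n points, |z| can never exceed (n-1)/√n, so with very few rows nothing will ever be flagged.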

πŸ” 6. Multicollinearity Diagnostics

πŸ”Ή Correlation Matrix

  • Look for feature pairs > 0.85

🔹 Variance Inflation Factor (VIF)

from statsmodels.stats.outliers_influence import variance_inflation_factor
  • VIF above 5–10 indicates problematic multicollinearity

🔹 Dimensionality Reduction (optional)

  • PCA or feature pruning for highly redundant predictors

🔎 7. Optional Binning / Feature Engineering

  • Group continuous variables into bins when the relationship is non-linear but monotonic
  • Create interaction terms where you suspect synergy between features
  • Apply log transformations where skew is heavy or variance grows with the mean
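The three ideas above in pandas, sketched on toy data (column names and bin edges are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
X = pd.DataFrame({
    "age":    rng.integers(18, 80, size=10),
    "income": rng.lognormal(10, 1, size=10),
})

# Bin a continuous variable (useful when the relationship is monotonic but non-linear)
X["age_band"] = pd.cut(X["age"], bins=[0, 30, 50, 120],
                       labels=["young", "mid", "senior"])

# Interaction term for a suspected synergy, with a log transform on the skewed feature
X["age_x_log_income"] = X["age"] * np.log1p(X["income"])

print(X.head())
```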

📋 Analyst EDA Checklist for Linear Regression

  • [ ] Target variable checked for skew and outliers
  • [ ] Features screened for missingness and skew
  • [ ] Relationships to y explored with scatterplots
  • [ ] High-cardinality categoricals reviewed or encoded
  • [ ] Correlation matrix inspected for redundancy
  • [ ] VIF calculated for multicollinearity
  • [ ] Outliers noted and handling strategy drafted

💡 Final Tip

β€œLinear regression is sensitive to structure, not just data. Let EDA guide what transformations and diagnostics come next.”

Use this before: Fitting OLS / Ridge / Lasso models, or validating linearity assumptions visually.