# EDA Guidebook

## Purpose
This guide outlines structured exploratory data analysis (EDA) steps to prepare a dataset for linear regression. It focuses on verifying assumptions, inspecting variable relationships, detecting skewness and outliers, and diagnosing multicollinearity prior to model fitting.
## 1. Confirm Problem Structure

- [ ] Target variable is continuous
- [ ] Input features are mostly numeric (or encoded)
- [ ] No excessive missing values in the target
- [ ] Modeling goal clarified: prediction, inference, or interpretation
## 2. Target Variable Assessment

### Distribution Check

```python
import seaborn as sns

# Histogram with a kernel density overlay for the target
sns.histplot(y, kde=True)
```
- Strong skew? Consider a log or Box-Cox transformation
- Check for outliers or a multi-modal shape
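When the target is strongly right-skewed, the transformations above can be compared directly by their effect on skewness. A minimal sketch on synthetic lognormal data (variable names are illustrative):

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # synthetic right-skewed target

# log1p tolerates zeros; Box-Cox requires strictly positive values
y_log = np.log1p(y)
y_bc, lam = boxcox(y)  # lam is the fitted Box-Cox lambda

print(f"skew raw={skew(y):.2f}  log1p={skew(y_log):.2f}  box-cox={skew(y_bc):.2f}")
```

Re-check skew after transforming; the option that brings it closest to zero is usually the better regression target.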
### Normality Tests (optional)

```python
from scipy.stats import shapiro, normaltest
```
- Normality supports inference (p-values, confidence intervals) but is not required for prediction
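Both tests return a statistic and a p-value; a small p-value suggests a departure from normality. A sketch on synthetic, roughly normal data:

```python
import numpy as np
from scipy.stats import shapiro, normaltest

rng = np.random.default_rng(42)
y = rng.normal(loc=10.0, scale=2.0, size=300)  # synthetic target

stat_sw, p_sw = shapiro(y)     # Shapiro-Wilk, well suited to modest sample sizes
stat_dp, p_dp = normaltest(y)  # D'Agostino-Pearson omnibus test

# Large p-values: no evidence against normality
print(f"Shapiro-Wilk p={p_sw:.3f}, D'Agostino-Pearson p={p_dp:.3f}")
```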
## 3. Feature-Target Relationship Checks

### Correlation Matrix (Numeric)

```python
import seaborn as sns

# Annotated correlation heatmap across numeric columns
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
```
- Identify predictors with strong linear associations to `y`
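One way to shortlist such predictors is to rank features by absolute correlation with the target. A sketch on synthetic data (column names are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
df['y'] = 2.0 * df['x1'] - 1.0 * df['x2'] + rng.normal(scale=0.5, size=n)

# Rank predictors by absolute linear association with y
corr_with_y = df.corr()['y'].drop('y').abs().sort_values(ascending=False)
print(corr_with_y.round(2))
```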
### Scatterplots

```python
import seaborn as sns

sns.scatterplot(x=X['feature'], y=y)
```
- Look for a linear trend, outliers, and changes in variance
### Linearity of Relationship

- Plot each feature against the target (even pre-model); a LOESS trend line helps detect curvature
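A LOESS (lowess) trend can be computed directly with statsmodels; on a curved relationship the smoothed line bends where a straight fit would not. A sketch on deliberately curved synthetic data:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(-3, 3, size=300))
y = x ** 2 + rng.normal(scale=0.5, size=300)  # curved relationship

smoothed = lowess(y, x, frac=0.3)  # returns sorted (x, fitted y) pairs
mid = len(smoothed) // 2

# The fitted trend dips in the middle, revealing the curvature
print(f"ends: {smoothed[0, 1]:.1f}, {smoothed[-1, 1]:.1f}; middle: {smoothed[mid, 1]:.1f}")
```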
## 4. Skewness & Transformation Candidates

### Assess Skew

```python
from scipy.stats import skew

skew(X['feature'])
```
- |skew| > 1.0: consider transforming (log, sqrt, etc.)
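The same check can be run across all numeric features at once to build a shortlist. A sketch with hypothetical column names:

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

rng = np.random.default_rng(3)
X = pd.DataFrame({
    'income': rng.lognormal(mean=10.0, sigma=1.0, size=500),  # right-skewed
    'age': rng.normal(loc=40.0, scale=10.0, size=500),        # roughly symmetric
})

# Features with |skew| > 1.0 become transformation candidates
skews = X.apply(skew)
candidates = skews[skews.abs() > 1.0].index.tolist()
print(skews.round(2).to_dict(), candidates)
```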
### Visualize Distribution

```python
import seaborn as sns

sns.histplot(X['feature'], kde=True)
```
- Use log scale or apply transformation for right-skewed features
## 5. Outlier Detection

### Boxplots

```python
import seaborn as sns

# Horizontal boxplots, one per feature
sns.boxplot(data=X, orient='h')
```
- Visually flag outliers per feature
### Z-Scores (Numerical Outliers)

```python
from scipy.stats import zscore

z = zscore(X.select_dtypes(include='number'))
```
- |z| > 3: potential outlier
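The threshold translates into a row-level flag like this. A sketch on synthetic data with one planted outlier:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

rng = np.random.default_rng(5)
X = pd.DataFrame({'feature': rng.normal(size=300)})
X.loc[0, 'feature'] = 12.0  # plant one obvious outlier

# Column-wise z-scores; flag any row with |z| > 3 in some column
z = zscore(X.select_dtypes(include='number'))
outlier_mask = (np.abs(z) > 3).any(axis=1)
print(f"{int(outlier_mask.sum())} potential outlier row(s)")
```

Note that z-scores assume roughly symmetric features; for heavily skewed columns, consider flagging after transformation instead.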
## 6. Multicollinearity Diagnostics

### Correlation Matrix

- Look for feature pairs with |correlation| > 0.85
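Those pairs can be listed programmatically by keeping the upper triangle of the correlation matrix. A sketch with a deliberately near-duplicate feature (names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
n = 500
x1 = rng.normal(size=n)
X = pd.DataFrame({
    'x1': x1,
    'x1_near_dup': x1 + rng.normal(scale=0.1, size=n),  # nearly duplicate feature
    'x2': rng.normal(size=n),
})

# Keep the upper triangle so each pair appears once, then threshold
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(a, b) for a in upper.index for b in upper.columns if upper.loc[a, b] > 0.85]
print(pairs)
```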
### Variance Inflation Factor (VIF)

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
```
- VIF > 5–10 indicates a multicollinearity issue
### Dimensionality Reduction (optional)
- PCA or feature pruning for highly redundant predictors
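When predictors are redundant, a scree of cumulative explained variance shows how few components carry the signal. A sketch where five features are noisy mixtures of two latent factors:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(13)
n = 400
factors = rng.normal(size=(n, 2))
# Five observed features built from just two underlying factors plus noise
X = factors @ rng.normal(size=(2, 5)) + rng.normal(scale=0.1, size=(n, 5))

# Standardize first so PCA is not dominated by feature scale
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var.round(3))  # the first two components dominate
```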
## 7. Optional Binning / Feature Engineering
- Group continuous variables into bins if the relationship is non-linear but monotonic
- Create interaction terms if synergy between features is suspected
- Apply log transformations where variance or skew is unstable
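The first two ideas can be sketched with pandas; the column names, bin labels, and interaction below are purely illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(17)
df = pd.DataFrame({
    'age': rng.uniform(18, 80, size=200),
    'income': rng.lognormal(mean=10.0, sigma=0.5, size=200),
})

# Quartile bins for a monotonic-but-nonlinear relationship
df['age_bin'] = pd.qcut(df['age'], q=4, labels=['q1', 'q2', 'q3', 'q4'])

# Interaction term for suspected synergy, with a log to tame income's skew
df['age_x_log_income'] = df['age'] * np.log1p(df['income'])
print(df[['age', 'age_bin', 'age_x_log_income']].head())
```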
## Analyst EDA Checklist for Linear Regression
- [ ] Target variable checked for skew and outliers
- [ ] Features screened for missingness and skew
- [ ] Relationships to `y` explored with scatterplots
- [ ] High-cardinality categoricals reviewed or encoded
- [ ] Correlation matrix inspected for redundancy
- [ ] VIF calculated for multicollinearity
- [ ] Outliers noted and handling strategy drafted
## Final Tip

"Linear regression is sensitive to structure, not just data. Let EDA guide what transformations and diagnostics come next."
Use this guide before fitting OLS / Ridge / Lasso models, or before validating linearity assumptions visually.