## Purpose

This guidebook provides a comprehensive overview of exploratory data analysis (EDA) and model-specific tools for linear regression. It focuses first on foundational OLS modeling and interpretation, then expands to cover robust modifications, regularization techniques, and advanced model diagnostics. A separate companion guide will cover visual-only EDA and interpretation guidance.
## Section 1: Ordinary Least Squares (OLS) Regression

### Goal

Model a continuous dependent variable assuming a linear relationship with predictors.
### Key Visuals

- Histogram of residuals
- QQ plot (normality check)
- Residuals vs fitted plot
- Boxplot of residuals by group (for categorical variables)
### Interpretation Checklist

- R² and adjusted R² for variance explained
- Coefficients: interpret as the change in predicted Y per one-unit increase in X, holding other predictors constant
- p-values: assess significance of predictors
- Confidence intervals: assess estimate precision
### Assumption Diagnostics

| Assumption | Visual Tool | What to Look For |
|---|---|---|
| Linearity | Scatterplot / residuals vs fitted | Flat horizontal band |
| Normality | QQ plot / histogram | Points on 45° line / bell shape |
| Homoscedasticity | Residuals vs fitted | Constant spread (no funnel shape) |
| No multicollinearity | Correlation matrix / VIF | Low off-diagonal correlations / VIF < 5 |
## Section 2: Robust and Modified OLS Variants

### Heteroskedasticity-Consistent Standard Errors (HC0–HC3)

These adjustments provide robust standard errors for OLS without changing the coefficient estimates. They are useful when the assumption of homoscedasticity (constant variance) is violated.
#### Purpose

- Maintain valid inference (p-values, confidence intervals) under heteroskedasticity

#### Common Variants

| Method | Description |
|---|---|
| HC0 | White's original robust estimator |
| HC1 | Scaled HC0 (accounts for degrees of freedom) |
| HC2 | Adjusts based on leverage values (better for small n) |
| HC3 | Stronger, jackknife-like correction; often recommended |
#### Example (Python)

```python
import statsmodels.api as sm

model = sm.OLS(y, X).fit()
robust_model = model.get_robustcov_results(cov_type='HC3')
print(robust_model.summary())
```

Useful in applied work when using OLS but facing heteroskedastic residuals.
### Robust Linear Models (RLM)

- Use when outliers or non-normal residuals distort OLS
- Fit via Huber loss or other M-estimators

#### Visual Tools

- Influence plot (Cook's distance)
- Leverage vs residuals plot
- Comparison of OLS vs RLM fits

#### Interpretation

- Coefficients are less sensitive to outliers
- Use to validate the stability of findings from OLS
### Weighted Least Squares (WLS)

- Use when the variance of residuals is not constant
- Weights reduce the impact of high-variance points
## Section 3: Regularization Techniques (Ridge, Lasso, and Elastic Net)

### Purpose

Shrink and regularize coefficients to avoid overfitting or reduce redundancy.

| Method | Penalty | Effect |
|---|---|---|
| Ridge | $\lambda \sum \beta^2$ | Shrinks all coefficients |
| Lasso | $\lambda \sum \vert\beta\vert$ | Shrinks and selects (forces some to 0) |
| Elastic Net | $\lambda\left[ \alpha \sum \vert\beta\vert + (1 - \alpha) \sum \beta^2 \right]$ | Mixes Ridge and Lasso |
### Visual Tools

- Coefficient path plot
- Feature importance ranking
- RMSE vs alpha (validation curve)
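The data behind a coefficient path plot can be generated with scikit-learn's `lasso_path`; a sketch on synthetic data (the feature count and alpha grid are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 5))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=150)

# Coefficients along a grid of penalty strengths: plot each row of coefs vs alphas
alphas, coefs, _ = lasso_path(X, y, alphas=np.logspace(-3, 1, 30))
print(coefs.shape)  # (n_features, n_alphas)
```

At large alphas all coefficients are forced to zero; as alpha shrinks, the truly informative features enter the model first.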
### When to Use

- Many correlated predictors
- Concern about overfitting
- Preference for model parsimony (Lasso)
### Examples

Elastic Net (Python):

```python
from sklearn.linear_model import ElasticNetCV

model = ElasticNetCV(l1_ratio=[.1, .5, .9], alphas=[0.01, 0.1, 1.0], cv=5).fit(X, y)
```

Ridge (Python):

```python
from sklearn.linear_model import RidgeCV

model = RidgeCV(alphas=[0.01, 0.1, 1.0], cv=5).fit(X, y)
```

Lasso (Python):

```python
from sklearn.linear_model import LassoCV

model = LassoCV(alphas=[0.01, 0.1, 1.0], cv=5).fit(X, y)
```
## Section 4: Assumption Tests

Assumption tests help validate the reliability of linear regression results beyond visual inspection. These tests provide statistical evidence for model trustworthiness.

### Normality of Residuals

- Shapiro-Wilk test: tests whether residuals come from a normal distribution
- Kolmogorov-Smirnov test: compares the empirical distribution with a theoretical normal
- Anderson-Darling test: more sensitive to tail behavior
Python example:

```python
from scipy.stats import shapiro, kstest

shapiro(residuals)
# 'norm' is the standard normal, so standardize the residuals first
kstest((residuals - residuals.mean()) / residuals.std(), 'norm')
```
### Homoscedasticity (Equal Variance)

- Breusch-Pagan test: tests whether the variance of residuals depends on the predictors
- White test: a more general test for heteroskedasticity (includes non-linearities)
Python example:

```python
from statsmodels.stats.diagnostic import het_breuschpagan

# Returns (LM statistic, LM p-value, F statistic, F p-value)
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(residuals, model.model.exog)
```
### Autocorrelation of Residuals

- Durbin-Watson test: tests for first-order autocorrelation in residuals
- Ljung-Box test: a more general test for autocorrelation at multiple lags
Python example:

```python
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_ljungbox

durbin_watson(residuals)              # values near 2 indicate no first-order autocorrelation
acorr_ljungbox(residuals, lags=[10])  # Ljung-Box p-values up to lag 10
```
### When to Use
- To confirm visual observations (e.g., funnel shapes or curved patterns)
- Before drawing conclusions from p-values or confidence intervals
- When presenting findings to stakeholders who require rigorous validation
### Interpretation Guide

| Test | Low p-value Means... |
|---|---|
| Shapiro-Wilk / K-S | Residuals are not normally distributed |
| Breusch-Pagan / White | Variance of residuals is not constant |

The Durbin-Watson test returns a statistic rather than a p-value: values near 2 suggest no first-order autocorrelation, while values well below 2 suggest positive autocorrelation.

| Assumption | Test (in addition to visuals) |
|---|---|
| Normality | Shapiro-Wilk, Kolmogorov-Smirnov |
| Heteroskedasticity | Breusch-Pagan, White |
| Autocorrelation | Durbin-Watson, Ljung-Box |

Use these tests to confirm that patterns seen in the diagnostic plots are statistically meaningful.
## Section 5: Residual Types

Residuals represent the difference between observed values and model predictions. Understanding the different types of residuals helps diagnose outliers, leverage points, and model-fit issues.

### Raw Residuals

- Basic difference: $e_i = y_i - \hat{y}_i$
- Used in plots such as residuals vs fitted
### Standardized Residuals

- Raw residuals divided by an estimate of their standard deviation
- Useful for identifying relative deviation
- Rule of thumb: $|\text{standardized residual}| > 2$ may indicate an outlier
### Studentized Residuals

- Each residual is scaled by an estimate of its own standard deviation, which accounts for leverage
- More accurate than standardized residuals for flagging unusual points
- Used for outlier testing and model diagnostics

Python example:

```python
import statsmodels.api as sm

influence = model.get_influence()
studentized_residuals = influence.resid_studentized_internal
```
## Section 6: Interaction Terms
Interaction terms capture effects that emerge only when two predictors are considered together. They help detect whether the relationship between one variable and the outcome depends on another variable.
### Why Use Them
- Uncover non-additive effects
- Model group-specific slopes
- Improve prediction in heterogeneous data
### How to Add Interactions

In model formulas:

- `X1 * X2` includes `X1`, `X2`, and `X1:X2` (main effects plus the interaction)
- `X1:X2` includes only the interaction term
### Example: Income by Gender

```python
import statsmodels.formula.api as smf

model = smf.ols('income ~ experience * gender', data=df).fit()
```
### Visual Interpretation

- Grouped scatter plots with regression lines
- Simple slopes plots (visualize the slope of one variable at levels of another)
- Interaction plots for categorical × continuous combinations
### Model Interpretation Tips

- A significant interaction term means the slope of one variable depends on the level of the other
- Always interpret interaction terms in context with their main effects
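In the income-by-gender example, the interaction coefficient is the *difference* in slopes between groups. A sketch with synthetic data built so the true slopes differ (all names, values, and the seed are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 300
df = pd.DataFrame({
    "experience": rng.uniform(0, 20, size=n),
    "gender": rng.choice(["F", "M"], size=n),
})
slope = np.where(df["gender"] == "M", 3.0, 2.0)  # true slopes differ by group
df["income"] = 10.0 + slope * df["experience"] + rng.normal(scale=2.0, size=n)

fit = smf.ols("income ~ experience * gender", data=df).fit()
print(fit.params["experience"])              # slope for the reference group (F)
print(fit.params["experience:gender[T.M]"])  # additional slope for M, on top of F's
```

The slope for M is the sum of the two printed coefficients, which is why the interaction must be read together with its main effect.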
### Common Pitfalls

- Overfitting with too many interaction terms
- Multicollinearity if predictors are correlated; consider centering variables first
## Section 7: Feature Transformation
Feature transformations are used to fix violations of linear regression assumptions and improve model interpretability and predictive performance.
### When to Use
- Skewed distributions in the target or predictors
- Non-linear relationships with the dependent variable
- Heteroskedasticity (non-constant variance of residuals)
### Common Transformations

| Transformation | When to Use | Effect |
|---|---|---|
| Log | Positively skewed data, exponential growth | Compresses large values |
| Square root | Count data or moderate skew | Reduces range while preserving order |
| Box-Cox | Positive, non-normal distributions | Stabilizes variance and normalizes |
| Yeo-Johnson | Like Box-Cox, but supports zeros and negatives | Flexible for a broader range of data |
| Z-score scaling | Features on different scales | Normalizes for scale-sensitive models |
| Min-max scaling | Keep values between 0 and 1 | Preserves shape, shifts scale |
### Python Examples

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Log transformation (the +1 shift keeps zero values valid)
df['log_x'] = np.log(df['x'] + 1)

# Yeo-Johnson (use method='box-cox' for strictly positive data)
pt = PowerTransformer(method='yeo-johnson')
df['yj_x'] = pt.fit_transform(df[['x']])

# Standardization (z-scores)
scaler = StandardScaler()
df['z_x'] = scaler.fit_transform(df[['x']])
```
### Visual Checks Before/After
- Histograms (check for skew reduction)
- Scatter plots (check linearity improvement)
- Residual vs Fitted (check for constant variance)
### Notes

- Don't transform the test set independently; always fit the transformation on the training data, then apply it to the test set
- Log and Box-Cox require positive inputs; use shifts or Yeo-Johnson when needed
- Check the interpretation impact: transformed variables may lose intuitive meaning
## Section 8: Model Evaluation and Comparison

### Visuals

- Actual vs predicted scatter plot
- Residuals vs predicted values
- RMSE / MAE vs model complexity (degree, alpha, etc.)
### Core Evaluation Metrics

| Metric | Description |
|---|---|
| RMSE | Root mean squared error; penalizes large errors |
| MAE | Mean absolute error; interpretable in the same units as Y |
| R² | Proportion of variance explained by the model |
| Adjusted R² | Adjusts for the number of predictors |
| AIC / BIC | Penalize complexity; used for model selection |
### Train/Test Split

#### Purpose

Split your dataset into training and testing sets to evaluate how well your model generalizes to unseen data.
#### Python Example

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

- `test_size=0.2`: 80/20 split
- `random_state`: ensures reproducibility
#### Evaluate on Test Set

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # works across sklearn versions (squared=False is deprecated)
r2 = r2_score(y_test, y_pred)
```
### Best Practices
- Always scale and fit only on the training set, then transform the test set
- Don't tune on the test set โ use validation sets or cross-validation
- Helps detect overfitting and underfitting
### Optional Extensions
- K-Fold Cross-Validation for small datasets
- Stratified Split if working with classification-type grouping
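The K-fold extension can be sketched with scikit-learn's `cross_val_score` (the model, fold count, and synthetic data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=120)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
print(-scores.mean())  # average out-of-fold RMSE (sklearn negates error scores by convention)
```

Each observation is held out exactly once, which makes better use of a small dataset than a single train/test split.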
### Summary Table

| Component | Tool / Concept | Purpose |
|---|---|---|
| OLS fit | Residual, QQ, histogram plots | Assumption validation |
| Robust methods | RLM, WLS, HC3 | Handle outliers or variance changes |
| Regularization | Ridge, Lasso, Elastic Net | Reduce overfitting, enhance generalization |
| Statistical tests | Shapiro-Wilk, Breusch-Pagan, Durbin-Watson | Quantify assumption fit |
| Residual types | Standardized, studentized | Outlier and leverage diagnostics |
| Interaction terms | Grouped slope plots | Detect interaction effects |
| Model selection | AIC, BIC, RMSE, validation curves | Evaluate and compare models |
### Related Notes

- [[Links]]