Advanced Feature Transformation
๐ฏ Purpose¶
This guide outlines advanced transformation strategies tailored to model assumptions, statistical structure, and production constraints. It builds on foundational transformations by introducing model-aware workflows for linearity, encoding, normalization, and target stability.
๐ง 1. Model-Aware Transformation Strategy¶
Model Type | Transformation Priority |
---|---|
OLS / GLM | Linearity in features, reduced skew |
Logistic Regression | Linear logit, interpretable effects |
Ridge/Lasso | Standardized scale, no multicollinearity |
Tree-Based Models | Optional log/smooth, handle categories |
Distance-Based | Strong normalization (e.g., KMeans, SVM) |
โ๏ธ Tailor transformations to the model's core assumptions and fit behavior
๐ 2. Logit and Linearization Prep¶
๐น Log-Linear Relationships¶
import numpy as np
df['log_income'] = np.log1p(df['income'])
โ๏ธ Use log/sqrt/yeo-johnson when linearity is poor in scatterplots
๐น Box-Tidwell Interaction (Logit Linearity)¶
# test: X * log(X) โ add to logistic formula
โ๏ธ Indicates violation of logit linearity assumption
๐งฎ 3. Polynomial and Interaction Terms¶
๐น Polynomial Expansion¶
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
โ๏ธ Consider centering before expanding to reduce VIF
๐น Interaction Engineering¶
- Multiply important features (domain pairs)
- Use EDA to inform nonlinear relationships
๐ 4. Target Transformation (Regression)¶
Problem | Solution |
---|---|
Skewed target (y ) |
Log or Yeo-Johnson |
Heteroskedasticity | Log-transform to stabilize noise |
Multimodal target | Consider binning or modeling tails separately |
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
y_transformed = pt.fit_transform(y.values.reshape(-1, 1))
โ๏ธ Track reversibility for downstream interpretation
๐ฆ 5. Advanced Encoding Strategies¶
Encoding Type | Use Case |
---|---|
One-Hot | Small fixed vocab (nominal) |
Ordinal | Ranked levels (risk tiers, survey) |
Target/Impact | High-cardinality categorical |
WOE (Logistic only) | Binary target predictive encoding |
# Target Encoding Example
grouped = df.groupby('category')['target'].mean()
df['encoded'] = df['category'].map(grouped)
โ๏ธ Apply cross-validation during encoding to avoid leakage
โฑ 6. Temporal Feature Expansion¶
Strategy | Purpose |
---|---|
Lag Features | Capture recent trends |
Rolling Averages | Smooth volatility |
Expanding Windows | Cumulative behavior |
Cyclical Decomposition | Hour, day, week via sin/cos |
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
โ๏ธ Validate timezone alignment and seasonal periodicity
๐งช 7. Transformation Evaluation Tools¶
๐น Skew Reduction¶
from scipy.stats import skew
skew_before = skew(df['feature'])
skew_after = skew(np.log1p(df['feature']))
๐น Normality Diagnostic¶
- QQ plots before/after transformation
- Histogram or KDE with
log1p
,boxcox
,yeo-johnson
๐น Multicollinearity Check¶
from statsmodels.stats.outliers_influence import variance_inflation_factor
โ๏ธ Run VIF post-transformation to verify decorrelation
๐ Advanced Transformation Checklist¶
- [ ] Continuous features reviewed for skew and log-scale applicability
- [ ] Target variable normalized if appropriate (log, Box-Cox)
- [ ] Polynomial features added with centering or regularization
- [ ] Categorical encoding strategy documented and justified
- [ ] Temporal columns expanded (lags, cycles, rolling)
- [ ] Multicollinearity reviewed post-expansion (VIF)
- [ ] All transforms logged and reversible where needed
๐ก Final Tip¶
โAdvanced transformations arenโt cosmetic โ they let your model speak the language your data actually needs.โ
Use before: model fitting, residual diagnostics, or feature importance interpretation.