Advanced Feature Transformation

🎯 Purpose¶

This guide outlines advanced transformation strategies tailored to model assumptions, statistical structure, and production constraints. It builds on foundational transformations by introducing model-aware workflows for linearity, encoding, normalization, and target stability.

🧠 1. Model-Aware Transformation Strategy¶

Model Type	Transformation Priority
OLS / GLM	Linearity in features, reduced skew
Logistic Regression	Linear logit, interpretable effects
Ridge/Lasso	Standardized scale, no multicollinearity
Tree-Based Models	Optional log/smooth, handle categories
Distance-Based	Strong normalization (e.g., KMeans, SVM)

✔️ Tailor transformations to the model's core assumptions and fit behavior

📈 2. Logit and Linearization Prep¶

🔹 Log-Linear Relationships¶

import numpy as np
df['log_income'] = np.log1p(df['income'])

✔️ Use log/sqrt/yeo-johnson when linearity is poor in scatterplots

🔹 Box-Tidwell Interaction (Logit Linearity)¶

# test: X * log(X) → add to logistic formula

✔️ Indicates violation of logit linearity assumption

🧮 3. Polynomial and Interaction Terms¶

🔹 Polynomial Expansion¶

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

✔️ Consider centering before expanding to reduce VIF

🔹 Interaction Engineering¶

Multiply important features (domain pairs)
Use EDA to inform nonlinear relationships

🔁 4. Target Transformation (Regression)¶

Problem	Solution
Skewed target (`y`)	Log or Yeo-Johnson
Heteroskedasticity	Log-transform to stabilize noise
Multimodal target	Consider binning or modeling tails separately

from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
y_transformed = pt.fit_transform(y.values.reshape(-1, 1))

✔️ Track reversibility for downstream interpretation

📦 5. Advanced Encoding Strategies¶

Encoding Type	Use Case
One-Hot	Small fixed vocab (nominal)
Ordinal	Ranked levels (risk tiers, survey)
Target/Impact	High-cardinality categorical
WOE (Logistic only)	Binary target predictive encoding

# Target Encoding Example
grouped = df.groupby('category')['target'].mean()
df['encoded'] = df['category'].map(grouped)

✔️ Apply cross-validation during encoding to avoid leakage

⏱ 6. Temporal Feature Expansion¶

Strategy	Purpose
Lag Features	Capture recent trends
Rolling Averages	Smooth volatility
Expanding Windows	Cumulative behavior
Cyclical Decomposition	Hour, day, week via sin/cos

df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

✔️ Validate timezone alignment and seasonal periodicity

🧪 7. Transformation Evaluation Tools¶

🔹 Skew Reduction¶

from scipy.stats import skew
skew_before = skew(df['feature'])
skew_after = skew(np.log1p(df['feature']))

🔹 Normality Diagnostic¶

QQ plots before/after transformation
Histogram or KDE with log1p, boxcox, yeo-johnson

🔹 Multicollinearity Check¶

from statsmodels.stats.outliers_influence import variance_inflation_factor

✔️ Run VIF post-transformation to verify decorrelation

📋 Advanced Transformation Checklist¶

[ ] Continuous features reviewed for skew and log-scale applicability
[ ] Target variable normalized if appropriate (log, Box-Cox)
[ ] Polynomial features added with centering or regularization
[ ] Categorical encoding strategy documented and justified
[ ] Temporal columns expanded (lags, cycles, rolling)
[ ] Multicollinearity reviewed post-expansion (VIF)
[ ] All transforms logged and reversible where needed

💡 Final Tip¶

“Advanced transformations aren’t cosmetic — they let your model speak the language your data actually needs.”

Use before: model fitting, residual diagnostics, or feature importance interpretation.