Advanced Feature Transformation


🎯 Purpose

This guide outlines advanced transformation strategies tailored to model assumptions, statistical structure, and production constraints. It builds on foundational transformations by introducing model-aware workflows for linearity, encoding, normalization, and target stability.


🧠 1. Model-Aware Transformation Strategy

| Model Type | Transformation Priority |
|---|---|
| OLS / GLM | Linearity in features, reduced skew |
| Logistic Regression | Linear logit, interpretable effects |
| Ridge / Lasso | Standardized scale, low multicollinearity |
| Tree-Based Models | Optional log/smoothing, handle categories |
| Distance-Based (e.g., KMeans, SVM) | Strong normalization |

✔️ Tailor transformations to the model's core assumptions and fit behavior
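
As one illustration of matching a transformation to a model family, the sketch below wraps standardization and a Ridge regressor in a single scikit-learn pipeline. The data is synthetic and the `alpha` value is an arbitrary assumption, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (illustrative data only).
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = 2.0 * X[:, 0] + 0.003 * X[:, 1] + rng.normal(0, 0.1, 200)

# Scaling lives inside the pipeline, so it is fit together with the model
# and re-applied automatically at prediction time.
ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_model.fit(X, y)

# Inspect what the regressor actually saw: zero mean, unit variance per column.
scaled = ridge_model.named_steps["standardscaler"].transform(X)
```

Keeping the scaler inside the pipeline also means cross-validation refits it on each training fold, which avoids leaking test-fold statistics.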


📈 2. Logit and Linearization Prep

🔹 Log-Linear Relationships

import numpy as np
df['log_income'] = np.log1p(df['income'])

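✔️ Use log, sqrt, or Yeo-Johnson transforms when scatterplots show a curved rather than linear relationship

🔹 Box-Tidwell Interaction (Logit Linearity)

# test: add an X * log(X) term (X > 0) to the logistic formula and check its p-value

✔️ A significant X * log(X) coefficient indicates a violation of the logit linearity assumption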


🧮 3. Polynomial and Interaction Terms

🔹 Polynomial Expansion

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

✔️ Center features before expanding: this reduces the correlation between x and x^2, which lowers VIF
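
To see why centering helps, the sketch below compares the correlation between a feature and its square before and after centering. It uses a synthetic, roughly symmetric feature, and correlation serves here as a simple stand-in for VIF.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10, 20, 500)  # synthetic feature located far from zero

# Raw scale: x and x^2 move almost in lockstep.
corr_raw = np.corrcoef(x, x**2)[0, 1]

# Centered scale: the squared term now captures curvature, not level.
x_centered = x - x.mean()
corr_centered = np.corrcoef(x_centered, x_centered**2)[0, 1]

print(f"corr(x, x^2) raw: {corr_raw:.3f}, centered: {corr_centered:.3f}")
```

The reduction is strongest for roughly symmetric features; a heavily skewed feature keeps some correlation with its square even after centering.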

🔹 Interaction Engineering

  • Multiply feature pairs that domain knowledge suggests interact
  • Use EDA to identify nonlinear relationships worth encoding

🔁 4. Target Transformation (Regression)

| Problem | Solution |
|---|---|
| Skewed target (y) | Log or Yeo-Johnson |
| Heteroskedasticity | Log-transform to stabilize variance |
| Multimodal target | Consider binning or modeling tails separately |

from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
y_transformed = pt.fit_transform(y.values.reshape(-1, 1))

✔️ Keep the fitted transformer so predictions can be mapped back to the original scale (pt.inverse_transform) for downstream interpretation
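
One way to keep the target transform reversible without manual bookkeeping is scikit-learn's `TransformedTargetRegressor`, sketched below on synthetic data. The choice of `LinearRegression` as the inner model is an assumption made for brevity.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

# Synthetic right-skewed, strictly positive target (illustrative only).
y = np.exp(1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 0.1, 300))

# The wrapper transforms y before fitting and inverts the transform
# inside predict(), so callers never handle the transformed scale.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=PowerTransformer(method="yeo-johnson"),
)
model.fit(X, y)

y_pred = model.predict(X)  # already on the original scale of y
```

The fitted transformer remains available as `model.transformer_` if the transformed scale is needed for residual diagnostics.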


📦 5. Advanced Encoding Strategies

| Encoding Type | Use Case |
|---|---|
| One-Hot | Small fixed vocabulary (nominal) |
| Ordinal | Ranked levels (risk tiers, survey responses) |
| Target/Impact | High-cardinality categorical |
| WOE (logistic only) | Predictive encoding for a binary target |

# Target Encoding Example (naive: each category mean includes the row being encoded)
grouped = df.groupby('category')['target'].mean()
df['encoded'] = df['category'].map(grouped)

✔️ Compute encodings out-of-fold during cross-validation so a row's own target never leaks into its encoding
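
A minimal sketch of out-of-fold target encoding is shown below. The toy data, the fold count, and the fallback to the fitting-fold mean for unseen categories are all assumptions; production encoders typically add smoothing as well.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy data for illustration only.
df = pd.DataFrame({
    "category": list("aabbccaabbcc"),
    "target": [1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1],
})
df["encoded_oof"] = np.nan

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    fit_rows = df.iloc[fit_idx]
    # Category means come only from rows outside the fold being encoded.
    fold_means = fit_rows.groupby("category")["target"].mean()
    encoded = df["category"].iloc[enc_idx].map(fold_means)
    # Categories unseen in the fitting rows fall back to their overall mean.
    encoded = encoded.fillna(fit_rows["target"].mean())
    df.loc[df.index[enc_idx], "encoded_oof"] = encoded.to_numpy()
```

At inference time, new rows are usually encoded with means computed from the full training set, since their targets are not part of it.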


⏱ 6. Temporal Feature Expansion

| Strategy | Purpose |
|---|---|
| Lag Features | Capture recent trends |
| Rolling Averages | Smooth volatility |
| Expanding Windows | Cumulative behavior |
| Cyclical Decomposition | Hour, day, week via sin/cos |

df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

✔️ Validate timezone alignment and seasonal periodicity
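
The remaining strategies from the table (lags, rolling averages, expanding windows) can be sketched with pandas as follows. The `sales` series and the window sizes are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series, used only for illustration.
ts = pd.DataFrame(
    {"sales": [10.0, 12.0, 11.0, 15.0, 14.0, 18.0]},
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# Shift first so each row only sees strictly earlier values (no look-ahead).
past = ts["sales"].shift(1)
ts["lag_1"] = past
ts["roll_mean_3"] = past.rolling(window=3).mean()
ts["expanding_mean"] = past.expanding().mean()

# Day-of-week as a cycle with period 7.
dow = ts.index.dayofweek.to_numpy()
ts["dow_sin"] = np.sin(2 * np.pi * dow / 7)
ts["dow_cos"] = np.cos(2 * np.pi * dow / 7)
```

The leading rows contain NaN because no history exists yet; whether to drop or impute them depends on the model.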


🧪 7. Transformation Evaluation Tools

🔹 Skew Reduction

import numpy as np
from scipy.stats import skew
skew_before = skew(df['feature'])  # skewness on the raw scale
skew_after = skew(np.log1p(df['feature']))  # skewness after log1p

🔹 Normality Diagnostic

  • QQ plots before/after transformation
  • Histogram or KDE after log1p, Box-Cox, or Yeo-Johnson
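
When a plot is not practical (for example in a batch job), the QQ comparison can be reduced to a number. The sketch below uses the correlation coefficient returned by `scipy.stats.probplot` on synthetic log-normal data; treating that coefficient as a normality score is a simplification.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
feature = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # synthetic, right-skewed

# probplot returns the QQ points plus a least-squares fit; r close to 1
# means the sample quantiles track the normal quantiles closely.
_, (_, _, r_before) = stats.probplot(feature, dist="norm")
_, (_, _, r_after) = stats.probplot(np.log1p(feature), dist="norm")

# Yeo-Johnson with a maximum-likelihood lambda, as a third candidate.
yj_values, yj_lambda = stats.yeojohnson(feature)
skew_raw = stats.skew(feature)
skew_yj = stats.skew(yj_values)

print(f"QQ r before: {r_before:.3f}, after log1p: {r_after:.3f}")
print(f"skew raw: {skew_raw:.2f}, after Yeo-Johnson: {skew_yj:.2f}")
```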

🔹 Multicollinearity Check

from statsmodels.stats.outliers_influence import variance_inflation_factor

✔️ Run VIF after transformation to confirm that multicollinearity has actually dropped


📋 Advanced Transformation Checklist

  • [ ] Continuous features reviewed for skew and log-scale applicability
  • [ ] Target variable normalized if appropriate (log, Box-Cox)
  • [ ] Polynomial features added with centering or regularization
  • [ ] Categorical encoding strategy documented and justified
  • [ ] Temporal columns expanded (lags, cycles, rolling)
  • [ ] Multicollinearity reviewed post-expansion (VIF)
  • [ ] All transforms logged and reversible where needed

💡 Final Tip

“Advanced transformations aren’t cosmetic; they let your model speak the language your data actually needs.”

Use before: model fitting, residual diagnostics, or feature importance interpretation.