# Feature Transformation

## 🎯 Purpose

This guide outlines essential transformations to enhance model performance, enforce model assumptions, and improve feature interpretability. It covers normalization, encoding, binning, polynomial expansion, and nonlinear transformations, supporting regression, classification, clustering, and time series.
## ⚙️ 1. Scaling and Normalization
| Method | Use When |
|---|---|
| StandardScaler | Distance-based models (KMeans, SVM, PCA) |
| MinMaxScaler | A bounded output in [0, 1] is required |
| RobustScaler | Skewed data or outliers are present |
| MaxAbsScaler | Sparse data, or features already centered |
Snippet:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

⚠️ Always scale after the train/test split: fit on the training set only, then transform the test set
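To make the warning above concrete, here is a minimal sketch of fitting the scaler on the training split only (the toy `X` and `y` are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data; in practice X and y come from your dataset
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit statistics on train only
X_test_scaled = scaler.transform(X_test)        # reuse train mean/std on test
```

Fitting on the full dataset would leak test-set statistics into training.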
## 🧮 2. Encoding Categorical Features
| Type | Best For | Notes |
|---|---|---|
| One-Hot | Low-cardinality nominal variables | Avoid with many unique categories |
| Ordinal | Ordered categories (e.g., rating 1-5) | Use only when the order is meaningful |
| Target Encoding | High-cardinality labels | Risk of leakage without cross-validation |
| Binary Hashing | Textual categorical compression | Good for NLP or wide tables |
Snippet:

```python
import pandas as pd

pd.get_dummies(df['color'], drop_first=True)
```

⚠️ Always validate the shape and memory usage after encoding
## 📊 3. Binning and Grouping
| Method | Use Case |
|---|---|
| Equal-width bins | Compress continuous values |
| Quantile bins | Rank-normalized distribution |
| Custom bins | Domain-knowledge segmentation |
| Group rare levels | Categorical reduction before encoding |
Example:

```python
import pandas as pd

pd.qcut(df['age'], q=4, labels=False)
```

⚠️ Useful for decision trees and interpretable dashboards
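Custom bins from domain knowledge can be sketched with `pd.cut` (the edges and labels below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'age': [5, 17, 25, 42, 68, 80]})  # toy data

# Hypothetical domain-driven segmentation; intervals are right-inclusive by default
bins = [0, 18, 35, 60, 120]
labels = ['minor', 'young_adult', 'adult', 'senior']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)
```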
## 📈 4. Polynomial & Interaction Features
| Transformation | Benefit |
|---|---|
| Polynomial terms | Capture curvature (e.g., x²) |
| Interactions | Capture synergy (x1 * x2) |
| Splines | Smooth piecewise linearity |
| Log transforms | Linearize exponential relationships |
Snippet:

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
```

⚠️ Beware of multicollinearity and feature explosion
## 🔁 5. Nonlinear & Power Transforms
| Method | Use Case |
|---|---|
| Log | Right-skewed, multiplicative features |
| Sqrt | Positive-only data with mild skew |
| Box-Cox | Normalize and stabilize variance (positive values only) |
| Yeo-Johnson | Like Box-Cox, but supports zeros and negatives |
Snippet:

```python
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson')
X_trans = pt.fit_transform(X)
```

⚠️ Review skew before and after applying transformations
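The before/after skew check suggested above can be sketched with `scipy.stats.skew` on toy right-skewed data:

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))  # right-skewed toy data

pt = PowerTransformer(method='yeo-johnson')
X_trans = pt.fit_transform(X)

print(f"skew before: {skew(X[:, 0]):.2f}, after: {skew(X_trans[:, 0]):.2f}")
```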
## 🧪 6. Time Features (Temporal Modeling)
| Task | Transformation |
|---|---|
| Extract components | Hour, day, month, year |
| Cyclical encoding | sin/cos for hour, day of week |
| Rolling aggregates | Mean, std, min, max over time windows |
| Lag features | Previous value(s) of the same variable |
Example:

```python
import numpy as np

df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
```

⚠️ Consider time zones when deriving temporal features
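The lag and rolling-aggregate rows from the table above can be sketched with pandas (the toy `df` and column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(10, dtype=float)})  # toy ordered series

df['lag_1'] = df['value'].shift(1)                        # previous observation
df['roll_mean_3'] = df['value'].rolling(window=3).mean()  # 3-step rolling mean
```

The first rows are NaN where no history exists; drop or impute them before modeling.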