Feature Transformation


๐ŸŽฏ Purpose

This guide outlines essential transformations to enhance model performance, satisfy model assumptions, and improve feature interpretability. It covers scaling and normalization, encoding, binning, polynomial expansion, nonlinear transformations, and time-based features, supporting regression, classification, clustering, and time series work.


โš–๏ธ 1. Scaling and Normalization

| Method | Use When |
|---|---|
| StandardScaler | Distance-based models (KMeans, SVM, PCA) |
| MinMaxScaler | Bounded output in [0, 1] is required |
| RobustScaler | Skewed data or outliers |
| MaxAbsScaler | Sparse data, or features already centered |

Snippet:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

โœ”๏ธ Always scale after train/test split (fit only on train!)


๐Ÿงฎ 2. Encoding Categorical Features

| Type | Best For | Notes |
|---|---|---|
| One-Hot | Low-cardinality nominal variables | Avoid with many unique categories |
| Ordinal | Ordered categories (e.g., rating 1-5) | Use only when the order is meaningful |
| Target Encoding | High-cardinality categorical features | Risk of leakage without cross-validation |
| Binary / Hashing | Compressing high-cardinality categoricals | Good for NLP or wide tables |

Snippet:

```python
import pandas as pd

pd.get_dummies(df['color'], drop_first=True)
```

โœ”๏ธ Always validate post-encoding shape and memory usage


๐Ÿ“ 3. Binning and Grouping

| Method | Use Case |
|---|---|
| Equal-width bins | Compress continuous values into fixed-width ranges |
| Quantile bins | Rank-normalized distribution |
| Custom bins | Domain-knowledge segmentation |
| Group rare levels | Categorical reduction before encoding |

Example:

```python
import pandas as pd

pd.qcut(df['age'], q=4, labels=False)
```

โœ”๏ธ Useful for decision trees and interpretable dashboards


๐Ÿ” 4. Polynomial & Interaction Features

| Transformation | Benefit |
|---|---|
| Polynomial terms | Capture curvature (e.g., x²) |
| Interactions | Capture synergy (x1 * x2) |
| Splines | Smooth piecewise-polynomial fits |
| Log transforms | Linearize exponential relationships |

Snippet:

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
```

โœ”๏ธ Beware of multicollinearity and feature explosion


๐Ÿ”„ 5. Nonlinear & Power Transforms

| Method | Use Case |
|---|---|
| Log | Right-skewed features, multiplicative relationships |
| Sqrt | Positive-only data, mild skew |
| Box-Cox | Normalize and stabilize variance (positive values only) |
| Yeo-Johnson | Like Box-Cox, but supports zeros and negatives |

Snippet:

```python
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson')
X_trans = pt.fit_transform(X)
```

โœ”๏ธ Review skew before and after applying transformations


๐Ÿงช 6. Time Features (Temporal Modeling)

| Task | Transformation |
|---|---|
| Extract components | Hour, day, month, year |
| Cyclical encoding | sin/cos for hour, day of week |
| Rolling aggregates | Mean, std, min, max over time windows |
| Lag features | Previous value(s) of the same variable |

Example:

```python
import numpy as np

df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
```

โœ”๏ธ Consider time zo