Skip to content

Encoding Strategy


๐ŸŽฏ Purpose

Use this card to decide how to encode categorical variables based on model type, field cardinality, and downstream interpretability needs.


๐Ÿ”ค 1. Encoding Options Overview

Encoding Method Description
One-Hot Encoding Creates binary columns for each category (no rank assumed)
Label Encoding Converts categories to numeric integers (rank implied!)
Ordinal Encoding Assigns ordered numeric values based on known category hierarchy
Binary Encoding Hashes categories into fewer binary digits (efficient for high cardinality)
Target / Mean Encoding Replaces category with target mean (use with caution, risk of leakage)

๐Ÿงญ 2. When to Use Which

Situation Recommended Encoding
Low-cardinality categorical (e.g. gender) โœ… One-Hot
Tree-based models (RF, XGB, LGBM) โœ… Label or Ordinal
Linear models (LR, Ridge) โœ… One-Hot (to avoid rank assumption)
Ordinal relationship exists (e.g. 'low', 'med', 'high') โœ… Ordinal Encoding
High-cardinality (>15 categories) โœ… Binary or Target Encoding

๐Ÿงช 3. Tooling Examples

# One-hot encoding
pd.get_dummies(df['color'])

# Label encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['encoded'] = le.fit_transform(df['state'])

# Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['low', 'med', 'high']])
df['ordered'] = oe.fit_transform(df[['priority']])

โš ๏ธ 4. Watch For

  • Label encoding injects ordinal structure โ€” avoid with linear models
  • One-hot encoding can explode feature space if too many categories
  • Target encoding leaks target unless cross-validated or regularized

โœ… Decision Checklist

  • [ ] Cardinality of each categorical feature reviewed
  • [ ] Downstream model type considered
  • [ ] Interpretability and feature explosion balanced
  • [ ] Rank relationships respected (ordinal vs nominal)
  • [ ] Target encoding used safely (with validation splits)

๐Ÿ’ก Tip

โ€œHow you encode today decides what your model assumes tomorrow.โ€