Encoding Strategy
Purpose
Use this card to decide how to encode categorical variables based on model type, field cardinality, and downstream interpretability needs.
1. Encoding Options Overview
| Encoding Method | Description |
|---|---|
| One-Hot Encoding | Creates binary columns for each category (no rank assumed) |
| Label Encoding | Maps each category to an arbitrary integer code (implies a rank that may not exist!) |
| Ordinal Encoding | Assigns ordered numeric values based on known category hierarchy |
| Binary Encoding | Maps each category to an integer, then spreads that integer's binary digits across a few 0/1 columns (efficient for high cardinality) |
| Target / Mean Encoding | Replaces category with target mean (use with caution, risk of leakage) |
2. When to Use Which
| Situation | Recommended Encoding |
|---|---|
| Low-cardinality categorical (e.g. gender) | One-Hot |
| Tree-based models (RF, XGB, LGBM) | Label or Ordinal |
| Linear models (LR, Ridge) | One-Hot (to avoid rank assumption) |
| Ordinal relationship exists (e.g. 'low', 'med', 'high') | Ordinal Encoding |
| High-cardinality (>15 categories) | Binary or Target Encoding |
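The table can be read as a small decision function. The sketch below is one possible reading, not a definitive rule: the function name `recommend_encoding`, its parameters, and the order in which the rows are checked (ordinal first, then cardinality, then model family) are all assumptions made for illustration. The threshold of 15 comes from the table.

```python
def recommend_encoding(n_categories, model_family, is_ordered=False,
                       high_cardinality_threshold=15):
    """Suggest an encoding by walking the table rows in an assumed order.

    model_family is a hypothetical label: "tree" or "linear".
    """
    # A known category hierarchy takes priority (assumption of this sketch)
    if is_ordered:
        return "ordinal"
    # Too many categories for one column each
    if n_categories > high_cardinality_threshold:
        return "binary or target"
    # Tree-based models split on thresholds, so integer codes are acceptable
    if model_family == "tree":
        return "label or ordinal"
    # Linear models and the low-cardinality default
    return "one-hot"


suggestion = recommend_encoding(n_categories=3, model_family="linear")
```

When rows conflict (for example a high-cardinality field feeding a tree model), the priority order above is a judgment call and may need adjusting for a given project.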
3. Tooling Examples
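import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# One-hot encoding: one indicator column per category
color_dummies = pd.get_dummies(df['color'], prefix='color')

# Label encoding: arbitrary integer codes (classes are sorted alphabetically).
# scikit-learn intends LabelEncoder for target labels; for input features,
# OrdinalEncoder is the usual choice.
le = LabelEncoder()
df['encoded'] = le.fit_transform(df['state'])

# Ordinal encoding: the explicit category list defines the order low < med < high
oe = OrdinalEncoder(categories=[['low', 'med', 'high']])
df['ordered'] = oe.fit_transform(df[['priority']]).ravel()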
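Binary encoding has no example above, so here is a minimal sketch of the idea using only pandas. The helper name `binary_encode` and the sample `cities` data are hypothetical; third-party packages such as `category_encoders` offer a ready-made encoder, and their exact column layout may differ from this one.

```python
import pandas as pd


def binary_encode(series, prefix):
    """Binary-encode a categorical Series into roughly log2(n) 0/1 columns."""
    # Integer code per category in order of first appearance, starting at 1
    # (missing values get code 0 because factorize marks them as -1)
    codes, uniques = pd.factorize(series)
    codes = codes + 1
    # Number of bits needed to represent the largest code
    n_bits = max(1, len(uniques).bit_length())
    # Extract each bit, most significant first
    bits = {
        f"{prefix}_{i}": (codes >> (n_bits - 1 - i)) & 1
        for i in range(n_bits)
    }
    return pd.DataFrame(bits, index=series.index)


cities = pd.Series(["NYC", "LA", "SF", "NYC", "CHI", "BOS"])
encoded = binary_encode(cities, "city")  # 5 categories fit in 3 columns
```

Five categories need three columns here, where one-hot encoding would need five. The trade-off is that individual bit columns carry no meaning on their own.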
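4. Watch For
- Label encoding imposes an arbitrary order on nominal categories; avoid it with linear models, which treat the codes as magnitudes
- One-hot encoding adds one column per category, so high-cardinality fields can explode the feature space
- Target encoding leaks the target into the features unless the category means are computed out-of-fold (cross-validated) or regularized (e.g. smoothed toward the global mean)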
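The target-encoding warning is easier to see in code. The sketch below combines both safeguards: each row is encoded using means computed only from the other folds, and each category mean is shrunk toward the global mean. The function name `target_encode_oof`, the `smoothing` weight of 10, and the toy data are assumptions for illustration. Recent scikit-learn releases also include a `TargetEncoder` that cross-fits internally.

```python
import numpy as np
import pandas as pd


def target_encode_oof(df, col, target, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold, smoothed mean encoding of `col` against `target`."""
    rng = np.random.default_rng(seed)
    # Random, balanced fold assignment for every row
    fold_ids = rng.permutation(len(df)) % n_splits
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for fold in range(n_splits):
        held_out = fold_ids == fold
        fit_part = df[~held_out]
        prior = fit_part[target].mean()
        stats = fit_part.groupby(col)[target].agg(["sum", "count"])
        # Shrink each category mean toward the prior; rare categories shrink more
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the fitting folds fall back to the prior
        values = df.loc[held_out, col].map(smoothed).fillna(prior)
        encoded.loc[held_out] = values.to_numpy()
    return encoded


toy = pd.DataFrame({
    "city": ["a", "a", "b", "b", "a", "b", "a", "b", "a", "b"],
    "y":    [1,   1,   0,   0,   1,   0,   1,   0,   1,   0],
})
toy["city_te"] = target_encode_oof(toy, "city", "y")
```

No row's own target value contributes to its encoded feature, which is the property that prevents leakage. At prediction time, the encoding for new data would be fitted on the full training set.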
Decision Checklist
- [ ] Cardinality of each categorical feature reviewed
- [ ] Downstream model type considered
- [ ] Interpretability and feature explosion balanced
- [ ] Rank relationships respected (ordinal vs nominal)
- [ ] Target encoding used safely (with validation splits)
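The first checklist item can be done with a quick cardinality report. This is a minimal sketch under two assumptions: every non-numeric column is treated as categorical, and the cutoff of 15 from section 2 marks high cardinality. The name `cardinality_report` and the sample frame are hypothetical.

```python
import pandas as pd


def cardinality_report(df, high_cardinality_threshold=15):
    """Count distinct values per non-numeric column and flag the large ones."""
    # Assumption: anything that is not numeric is a categorical candidate
    cat_cols = [c for c in df.columns
                if not pd.api.types.is_numeric_dtype(df[c])]
    report = pd.DataFrame({"n_unique": df[cat_cols].nunique()})
    report["high_cardinality"] = report["n_unique"] > high_cardinality_threshold
    return report.sort_values("n_unique", ascending=False)


sample = pd.DataFrame({
    "state": [f"s{i}" for i in range(20)],
    "color": ["r", "g", "b"] * 6 + ["r", "g"],
    "price": range(20),
})
report = cardinality_report(sample)
```

Integer-coded categoricals (for example numeric IDs) would be missed by this filter and need to be listed by hand.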
Tip
“How you encode today decides what your model assumes tomorrow.”