Encoding Strategy
๐ฏ Purpose¶
Use this card to decide how to encode categorical variables based on model type, field cardinality, and downstream interpretability needs.
๐ค 1. Encoding Options Overview¶
Encoding Method | Description |
---|---|
One-Hot Encoding | Creates binary columns for each category (no rank assumed) |
Label Encoding | Converts categories to numeric integers (rank implied!) |
Ordinal Encoding | Assigns ordered numeric values based on known category hierarchy |
Binary Encoding | Hashes categories into fewer binary digits (efficient for high cardinality) |
Target / Mean Encoding | Replaces category with target mean (use with caution, risk of leakage) |
๐งญ 2. When to Use Which¶
Situation | Recommended Encoding |
---|---|
Low-cardinality categorical (e.g. gender) | โ One-Hot |
Tree-based models (RF, XGB, LGBM) | โ Label or Ordinal |
Linear models (LR, Ridge) | โ One-Hot (to avoid rank assumption) |
Ordinal relationship exists (e.g. 'low', 'med', 'high') | โ Ordinal Encoding |
High-cardinality (>15 categories) | โ Binary or Target Encoding |
๐งช 3. Tooling Examples¶
# One-hot encoding
pd.get_dummies(df['color'])
# Label encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['encoded'] = le.fit_transform(df['state'])
# Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['low', 'med', 'high']])
df['ordered'] = oe.fit_transform(df[['priority']])
โ ๏ธ 4. Watch For¶
- Label encoding injects ordinal structure โ avoid with linear models
- One-hot encoding can explode feature space if too many categories
- Target encoding leaks target unless cross-validated or regularized
โ Decision Checklist¶
- [ ] Cardinality of each categorical feature reviewed
- [ ] Downstream model type considered
- [ ] Interpretability and feature explosion balanced
- [ ] Rank relationships respected (ordinal vs nominal)
- [ ] Target encoding used safely (with validation splits)
๐ก Tip¶
โHow you encode today decides what your model assumes tomorrow.โ