Outlier Action
🎯 Purpose
Use this decision card to decide how to handle flagged outliers, based on the detection method, severity, domain impact, and modeling goals.
🧪 1. Outlier Detection Techniques
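| Method | Use Case |
| --- | --- |
| Z-Score (absolute value > 3) | General flag for numeric features; works best when the distribution is roughly symmetric |
| IQR Rule (below Q1 − 1.5×IQR or above Q3 + 1.5×IQR) | Detects extreme high or low values without assuming normality |
| Cook’s Distance | Influence of individual rows on a regression fit (leverage combined with residual size) |
| Domain Thresholds | Values outside known valid ranges |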
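As a minimal sketch of the IQR rule, the snippet below flags values that fall more than 1.5×IQR outside the quartiles. The helper name `iqr_outlier_flags`, the column name `income_iqr_flag`, and the sample values are hypothetical and chosen only for illustration.

```python
import pandas as pd

def iqr_outlier_flags(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN evaluate to False, so missing values are never flagged.
    return (values < lower) | (values > upper)

# Hypothetical data for illustration only.
df = pd.DataFrame({"income": [32_000, 41_000, 38_500, 45_000, 39_000, 250_000]})
df["income_iqr_flag"] = iqr_outlier_flags(df["income"])
```

The multiplier `k` is left as a parameter because 1.5 is a convention rather than a rule; a larger value such as 3 is sometimes used to flag only the most extreme points.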
⚠️ 2. Decision Matrix
| Outlier Scenario | Suggested Action |
| --- | --- |
| Absolute Z-Score between 3 and 5 | Flag but keep (likely informative variation) |
| Absolute Z-Score > 5 or implausible value | Flag, or set to NaN pending review |
| High Cook’s D and high leverage | Flag row; consider robust regression |
| Common placeholder values (9999, -1) | Convert to NaN and flag |
| Repeating junk value | Normalize or treat as structured missingness |
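For the Cook’s D scenario, the following is a rough sketch that computes Cook’s distance directly with NumPy for an ordinary least squares fit. The function name `cooks_distance`, the toy data, and the 4/n cut-off are assumptions for illustration; 4/n is a common rule of thumb, not a fixed standard.

```python
import numpy as np

def cooks_distance(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Cook's distance for each row of an ordinary least squares fit.

    X is the (n, k) feature matrix without an intercept column; one is added here.
    """
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])  # prepend the intercept column
    p = design.shape[1]                        # number of fitted parameters
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    # Leverage: diagonal of the hat matrix H = X (X'X)^-1 X' = X pinv(X)
    leverage = np.einsum("ij,ji->i", design, np.linalg.pinv(design))
    mse = residuals @ residuals / (n - p)
    return (residuals**2 / (p * mse)) * leverage / (1.0 - leverage) ** 2

# Hypothetical data: nine points on the line y = 2x + 1, plus one distant point off the line.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)
y = 2 * x + 1
y[-1] = 20.0

distances = cooks_distance(x.reshape(-1, 1), y)
influential = distances > 4 / len(x)  # assumed rule-of-thumb threshold
```

Rows marked in `influential` are candidates for flagging, not automatic removal; refitting with a robust regression and comparing coefficients is one way to judge how much they matter.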
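# Z-Score example: flag values more than 3 standard deviations from the mean
import numpy as np
from scipy.stats import zscore
# nan_policy='omit' keeps missing values from turning every score into NaN
z_scores = zscore(df['income'], nan_policy='omit')
df['income_outlier_flag'] = np.abs(z_scores) > 3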
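The placeholder row of the decision matrix can be sketched in a similar way. The sentinel list, the `age` column, and the flag column name here are hypothetical; the actual placeholder codes should be confirmed against the dataset's documentation.

```python
import pandas as pd

# Hypothetical sentinel codes; confirm the real ones against the data dictionary.
PLACEHOLDERS = [9999, -1]

df = pd.DataFrame({"age": [34, 9999, 27, -1, 51]})

# Record which rows held a placeholder before overwriting them.
df["age_placeholder_flag"] = df["age"].isin(PLACEHOLDERS)
# mask() replaces values with NaN wherever the condition is True.
df["age"] = df["age"].mask(df["age_placeholder_flag"])
```

Creating the flag before the replacement preserves the fact that a value was a placeholder rather than genuinely missing, which matters if the missingness itself is structured.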
⚖️ 3. Remove, Cap, or Flag?
| Condition | Action |
| --- | --- |
| Small share of genuine outliers amid ordinary domain noise | Flag only |
| Values are invalid (negative age, zero income where zero is not possible) | Replace (for example with NaN) or drop |
| Feature heavily impacts model coefficients | Try log transform or robust fit |
| Highly skewed feature | Consider transformation before removal |

| Transformation | Use Case |
| --- | --- |
| np.log1p(x) | Long-tailed, non-negative distributions |
| Winsorization | Cap values at upper/lower quantiles |
| Clipping | Set min/max bounds manually |
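The three transformations can be sketched as follows. The sample values, the 5th/95th percentile cut-offs, and the manual bounds of 0 and 500,000 are assumptions picked for illustration, not recommended defaults.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, non-negative feature.
income = pd.Series([20_000, 35_000, 40_000, 42_000, 55_000, 1_200_000], name="income")

# 1. log1p compresses the long right tail; it requires values greater than -1.
income_log = np.log1p(income)

# 2. Winsorization: cap at the 5th and 95th percentiles (the cut-offs are a choice).
low, high = income.quantile([0.05, 0.95])
income_winsorized = income.clip(lower=low, upper=high)

# 3. Clipping: cap at manually chosen, domain-informed bounds.
income_clipped = income.clip(lower=0, upper=500_000)
```

All three keep every row, which is usually preferable to dropping data. `scipy.stats.mstats.winsorize` is an alternative for step 2; it replaces the tails with the nearest retained observation rather than an interpolated quantile, so results can differ slightly.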
✅ Outlier Handling Checklist
- [ ] Detection method applied (Z, IQR, Cook’s D, etc.)
- [ ] Flag column created where needed
- [ ] Domain review for false flags or valid edge cases
- [ ] Strategy documented (flag/drop/transform)
- [ ] Post-action visual recheck (boxplot, hist)
💡 Tip
“Don’t just delete outliers — investigate them. They often tell the real story.”