Outlier Action
🎯 Purpose
Use this decision card to decide how to handle flagged outliers, based on detection method, severity, domain impact, and modeling goals.
🧪 1. Outlier Detection Techniques
| Method | Use Case |
| --- | --- |
| Z-Score (absolute value > 3) | General flag for numeric features |
| IQR Rule (1.5×IQR) | Flags values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR |
| Cook’s Distance | Regression points with high leverage and influence on the fit |
| Domain Thresholds | Based on known valid ranges |
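
As an illustration of the IQR rule in the table above, here is a minimal sketch using pandas. The helper name `iqr_outlier_flags`, the column names, and the sample data are hypothetical, and the multiplier `k=1.5` simply mirrors the rule as stated.

```python
import pandas as pd

def iqr_outlier_flags(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons against NaN evaluate to False, so missing values are never flagged.
    return (values < lower) | (values > upper)

# Hypothetical data: one extreme income among ordinary ones.
df = pd.DataFrame({"income": [42_000, 48_000, 51_000, 55_000, 60_000, 1_000_000]})
df["income_iqr_flag"] = iqr_outlier_flags(df["income"])
```

Unlike the Z-score, the quartiles are barely moved by the extreme value itself, which is one reason the IQR rule may be preferable on small or skewed samples.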
⚠️ 2. Decision Matrix
| Outlier Scenario | Suggested Action |
| --- | --- |
| Z-Score between 3–5 | Flag but keep (informative variation) |
| Z-Score > 5 or implausible value | Flag or set to NaN for review |
| Cook’s D high & leverage high | Flag row; consider robust regression |
| Common placeholder outliers (9999, -1) | Convert to NaN and flag |
| Repeating junk value | Normalize or treat as structured missingness |
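
The placeholder row of the matrix could be handled roughly as follows. This is a sketch under assumptions: the sentinel list, column names, and data are hypothetical, and a code such as 9999 may be a legitimate value in some features, so the list should be confirmed per column.

```python
import pandas as pd

# Hypothetical sentinel codes; confirm the real ones against the data dictionary.
PLACEHOLDERS = [9999, -1]

df = pd.DataFrame({"income": [52_000, 9999, 61_000, -1, 58_000]})

# Create the flag first so the information survives the replacement.
df["income_placeholder_flag"] = df["income"].isin(PLACEHOLDERS)
# mask() sets flagged entries to NaN and leaves all other values unchanged.
df["income"] = df["income"].mask(df["income_placeholder_flag"])
```

Keeping the flag column lets a later model or reviewer treat the gap as structured missingness rather than random missingness.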
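# Z-score example: flag values more than 3 standard deviations from the mean.
# Assumes `df` is an existing pandas DataFrame with a numeric 'income' column.
from scipy.stats import zscore
# nan_policy='omit' keeps missing values from turning every z-score into NaN
z_scores = zscore(df['income'], nan_policy='omit')
df['income_outlier_flag'] = abs(z_scores) > 3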
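
For the Cook’s D row of the matrix, the sketch below computes Cook’s distance and leverage for an ordinary least squares fit using only NumPy. The function name `cooks_distance`, the synthetic data, and the `4 / n` cutoff are assumptions for illustration; the cutoff is a common rule of thumb, not a fixed standard.

```python
import numpy as np

def cooks_distance(X, y):
    """Return (cooks_d, leverage) for an OLS fit of y on X plus an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])      # add intercept column
    n, p = X1.shape
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # OLS coefficients
    resid = y - X1 @ beta
    # Leverage is the diagonal of the hat matrix H = X1 @ pinv(X1).
    leverage = np.einsum("ij,ji->i", X1, np.linalg.pinv(X1))
    s2 = resid @ resid / (n - p)                    # residual variance estimate
    cooks_d = resid**2 / (p * s2) * leverage / (1 - leverage) ** 2
    return cooks_d, leverage

# Synthetic data: a linear trend plus one point far from it in both x and y.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(scale=0.5, size=30)
x = np.append(x, 8.0)
y = np.append(y, -10.0)

cooks_d, leverage = cooks_distance(x, y)
flagged = cooks_d > 4 / len(y)   # rule-of-thumb cutoff (assumption)
```

Rows that are flagged here are candidates for review or for refitting with a robust regression method, rather than for automatic deletion.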
⚖️ 3. Remove, Cap, or Flag?
| Condition | Action |
| --- | --- |
| Small % of true outliers with domain noise | Flag only |
| Values are invalid (negative age, zero income) | Replace or drop |
| Feature heavily impacts model coefficients | Try log transform or robust fit |
| Highly skewed feature | Consider transformation before removal |

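
The "replace or drop" action for invalid values might look like the sketch below, which checks a column against a domain range. The range of 0 to 120 for age, the column names, and the data are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"age": [34, -2, 57, 210, 41]})

# Hypothetical valid range for age; between() is inclusive on both ends.
valid = df["age"].between(0, 120)
df["age_invalid_flag"] = ~valid

# Option 1 (replace): where() keeps valid values and sets the rest to NaN.
df["age"] = df["age"].where(valid)

# Option 2 (drop): keep only rows whose age passed the check.
df_clean = df[~df["age_invalid_flag"]]
```

Note that `between()` returns False for values that are already missing, so pre-existing NaNs would also be marked invalid unless they are handled separately.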

| Transformation | Use Case |
| --- | --- |
| `np.log1p(x)` | For long-tailed distributions |
| Winsorization | Cap values at upper/lower quantiles |
| Clipping | Set min/max bounds manually |
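
The three transformations in the table can be sketched side by side. The data, the 5th/95th percentile limits, and the manual bounds of 0 and 100 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical feature: values 1..99 plus one extreme observation.
s = pd.Series(list(range(1, 100)) + [10_000], dtype=float)

# log1p compresses a long right tail; it requires values greater than -1.
logged = np.log1p(s)

# Winsorization: cap at quantiles computed from the data itself.
low, high = s.quantile([0.05, 0.95])
winsorized = s.clip(lower=low, upper=high)

# Clipping: cap at bounds chosen manually from domain knowledge.
clipped = s.clip(lower=0, upper=100)
```

All three keep every row, which is what distinguishes them from removal. SciPy also provides `scipy.stats.mstats.winsorize` as an alternative to the quantile-and-clip approach shown here.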
✅ Outlier Handling Checklist
- [ ] Detection method applied (Z, IQR, Cook’s D, etc.)
- [ ] Flag column created where needed
- [ ] Domain review for false flags or valid edge cases
- [ ] Strategy documented (flag/drop/transform)
- [ ] Post-action visual recheck (boxplot, hist)
💡 Tip
“Don’t just delete outliers — investigate them. They often tell the real story.”