Skip to content

Outlier Action

🎯 Purpose

Use this decision card to evaluate and decide how to handle flagged outliers — based on method of detection, severity, domain impact, and modeling goals.


🧪 1. Outlier Detection Techniques

Method Use Case
Z-Score (>3) General flag for numeric features
IQR Rule (1.5×IQR) Detects extreme high/low values
Cook’s Distance Regression leverage + influence
Domain Thresholds Based on known valid ranges

⚠️ 2. Decision Matrix

Outlier Scenario Suggested Action
Z-Score between 3–5 Flag but keep (informative variation)
Z-Score > 5 or implausible value Flag or set to NaN for review
Cook’s D high & leverage high Flag row; consider robust regression
Common placeholder outliers (9999, -1) Convert to NaN and flag
Repeating junk value Normalize or treat as structured missingness
# Z-Score example
from scipy.stats import zscore
z_scores = zscore(df['income'])
df['income_outlier_flag'] = (abs(z_scores) > 3)

⚖️ 3. Remove, Cap, or Flag?

Condition Action
Small % of true outliers with domain noise Flag only
Values are invalid (negative age, zero income) Replace or drop
Feature heavily impacts model coefficients Try log transform or robust fit
Highly skewed feature Consider transformation before removal

🧰 4. Transformation Options

Transformation Use Case
np.log1p(x) For long-tailed distributions
Winsorization Cap values at upper/lower quantiles
Clipping Set min/max bounds manually

✅ Outlier Handling Checklist

  • [ ] Detection method applied (Z, IQR, Cook’s D, etc.)
  • [ ] Flag column created where needed
  • [ ] Domain review for false flags or valid edge cases
  • [ ] Strategy documented (flag/drop/transform)
  • [ ] Post-action visual recheck (boxplot, hist)

💡 Tip

“Don’t just delete outliers — investigate them. They often tell the real story.”