Missingness Handling
Purpose
Use this card to decide how to handle missing values (drop, impute, or flag) based on the type of missingness, the percentage of nulls, and the modeling implications.
1. Classify the Missingness
Type | Description | Action |
---|---|---|
MCAR | Missing Completely At Random (unrelated to any variable) | Safe to drop rows or impute with mean/median |
MAR | Missing At Random (missingness depends on other observed variables) | Impute using correlated fields and add a flag |
MNAR | Missing Not At Random (missingness depends on the unobserved value itself) | Avoid silent fills; flag and document |
Use groupby comparisons, domain logic, or visual patterns to infer whether missingness is MAR or MNAR.
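The groupby check mentioned above can be sketched as follows. This is a minimal illustration on made-up data; the `segment` and `income` columns and their values are hypothetical. It compares the share of missing `income` across groups. A large gap between groups suggests the missingness depends on an observed variable (consistent with MAR rather than MCAR). It cannot confirm or rule out MNAR on its own, which usually needs domain knowledge.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: income is missing more often for the "student" segment.
df = pd.DataFrame({
    "segment": ["student", "student", "student", "employed", "employed", "employed"],
    "income": [np.nan, np.nan, 12000.0, 48000.0, 52000.0, np.nan],
})

# Share of missing income values within each segment (0.0 to 1.0).
missing_by_segment = df["income"].isna().groupby(df["segment"]).mean()
print(missing_by_segment)
```

If the rates are roughly equal across every grouping you try, MCAR becomes more plausible, though the check is only as good as the grouping variables available.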
2. Threshold-Based Drop/Impute Rules
% Missing | Recommended Action |
---|---|
< 5% | Drop rows or impute (generally safe) |
5–30% | Impute with flag (especially for MAR) |
30–50% | Assess necessity; drop the column if it has low importance |
> 50% | Usually drop the column unless it is core to the target or domain |
# Flag before filling
df['income_flag'] = df['income'].isnull()
df['income'] = df['income'].fillna(df['income'].median())
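One way to apply the threshold table across a whole DataFrame is sketched below. The helper `recommend_action` and the sample columns are hypothetical, and the handling of exact boundary values (5, 30, 50) and of columns with no nulls is an assumption, since the table does not specify them.

```python
import numpy as np
import pandas as pd

def recommend_action(pct_missing: float) -> str:
    """Map a column's percent missing to an action from the threshold table.

    Boundary handling (exactly 5, 30, or 50) is an assumption.
    """
    if pct_missing == 0:
        return "no action"
    if pct_missing < 5:
        return "drop rows or impute"
    if pct_missing <= 30:
        return "impute with flag"
    if pct_missing <= 50:
        return "assess necessity"
    return "usually drop"

# Hypothetical sample data with 0%, 20%, and 80% missing values.
df = pd.DataFrame({
    "age": [34, 29, 41, 38, 25, 52, 47, 31, 36, 44],
    "income": [50.0, np.nan, 62.0, np.nan, 48.0, 71.0, 55.0, 59.0, 64.0, 53.0],
    "fax": [np.nan] * 8 + ["555-0100", "555-0101"],
})

# Percent of nulls per column, paired with the suggested action.
pct_missing = df.isna().mean() * 100
report = pd.DataFrame({
    "pct_missing": pct_missing,
    "action": pct_missing.map(recommend_action),
})
print(report)
```

The output is a starting point for review, not a decision: a column above 50% missing may still be worth keeping if it is core to the target.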
3. Strategy by Column Type
Type | Preferred Strategy |
---|---|
Numeric | Median, regression, KNN imputation |
Categorical | Mode, 'Missing' tag, frequency-based fill |
Dates | Interpolation, median by group, drop if sparse |
Identifiers | Never impute; drop or flag |
Always log the imputation strategy and add a flag when filling.
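The "flag, fill, and log" habit can be wrapped in a small helper. The sketch below is illustrative only: `impute_with_flag`, `imputation_log`, and the sample columns are hypothetical names, and it covers just three of the simpler strategies from the table (median, mode, and a 'Missing' tag).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one numeric and one categorical column.
df = pd.DataFrame({
    "income": [50.0, np.nan, 62.0, 48.0],
    "city": ["Oslo", None, "Oslo", "Bergen"],
})

imputation_log = []  # one record per imputed column

def impute_with_flag(frame, column, strategy):
    """Flag missing rows, fill them, and record what was done."""
    mask = frame[column].isna()
    if strategy == "median":
        fill_value = frame[column].median()
    elif strategy == "mode":
        fill_value = frame[column].mode().iloc[0]
    elif strategy == "missing_tag":
        fill_value = "Missing"
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Create the flag before filling so the original nulls stay traceable.
    frame[f"{column}_flag"] = mask
    frame[column] = frame[column].fillna(fill_value)
    imputation_log.append({
        "column": column,
        "strategy": strategy,
        "fill_value": fill_value,
        "n_filled": int(mask.sum()),
    })
    return frame

df = impute_with_flag(df, "income", "median")
df = impute_with_flag(df, "city", "missing_tag")
print(df)
print(imputation_log)
```

Keeping the log as plain records makes it easy to write out alongside the dataset, so the strategy per column is documented rather than buried in code.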
Decision Checklist
- [ ] % nulls calculated and reviewed
- [ ] Cause of missingness identified or inferred (MCAR/MAR/MNAR)
- [ ] Strategy documented per column
- [ ] Imputation flags created for critical fields
- [ ] Columns >50% missing either dropped or justified
Tip
"Don't just fill in the blanks. Every null is a clue."