
Missingness Handling


🎯 Purpose

Use this card to decide how to handle missing values (drop, impute, or flag) based on the type of missingness, percent nulls, and modeling implications.


🔍 1. Classify the Missingness

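| Type | Description | Action |
| --- | --- | --- |
| MCAR | Missing Completely At Random (unrelated to any variable) | Safe to drop or mean/median impute |
| MAR | Missing At Random (depends on other observed variables) | Impute using correlated fields, flag |
| MNAR | Missing Not At Random (systematic; depends on the unobserved value itself) | Avoid silent fill; flag and document |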

βœ”οΈ Use groupby, domain logic, or visual patterns to infer MAR/MNAR


📊 2. Threshold-Based Drop/Impute Rules

| % Missing | Recommended Action |
| --- | --- |
| < 5% | Drop rows or impute (safe) |
| 5–30% | Impute with flag (especially MAR) |
| 30–50% | Assess necessity; drop if low importance |
| > 50% | Usually drop unless core to target or domain |
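
# Flag before filling, so the imputed rows stay identifiable afterwards
df['income_flag'] = df['income'].isnull()
# Fill with the median, which is less sensitive to outliers than the mean
df['income'] = df['income'].fillna(df['income'].median())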
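
The thresholds above can be turned into a quick per-column report. This is a minimal sketch, not a fixed rule set: `recommend_action` and `missingness_report` are hypothetical helpers, and how the exact boundary values (5%, 30%, 50%) are bucketed is an assumption, since the card does not say.

```python
import numpy as np
import pandas as pd

def recommend_action(pct_missing: float) -> str:
    """Map a column's percent missing to the card's suggested action."""
    if pct_missing == 0:
        return "no action"
    if pct_missing < 5:
        return "drop rows or impute"
    if pct_missing <= 30:   # boundary handling is an assumption
        return "impute with flag"
    if pct_missing <= 50:
        return "assess necessity"
    return "usually drop"

def missingness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Percent missing per column plus the suggested action, worst first."""
    pct = df.isna().mean() * 100
    report = pd.DataFrame({"pct_missing": pct.round(1)})
    report["action"] = pct.map(recommend_action)
    return report.sort_values("pct_missing", ascending=False)

# Hypothetical 10-row frame
df = pd.DataFrame({
    "income": [50_000, np.nan, 42_000, 47_000, 61_000, 58_000, 39_000, np.nan, 45_000, 52_000],
    "referrer": [np.nan] * 6 + ["ad", "search", "ad", "email"],
    "age": [25, 31, 47, 52, 38, 29, 44, 36, 41, 33],
})
report = missingness_report(df)
print(report)
```

The report is a starting point only: a column above 50% missing may still be worth keeping if it is core to the target or domain.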

🧪 3. Strategy by Column Type

| Type | Preferred Strategy |
| --- | --- |
| Numeric | Median, regression, KNN imputation |
| Categorical | Mode, 'Missing' tag, frequency-based fill |
| Dates | Interpolation, median by group, drop if sparse |
| Identifiers | Never impute; drop or flag |

βœ”οΈ Always log imputation strategy + flag when filling


✅ Decision Checklist

  • [ ] % nulls calculated and reviewed
  • [ ] Cause of missingness identified or inferred (MCAR/MAR/MNAR)
  • [ ] Strategy documented per column
  • [ ] Imputation flags created for critical fields
  • [ ] Columns >50% missing either dropped or justified

💡 Tip

“Don’t just fill in the blanks. Every null is a clue.”