Advanced Cleaning
Purpose
Use this checklist to ensure that datasets meet integrity standards for downstream modeling, publication, or deployment. It supports rule-based validation, field logic enforcement, and audit-driven documentation.
Schema Integrity
- [ ] Column set matches schema or data dictionary
- [ ] Dtypes match specification (e.g. `datetime`, `category`, `int64`)
- [ ] Column order (if required) verified
- [ ] Schema snapshot saved before and after cleaning
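
The schema checks above can be sketched with pandas. This is a minimal illustration, not a prescribed implementation: `EXPECTED_SCHEMA`, `DTYPE_CHECKS`, `schema_snapshot`, `check_schema`, and the sample columns are all hypothetical names introduced here.

```python
import pandas as pd

# Hypothetical spec: column name -> dtype label from the data dictionary.
EXPECTED_SCHEMA = {
    "customer_id": "int64",
    "signup_date": "datetime",
    "segment": "category",
}

# Hypothetical mapping from dtype label to a check on a Series.
DTYPE_CHECKS = {
    "int64": lambda s: str(s.dtype) == "int64",
    "datetime": pd.api.types.is_datetime64_any_dtype,
    "category": lambda s: isinstance(s.dtype, pd.CategoricalDtype),
}


def schema_snapshot(df):
    """Return an ordered column -> dtype mapping suitable for logging."""
    return {col: str(dtype) for col, dtype in df.dtypes.items()}


def check_schema(df, expected, enforce_order=False):
    """List schema problems; an empty list means the frame conforms."""
    problems = []
    missing = [c for c in expected if c not in df.columns]
    extra = [c for c in df.columns if c not in expected]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    for col, label in expected.items():
        if col in df.columns and not DTYPE_CHECKS[label](df[col]):
            problems.append(f"{col}: dtype {df[col].dtype}, expected {label}")
    if enforce_order and not missing and not extra:
        if list(df.columns) != list(expected):
            problems.append("column order differs from schema")
    return problems


raw = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "signup_date": ["2024-01-05", "2024-02-10", "2024-03-15"],
    "segment": ["a", "b", "a"],
})
before = schema_snapshot(raw)  # snapshot before cleaning

clean = raw.assign(
    signup_date=pd.to_datetime(raw["signup_date"]),
    segment=raw["segment"].astype("category"),
)
after = schema_snapshot(clean)  # snapshot after cleaning

problems_before = check_schema(raw, EXPECTED_SCHEMA)
problems_after = check_schema(clean, EXPECTED_SCHEMA, enforce_order=True)
```

Saving both `before` and `after` alongside the dataset makes any dtype change introduced by cleaning visible in review.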
Field-Level Validation
- [ ] All numeric bounds enforced (e.g. `age ∈ [0, 120]`)
- [ ] All categorical values validated against allowed set
- [ ] Text pattern formats enforced (e.g. ZIP, phone, email regex)
- [ ] Validation flags (`*_invalid`) added for each rule type
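
One way to express the field-level rules above is to add one boolean `*_invalid` column per rule rather than dropping rows. The sketch below assumes a small sample frame; `ALLOWED_STATUS` and `ZIP_PATTERN` are hypothetical rule definitions, not values from any real data dictionary.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, -2, 130, 58],
    "status": ["active", "inactive", "unknown", "active"],
    "zip": ["02139", "2139", "94105", "ABCDE"],
})

ALLOWED_STATUS = ["active", "inactive"]  # hypothetical allowed set
ZIP_PATTERN = r"\d{5}"                   # hypothetical 5-digit ZIP rule

# Numeric bounds: between() is inclusive on both ends by default.
df["age_invalid"] = ~df["age"].between(0, 120)

# Categorical membership against the allowed set.
df["status_invalid"] = ~df["status"].isin(ALLOWED_STATUS)

# Text pattern: fullmatch requires the whole string to match the regex.
df["zip_invalid"] = ~df["zip"].str.fullmatch(ZIP_PATTERN)

# Violation counts per rule, ready for the audit log.
violations = df.filter(like="_invalid").sum().to_dict()
```

Keeping the flags as columns leaves the original values intact, so later review can decide whether to correct, impute, or exclude each flagged record.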
Cross-Field Logic
- [ ] Temporal rules applied (e.g. `start_date < end_date`)
- [ ] Conditional logic verified (e.g. `if status = active → qty > 0`)
- [ ] Multi-column consistency rules flagged and logged
- [ ] Row-level `valid_row` or `row_flag` column created (optional)
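
The cross-field rules above can be sketched as boolean masks that combine into a row-level `valid_row` column. The sample data and the flag names `date_order_invalid` and `active_qty_invalid` are hypothetical, chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2024-01-01", "2024-03-01", "2024-05-01"]),
    "end_date": pd.to_datetime(["2024-02-01", "2024-02-15", "2024-06-01"]),
    "status": ["active", "inactive", "active"],
    "qty": [5, 0, 0],
})

# Temporal rule: start_date must precede end_date.
df["date_order_invalid"] = ~(df["start_date"] < df["end_date"])

# Conditional rule: status == "active" implies qty > 0.
# An implication is violated only when the premise holds and the conclusion fails.
df["active_qty_invalid"] = (df["status"] == "active") & ~(df["qty"] > 0)

# Row-level flag: a row is valid when no cross-field rule fired.
flag_cols = ["date_order_invalid", "active_qty_invalid"]
df["valid_row"] = ~df[flag_cols].any(axis=1)
```

Writing the conditional rule as "premise and not conclusion" keeps rows where the premise is false (here, inactive records) from being flagged.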
Imputation QA
- [ ] All imputations explicitly logged
- [ ] Imputed fields flagged with binary indicators (`*_imputed_flag`)
- [ ] Strategy consistent with Part 1 or domain logic
- [ ] Imputation documented in cleaning notes or metadata
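
A minimal sketch of logged, flagged imputation follows. Median fill is used purely as an example strategy; the strategy chosen in Part 1 or by domain logic may differ. The helper `impute_median`, the `imputation_log` list, and the `income` column are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"income": [52000.0, None, 61000.0, None, 58000.0]})

imputation_log = []  # one entry per imputed column


def impute_median(frame, column, log):
    """Fill missing values with the column median, flag them, and log the step."""
    out = frame.copy()
    missing = out[column].isna()
    fill_value = out[column].median()  # median ignores missing values
    # Record the flag before filling, while the missing mask is still known.
    out[f"{column}_imputed_flag"] = missing.astype(int)
    out[column] = out[column].fillna(fill_value)
    log.append({
        "column": column,
        "strategy": "median",
        "fill_value": float(fill_value),
        "rows_imputed": int(missing.sum()),
    })
    return out


df = impute_median(df, "income", imputation_log)
```

The log entries can be written into the cleaning notes or metadata so that each imputed value remains traceable.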
Outlier & Exception Logging
- [ ] Outliers reviewed per field (z-score, IQR, domain limits)
- [ ] Outlier records exported or logged
- [ ] Manual decisions flagged (`*_override`) if applicable
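
As an illustration of the outlier review above, the sketch below flags values with the common 1.5 × IQR rule, exports the flagged records, and records a manual decision. The helper `iqr_outlier_mask`, the `order_value` column, and the override shown are hypothetical; z-scores or domain limits could be substituted.

```python
import pandas as pd

df = pd.DataFrame({"order_value": [20.0, 22.0, 19.0, 21.0, 23.0, 250.0]})


def iqr_outlier_mask(series, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


df["order_value_outlier"] = iqr_outlier_mask(df["order_value"])

# Export or log the flagged records for review instead of deleting them.
outlier_log = df.loc[df["order_value_outlier"]].copy()

# Manual decisions: default to False, then mark records a reviewer chose to keep.
df["order_value_override"] = False
df.loc[outlier_log.index, "order_value_override"] = True  # hypothetical decision
```

Separating the automatic flag from the override column preserves both what the rule detected and what a person decided.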
Audit Trail & Logging
- [ ] Cleaning steps documented in JSON/YAML log
- [ ] Rule violations summarized by field
- [ ] All logic flags exported with dataset
- [ ] Version number, timestamp, and hash recorded
- [ ] Data Drift: Baseline distribution statistics (mean, std, null %) saved for production monitoring.
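
The audit items above could be gathered into a single JSON log, sketched below. The field names in `audit_log`, the version string, the step description, and the helpers `dataset_hash` and `baseline_stats` are assumptions made for illustration; hashing the CSV serialization is one option among several.

```python
import hashlib
import json
from datetime import datetime, timezone

import pandas as pd

df = pd.DataFrame({
    "age": [34.0, None, 58.0, 41.0],
    "age_invalid": [False, False, False, False],
})


def dataset_hash(frame):
    """SHA-256 of the frame's CSV serialization (one possible fingerprint)."""
    csv_bytes = frame.to_csv(index=False).encode("utf-8")
    return hashlib.sha256(csv_bytes).hexdigest()


def baseline_stats(frame, columns):
    """Mean, standard deviation, and null percentage per column."""
    return {
        col: {
            "mean": float(frame[col].mean()),
            "std": float(frame[col].std()),
            "null_pct": float(frame[col].isna().mean() * 100),
        }
        for col in columns
    }


audit_log = {
    "version": "1.0.0",  # hypothetical version label
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "sha256": dataset_hash(df),
    "steps": ["validated age bounds"],  # hypothetical step description
    "violations_by_field": {"age": int(df["age_invalid"].sum())},
    "baseline": baseline_stats(df, ["age"]),
}

log_text = json.dumps(audit_log, indent=2)
```

Recomputing the hash on the delivered file and comparing it with the logged value confirms that the dataset was not altered after cleaning.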
Final Save & Review
- [ ] Dataset saved in approved format (e.g. `.csv`, `.parquet`)
- [ ] Flags and logs attached or merged
- [ ] Cleaning metadata saved separately if needed
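
A small sketch of the final save, assuming CSV output with a JSON sidecar for the cleaning metadata. The file names, the temporary output directory, and the `metadata` fields are hypothetical.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"age": [34, 130], "age_invalid": [False, True]})
metadata = {"version": "1.0.0", "flag_columns": ["age_invalid"]}  # hypothetical

out_dir = Path(tempfile.mkdtemp())  # stand-in for the approved output location
data_path = out_dir / "dataset_clean.csv"
meta_path = out_dir / "dataset_clean.meta.json"

# Flags travel with the data; metadata is saved separately as a sidecar file.
df.to_csv(data_path, index=False)
meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")

# Reload to confirm the saved file round-trips as expected.
reloaded = pd.read_csv(data_path)
```

For Parquet output, `DataFrame.to_parquet` can be used instead, which requires a Parquet engine such as `pyarrow` to be installed.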
Final Tip
“Advanced cleaning isn’t about removing errors; it’s about proving the dataset can be trusted.”