Advanced Cleaning


🎯 Purpose

Use this checklist to ensure that datasets meet integrity standards for downstream modeling, publication, or deployment. It supports rule-based validation, field logic enforcement, and audit-driven documentation.


🔒 Schema Integrity

  • [ ] Column set matches schema or data dictionary
  • [ ] Dtypes match specification (e.g. datetime, category, int64)
  • [ ] Column order (if required) verified
  • [ ] Schema snapshot saved before and after cleaning
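
The schema checks above could be sketched roughly as follows with pandas. The column names, the `EXPECTED_SCHEMA` dictionary, and the `check_schema` helper are hypothetical and exist only for illustration; a real project would load the expected schema from its own data dictionary.

```python
import json

import pandas as pd

# Hypothetical expected schema: column name -> pandas dtype string.
EXPECTED_SCHEMA = {
    "customer_id": "int64",
    "amount": "float64",
    "segment": "category",
}


def check_schema(df: pd.DataFrame, expected: dict, check_order: bool = False) -> dict:
    """Compare a DataFrame's columns and dtypes against an expected schema."""
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    report = {
        "snapshot": actual,
        "missing_columns": [c for c in expected if c not in actual],
        "unexpected_columns": [c for c in actual if c not in expected],
        "dtype_mismatches": {
            c: {"expected": expected[c], "actual": actual[c]}
            for c in expected
            if c in actual and actual[c] != expected[c]
        },
    }
    if check_order:
        report["order_matches"] = list(df.columns) == list(expected)
    return report


df = pd.DataFrame({
    "customer_id": pd.Series([1, 2], dtype="int64"),
    "amount": pd.Series([9.5, 12.0], dtype="float64"),
    "segment": ["a", "b"],  # not yet categorical, so it should be reported
})

before = check_schema(df, EXPECTED_SCHEMA, check_order=True)

# Cleaning step: cast to the specified dtype, then snapshot again.
df["segment"] = df["segment"].astype("category")
after = check_schema(df, EXPECTED_SCHEMA, check_order=True)

# Both snapshots can be written out next to the dataset, e.g. as JSON.
snapshot_json = json.dumps(
    {"before": before["snapshot"], "after": after["snapshot"]}, indent=2
)
```

Keeping the before and after snapshots in one file makes it easy to diff what the cleaning step changed.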

🛑 Field-Level Validation

  • [ ] All numeric bounds enforced (e.g. age ∈ [0, 120])
  • [ ] All categorical values validated against allowed set
  • [ ] Text pattern formats enforced (e.g. ZIP, phone, email regex)
  • [ ] Validation flags (*_invalid) added for each rule type
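
As a minimal sketch of the field-level rules above, each rule can write a boolean `*_invalid` column rather than dropping rows. The sample data, the allowed status set, and the ZIP pattern are assumptions chosen for illustration, and treating a missing ZIP as invalid is one possible policy, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, -2, 150, 51],
    "status": ["active", "inactive", "unknown", "active"],
    "zip": ["02139", "2139", "94105-1234", None],
})

ALLOWED_STATUS = {"active", "inactive"}  # assumed allowed set
ZIP_PATTERN = r"\d{5}(?:-\d{4})?"        # assumed US ZIP / ZIP+4 format

# Numeric bounds: age must fall in [0, 120] (inclusive on both ends).
df["age_invalid"] = ~df["age"].between(0, 120)

# Categorical values: anything outside the allowed set is flagged.
df["status_invalid"] = ~df["status"].isin(ALLOWED_STATUS)

# Text pattern: the whole string must match; missing values count as invalid here.
df["zip_invalid"] = ~df["zip"].str.fullmatch(ZIP_PATTERN, na=False)

# Per-rule violation counts, useful for the audit summary.
violations = df.filter(like="_invalid").sum().to_dict()
```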

🔗 Cross-Field Logic

  • [ ] Temporal rules applied (e.g. start_date < end_date)
  • [ ] Conditional logic verified (e.g. if status = active → qty > 0)
  • [ ] Multi-column consistency rules flagged and logged
  • [ ] Row-level valid_row or row_flag column created (optional)
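
One way the cross-field rules above might look in pandas is shown below. The column names and sample rows are hypothetical; the two rules mirror the examples in the checklist (start_date < end_date, and active status implying qty > 0).

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2024-01-01", "2024-03-10", "2024-05-01"]),
    "end_date": pd.to_datetime(["2024-02-01", "2024-03-01", "2024-06-01"]),
    "status": ["active", "inactive", "active"],
    "qty": [5, 0, 0],
})

# Temporal rule: start_date must be strictly before end_date.
# Comparisons against missing dates evaluate to False, so those rows are flagged too.
df["date_order_invalid"] = ~(df["start_date"] < df["end_date"])

# Conditional rule: status == "active" implies qty > 0.
df["active_qty_invalid"] = (df["status"] == "active") & ~(df["qty"] > 0)

# Optional row-level summary: a row is valid only if no rule flagged it.
rule_cols = ["date_order_invalid", "active_qty_invalid"]
df["valid_row"] = ~df[rule_cols].any(axis=1)
```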

🧪 Imputation QA

  • [ ] All imputations explicitly logged
  • [ ] Imputed fields flagged with binary indicators (*_imputed_flag)
  • [ ] Strategy consistent with Part 1 or domain logic
  • [ ] Imputation documented in cleaning notes or metadata
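
The imputation items above can be illustrated with a small helper that fills, flags, and logs in one step. This is only a sketch: `impute_median`, the `income` column, and the median strategy are assumptions, and the strategy actually used should follow the project's own guidance.

```python
import pandas as pd

df = pd.DataFrame({"income": [52000.0, None, 61000.0, None, 58000.0]})
imputation_log = []  # one entry per imputed column


def impute_median(df: pd.DataFrame, column: str, log: list) -> pd.DataFrame:
    """Fill missing values with the column median, add a flag column, log the step."""
    missing = df[column].isna()
    fill_value = df[column].median()  # median of the observed values only
    df[f"{column}_imputed_flag"] = missing.astype("int8")
    df[column] = df[column].fillna(fill_value)
    log.append({
        "column": column,
        "strategy": "median",
        "fill_value": float(fill_value),
        "rows_imputed": int(missing.sum()),
    })
    return df


df = impute_median(df, "income", imputation_log)
```

Computing the flag before filling is the important ordering detail: once the values are filled, the original missingness can no longer be recovered from the column itself.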

🚨 Outlier & Exception Logging

  • [ ] Outliers reviewed per field (z-score, IQR, domain limits)
  • [ ] Outlier records exported or logged
  • [ ] Manual decisions flagged (*_override) if applicable
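
A possible sketch of the outlier review above, using the IQR rule. The `iqr_outliers` helper, the `order_value` column, and the multiplier of 1.5 are illustrative assumptions; z-scores or domain limits would slot into the same flag-and-log pattern.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values more than k * IQR below the first or above the third quartile."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


df = pd.DataFrame({"order_value": [20.0, 22.0, 19.0, 21.0, 23.0, 500.0]})

# Flag rather than delete, so reviewers can see what was caught.
df["order_value_outlier"] = iqr_outliers(df["order_value"])

# Records to export or log for review.
outlier_log = df.loc[df["order_value_outlier"], ["order_value"]]

# Manual decisions: set to True where a reviewer decides to keep a flagged value.
df["order_value_override"] = False
```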

🧾 Audit Trail & Logging

  • [ ] Cleaning steps documented in JSON/YAML log
  • [ ] Rule violations summarized by field
  • [ ] All logic flags exported with dataset
  • [ ] Version number, timestamp, and hash recorded
  • [ ] Data drift baseline saved: distribution statistics (mean, std, null %) for production monitoring
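
As a rough illustration of the audit items above, the following standard-library sketch builds a JSON log entry with a version, UTC timestamp, content hash, per-field violation counts, and a drift baseline. The function names, field names, and sample values are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib
import json
import statistics
from datetime import datetime, timezone


def baseline_stats(values: list) -> dict:
    """Mean, sample standard deviation, and null percentage for one field."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),
        "null_pct": 100.0 * (len(values) - len(present)) / len(values),
    }


def build_audit_record(data: bytes, version: str, steps: list,
                       violations: dict, baselines: dict) -> dict:
    """Assemble one audit log entry for a cleaned dataset."""
    return {
        "version": version,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(data).hexdigest(),  # hash of the saved file bytes
        "steps": steps,
        "violations_by_field": violations,
        "drift_baseline": baselines,
    }


csv_bytes = b"age,status\n34,active\n-2,inactive\n,active\n51,active\n"
record = build_audit_record(
    csv_bytes,
    version="1.0.0",
    steps=["enforce age bounds", "validate status values"],
    violations={"age": 1, "status": 0},
    baselines={"age": baseline_stats([34, -2, None, 51])},
)
log_text = json.dumps(record, indent=2)
```

Hashing the exact bytes that were saved lets a downstream consumer confirm it received the same file the log describes.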

💾 Final Save & Review

  • [ ] Dataset saved in approved format (e.g. .csv, .parquet)
  • [ ] Flags and logs attached or merged
  • [ ] Cleaning metadata saved separately if needed

Final Tip

“Advanced cleaning isn’t about removing errors; it’s about proving the dataset can be trusted.”