# Part 2 - Advanced
## Purpose

This guide builds on foundational cleaning by introducing validation logic, schema integrity checks, imputation flagging, cross-field rules, and audit-ready structures. It prepares datasets for field use, deployment, or downstream risk-sensitive analysis.
## 1. Schema and Metadata Enforcement

### Schema Freeze and Comparison

```python
expected_cols = ['id', 'name', 'age', 'start_date']
# Compare as lists, not sets, so column order is checked too
assert list(df.columns) == expected_cols
```

✅ Confirm ordering, data types, and required fields
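A fuller sketch of the freeze that also checks dtypes might look like the following; the column names and dtypes here are illustrative, not prescribed by the guide.

```python
import pandas as pd

# Illustrative frozen schema: ordered column names mapped to expected dtypes
EXPECTED_SCHEMA = {
    "id": "int64",
    "name": "object",
    "age": "float64",
    "start_date": "datetime64[ns]",
}

def check_schema(df):
    """Return a list of human-readable schema violations (empty list = pass)."""
    problems = []
    if list(df.columns) != list(EXPECTED_SCHEMA):
        problems.append(f"column order/names differ: {list(df.columns)}")
    for col, expected in EXPECTED_SCHEMA.items():
        if col in df.columns and str(df[col].dtype) != expected:
            problems.append(f"{col}: expected {expected}, got {df[col].dtype}")
    return problems
```

Returning a list of violations instead of asserting lets you log every problem at once rather than failing on the first.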
### Column Audit Logging

```python
schema_snapshot = df.dtypes.to_dict()
```

✅ Log pre/post cleaning schema as JSON or YAML
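Because pandas dtypes are numpy objects and not directly JSON-serializable, stringify them before writing the snapshot. The filename below is an assumption, not part of the guide.

```python
import json
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "age": [34.0, None]})

# numpy dtype objects aren't JSON-serializable, so convert to strings first
schema_snapshot = {col: str(dtype) for col, dtype in df.dtypes.items()}
with open("schema_pre_clean.json", "w") as f:  # hypothetical log path
    json.dump(schema_snapshot, f, indent=2)
```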
## 2. Field-Level Validation Rules

| Rule Type | Example |
|---|---|
| Categorical values | `state ∈ ['CA', 'NY', 'TX']` |
| Numeric bounds | `age ∈ [0, 120]`, `revenue > 0` |
| Text pattern match | Regex for zip codes, emails, phone numbers |
| Datetime window | `start_date >= 2015`, `date < today()` |
```python
df['age_flag'] = ~df['age'].between(0, 120)
```

✅ Create Boolean flags for all rule violations
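Each rule type in the table can become its own flag in the same way; the regex and date cutoff below are illustrative sketches, not canonical rules.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "ZZ"],
    "zip": ["94110", "12ab"],
    "start_date": pd.to_datetime(["2016-03-01", "2012-07-15"]),
})

# One Boolean flag per rule: True marks a violation
df["state_flag"] = ~df["state"].isin(["CA", "NY", "TX"])
df["zip_flag"] = ~df["zip"].str.fullmatch(r"\d{5}(-\d{4})?")
df["date_flag"] = df["start_date"] < pd.Timestamp("2015-01-01")
```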
## 3. Cross-Field Logic Checks

### Logical Consistency

```python
df['date_logic_flag'] = df['end_date'] < df['start_date']
df['qty_logic_flag'] = (df['ordered'] == 1) & (df['quantity'] <= 0)
```

✅ Combine flags into a row-level `valid` indicator
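Collecting every `*_flag` column into a single row-level `valid` indicator might look like this (the flag names follow the examples above):

```python
import pandas as pd

df = pd.DataFrame({
    "age_flag": [False, True, False],
    "date_logic_flag": [False, False, True],
})

# A row is valid only when no rule flag fired
flag_cols = [c for c in df.columns if c.endswith("_flag")]
df["valid"] = ~df[flag_cols].any(axis=1)
```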
## 4. Advanced Missingness QA

| Pattern Type | Diagnostic Tool |
|---|---|
| MCAR (random) | Heatmap, randomness visual check |
| MAR/MNAR | Conditional probability, clustering |
| Structured gaps | Compare to business rules |
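One cheap conditional-probability diagnostic for MAR is to compare missingness rates across groups of another column; a large spread suggests the gaps are not random. A minimal sketch, with made-up columns:

```python
import pandas as pd

df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B"],
    "income": [50.0, None, None, None, 70.0],
})

# Missing-income rate per segment; unequal rates hint at MAR rather than MCAR
miss_by_segment = df["income"].isna().groupby(df["segment"]).mean()
```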
### Flagged Imputation (not silent)

```python
df['age_imputed_flag'] = df['age'].isnull()
df['age'] = df['age'].fillna(df['age'].median())
```

✅ Always attach flag columns for an audit trail
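The flag-then-fill pattern generalizes to a small helper so every imputed column gets its audit flag automatically; the helper name and signature are my own, not from the guide.

```python
import pandas as pd

def impute_with_flag(df, col, value):
    """Fill missing values in `col` while recording which rows were imputed."""
    df[f"{col}_imputed_flag"] = df[col].isna()
    df[col] = df[col].fillna(value)
    return df

df = pd.DataFrame({"age": [25.0, None, 40.0]})
df = impute_with_flag(df, "age", df["age"].median())
```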
## 5. Outlier Isolation and Trace Logging

### Outlier Profiles

```python
outliers = df[(df['price_z'] > 3) | (df['qty_z'] > 3)]
outliers.to_csv("outliers_flagged.csv")
```

✅ Export flagged records and annotate their source
✅ Review outliers separately with an analyst or SME
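The `price_z` column above presumes z-scores were already computed. One way to build them is the standard z-score below; for heavy-tailed data, a robust median/MAD variant is safer. The threshold of 1.5 is only for this tiny sample (the guide uses 3).

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 11.0, 9.0, 10.5, 250.0]})

# Standard z-score: distance from the mean in units of standard deviation
df["price_z"] = (df["price"] - df["price"].mean()) / df["price"].std()
outliers = df[df["price_z"].abs() > 1.5]
```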
## 6. Pre-Deployment Validation Pipeline

Checklist:

- [ ] Required fields not null?
- [ ] Numeric features within business bounds?
- [ ] All flags checked or explained?
- [ ] Any unexpected data types remaining?
- [ ] Versioned metadata saved?
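The checklist above can be enforced in code as a gate before export; the column names, bounds, and check names below are illustrative assumptions.

```python
import pandas as pd

def predeploy_checks(df):
    """Run checklist items and return {check_name: passed}."""
    return {
        "required_not_null": bool(df[["id", "age"]].notna().all().all()),
        "age_in_bounds": bool(df["age"].between(0, 120).all()),
        "age_not_object_dtype": df["age"].dtype != object,
    }

df = pd.DataFrame({"id": [1, 2], "age": [34.0, 61.0]})
results = predeploy_checks(df)
failed = [name for name, ok in results.items() if not ok]
```

Returning a dict of results (rather than asserting) lets the pipeline log exactly which checks failed before halting.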
## 7. Export and Logging Strategy

| Item | Tool / Method |
|---|---|
| Final cleaned dataset | `.to_parquet()` or `.to_csv()` |
| Cleaning log | Dict / JSON of transformation steps |
| Row-level audit trail | Include all `*_flag` fields |
| Dataset versioning | Timestamp + hash, or config YAML |
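A timestamp-plus-hash version tag from the table might be built like this; the file naming scheme is an assumption for illustration.

```python
import hashlib
from datetime import datetime, timezone
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "age": [34.0, 61.0]})

# Content hash + UTC timestamp = a reproducible, collision-resistant version tag
content = df.to_csv(index=False).encode()
digest = hashlib.sha256(content).hexdigest()[:8]
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
version = f"{stamp}-{digest}"
df.to_csv(f"clean_{version}.csv", index=False)  # hypothetical output path
```

Hashing the serialized content (not the filename) means the same data always yields the same digest, so re-exports are detectable.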
## Final QA Checklist: Part 2
- [ ] Schema snapshot and post-cleaning schema saved
- [ ] Validation rules passed or flagged
- [ ] Cross-field logic applied and checked
- [ ] All imputations logged with flags
- [ ] Outliers saved and reviewed
- [ ] Cleaning metadata export prepared
- [ ] Dataset certified for production or downstream use
## Final Tip

> "Advanced cleaning means defensible data: every imputation, logic break, and outlier deserves a record."

Use this guide after Part 1 Cleaning, or before a final model fit, export, or delivery.