Skip to content

Part 2 - Advanced


🎯 Purpose

This guide expands on foundational cleaning by introducing validation logic, schema integrity checks, imputation flagging, cross-field rules, and audit-ready structures for deployment or high-stakes modeling. It prepares datasets for field use, deployment, or downstream risk-sensitive analysis.


🧾 1. Schema and Metadata Enforcement

πŸ”Ή Schema Freeze and Comparison

expected_cols = ['id', 'name', 'age', 'start_date']
assert set(df.columns) == set(expected_cols)

βœ”οΈ Confirm ordering, data types, and required fields

πŸ”Ή Column Audit Logging

schema_snapshot = df.dtypes.to_dict()

βœ”οΈ Log pre/post cleaning schema as JSON or YAML


🚨 2. Field-Level Validation Rules

Rule Type Example
Categorical values state ∈ ['CA', 'NY', 'TX']
Numeric bounds age ∈ [0, 120], revenue > 0
Text pattern match Regex for zip codes, emails, phone
Datetime window start_date >= 2015, date < today()
df['age_flag'] = ~df['age'].between(0, 120)

βœ”οΈ Create Boolean flags for all rule violations


πŸ” 3. Cross-Field Logic Checks

πŸ”Ή Logical Consistency

df['date_logic_flag'] = df['end_date'] < df['start_date']
df['qty_logic_flag'] = (df['ordered'] == 1) & (df['quantity'] <= 0)

βœ”οΈ Use combined flags to create a row-level valid indicator


πŸ§ͺ 4. Advanced Missingness QA

Pattern Type Diagnostic Tool
MCAR (random) Heatmap, randomness visual check
MAR/MNAR Conditional probability, clustering
Structured gaps Compare to business rules

πŸ”Ή Flagged Imputation (not silent)

df['age_imputed_flag'] = df['age'].isnull()
df['age'] = df['age'].fillna(df['age'].median())

βœ”οΈ Always attach flag columns for audit trail


πŸ”¬ 5. Outlier Isolation + Trace Logging

πŸ”Ή Cooked Outlier Profiles

outliers = df[(df['price_z'] > 3) | (df['qty_z'] > 3)]
outliers.to_csv("outliers_flagged.csv")

βœ”οΈ Export flagged records and annotate source βœ”οΈ Review outliers separately with analyst or SME


πŸ“‹ 6. Pre-Deployment Validation Pipeline

Checklist:

  • [ ] Required fields not null?
  • [ ] Numeric features within business bounds?
  • [ ] All flags checked or explained?
  • [ ] Any unexpected data types remaining?
  • [ ] Versioned metadata saved?

πŸ“¦ 7. Export and Logging Strategy

Item Tool / Method
Final cleaned dataset .to_parquet() or .to_csv()
Cleaning log Dict / JSON of transformation steps
Row-level audit trail Include all *_flag fields
Dataset versioning Timestamp + hash or config YAML

πŸ“‹ Final QA Checklist β€” Part 2

  • [ ] Schema snapshot and post-cleaning schema saved
  • [ ] Validation rules passed or flagged
  • [ ] Cross-field logic applied and checked
  • [ ] All imputations logged with flags
  • [ ] Outliers saved and reviewed
  • [ ] Cleaning metadata export prepared
  • [ ] Dataset certified for production or downstream use

πŸ’‘ Final Tip

β€œAdvanced cleaning means defensible data β€” every imputation, logic break, and outlier deserves a record.”

Use this after: Part 1 Cleaning, or before final model fit, export, or delivery.