Cleaning Log Format
Purpose
This QuickRef helps you design and maintain structured logs for your cleaning and wrangling steps. Use these logs to track changes, support audits, or build reproducible pipelines.
1. What to Log
Item | Example |
---|---|
Schema snapshot | df.dtypes.to_dict() or YAML structure |
Missing value logic | "filled 'age' with median + created flag" |
Coercions | "'price' → float, 'date' → datetime" |
Outlier flags | "Z-score > 3 → flagged, not dropped" |
Column drops | "Dropped 'temp_id' due to >60% nulls" |
Cross-field rules | "start_date < end_date enforced" |
Validation failures | Export of row-level flags |
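The items in the table can be captured at the moment each step runs. The following is a minimal sketch, assuming pandas is installed and that the frame has `age`, `income`, `start_date`, and `end_date` columns; the function name `clean_with_log` and the flag column names are illustrative, not part of any library.

```python
import pandas as pd

def clean_with_log(df: pd.DataFrame) -> tuple[pd.DataFrame, list[dict]]:
    """Apply a few cleaning steps and record each one as a log entry."""
    df = df.copy()
    log = []

    # Schema snapshot: dtypes as strings so they serialize cleanly.
    log.append({"schema_snapshot": df.dtypes.astype(str).to_dict()})

    # Missing value logic: median fill plus an indicator column.
    df["age_imputed"] = df["age"].isna()
    df["age"] = df["age"].fillna(df["age"].median())
    log.append({"imputation": {"age": "median_fill",
                               "rows_filled": int(df["age_imputed"].sum())}})

    # Outlier flag: |z| > 3 is flagged, not dropped.
    z = (df["income"] - df["income"].mean()) / df["income"].std()
    df["income_outlier"] = z.abs() > 3
    log.append({"outliers": {"z_score_threshold": 3,
                             "flagged_fields": ["income"],
                             "rows_flagged": int(df["income_outlier"].sum())}})

    # Cross-field rule: flag rows where start_date is not before end_date.
    df["date_flag"] = ~(df["start_date"] < df["end_date"])
    log.append({"cross_field_rules": [
        {"rule": "start_date < end_date",
         "violations_flagged": int(df["date_flag"].sum())}]})
    return df, log
```

Recording counts next to each rule means the log can later be compared against the cleaned data to confirm nothing was silently dropped.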
2. YAML Format Template
dataset: customer_q2.csv
date_cleaned: 2025-05-14
cleaning_steps:
  - schema_snapshot:
      id: int64
      age: float64
      signup_date: datetime64[ns]
  - imputation:
      age: median_fill
      gender: filled with 'Missing'
  - dropped:
      - temp_id
  - type_coercion:
      price: string → float
      signup_date: object → datetime
  - outliers:
      z_score_threshold: 3
      flagged_fields: ['income', 'quantity']
  - cross_field_rules:
      - rule: start_date < end_date
        violations_flagged: 82 rows
Use this format for Git-tracked datasets or ML pipelines.
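One way to keep the template's shape consistent across projects is to build the log in code. The sketch below uses only the standard library; `CleaningLog` and `add_step` are hypothetical names chosen for illustration, and the structure assumes each step is a single-key mapping as in the template.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CleaningLog:
    """Hypothetical container mirroring the template's top-level keys."""
    dataset: str
    date_cleaned: str = field(default_factory=lambda: date.today().isoformat())
    cleaning_steps: list = field(default_factory=list)

    def add_step(self, name: str, details) -> None:
        # Each step is a single-key mapping, matching the template's list items.
        self.cleaning_steps.append({name: details})

    def to_dict(self) -> dict:
        return {
            "dataset": self.dataset,
            "date_cleaned": self.date_cleaned,
            "cleaning_steps": self.cleaning_steps,
        }

log = CleaningLog(dataset="customer_q2.csv", date_cleaned="2025-05-14")
log.add_step("imputation", {"age": "median_fill", "gender": "filled with 'Missing'"})
log.add_step("dropped", ["temp_id"])
log.add_step("outliers", {"z_score_threshold": 3,
                          "flagged_fields": ["income", "quantity"]})
record = log.to_dict()
```

The resulting dictionary can be handed to a YAML serializer. If PyYAML is available, `yaml.safe_dump(record, sort_keys=False)` should preserve the key order shown in the template.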
3. Markdown Log Template (Notebook-Friendly)
### Cleaning Log: customer_q2.csv
**Date:** 2025-05-14
**Analyst:** Garrett
#### Schema
- `id`: int (OK)
- `price`: object → float
#### Coercions
- `'price'` converted using regex clean → float
- `'signup_date'` parsed as datetime
#### Missingness
- `age`: 12% null → filled with median + `age_imputed` flag
- `gender`: 6% null → filled with 'Missing'
#### Outliers
- `income`, `duration`: Z-score > 3 flagged
#### Cross-Field Rules
- `start_date < end_date`: 82 violations → flagged in `date_flag`
#### Dropped Columns
- `temp_id` dropped (>60% nulls)
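A log in this layout can also be generated rather than typed by hand. The following is a rough sketch using only the standard library; the function `render_markdown_log` and its `sections` argument are assumptions made for illustration.

```python
def render_markdown_log(dataset: str, date_cleaned: str, analyst: str,
                        sections: dict[str, list[str]]) -> str:
    """Render a cleaning log as Markdown; `sections` maps heading -> bullet lines."""
    lines = [
        f"### Cleaning Log: {dataset}",
        f"**Date:** {date_cleaned}",
        f"**Analyst:** {analyst}",
    ]
    for heading, bullets in sections.items():
        lines.append(f"#### {heading}")
        lines.extend(f"- {b}" for b in bullets)
    return "\n".join(lines)

md = render_markdown_log(
    "customer_q2.csv", "2025-05-14", "Garrett",
    {
        "Missingness": ["`age`: 12% null, filled with median + `age_imputed` flag"],
        "Dropped Columns": ["`temp_id` dropped (>60% nulls)"],
    },
)
print(md)
```

In a Jupyter notebook, passing `md` to `IPython.display.Markdown` should render it as a formatted cell output.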
4. Export + Versioning
Format | Use Case |
---|---|
.yaml | Structured logs in pipeline repos |
.json | API or automation integrations |
.md | Notebook reports / versioned snapshots |
.csv | Row-level QA exports |
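As one possible approach to versioned exports, the sketch below writes the log as `.json` and the row-level flags as `.csv`, tagging both filenames with a short content hash. The function `export_log`, the file naming scheme, and the `row_id`/`flag` columns are assumptions, not a prescribed convention.

```python
import csv
import hashlib
import json
from pathlib import Path

def export_log(log: dict, flagged_rows: list[dict], out_dir: str) -> tuple[Path, Path]:
    """Write the log as JSON and row-level flags as CSV, with a content-hash suffix."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Stable serialization so the same log always yields the same hash.
    payload = json.dumps(log, indent=2, sort_keys=True)
    version = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]

    log_path = out / f"cleaning_log_{log['date_cleaned']}_{version}.json"
    log_path.write_text(payload, encoding="utf-8")

    flags_path = out / f"row_flags_{log['date_cleaned']}_{version}.csv"
    with flags_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["row_id", "flag"])
        writer.writeheader()
        writer.writerows(flagged_rows)
    return log_path, flags_path
```

Because the suffix is derived from the log content, re-exporting an unchanged log overwrites the same file, while any change to the cleaning logic produces a new filename.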
Logging Checklist
- [ ] Schema snapshot saved
- [ ] All coercions documented
- [ ] Null fill methods + flags recorded
- [ ] Outliers flagged and noted (not just removed)
- [ ] Cross-field rules logged
- [ ] Export format versioned and saved
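Parts of this checklist can be checked automatically. The sketch below assumes the log is a dictionary with a `cleaning_steps` list of single-key mappings, as in the YAML template; the `REQUIRED_STEPS` mapping and the function name are hypothetical. The final checklist item (export versioned and saved) is not covered, since it depends on the file system rather than the log content.

```python
# Hypothetical mapping from step keys to the checklist items they satisfy.
REQUIRED_STEPS = {
    "schema_snapshot": "Schema snapshot saved",
    "type_coercion": "All coercions documented",
    "imputation": "Null fill methods + flags recorded",
    "outliers": "Outliers flagged and noted",
    "cross_field_rules": "Cross-field rules logged",
}

def missing_checklist_items(log: dict) -> list[str]:
    """Return checklist items with no matching step in log['cleaning_steps']."""
    present = set()
    for step in log.get("cleaning_steps", []):
        present.update(step.keys())
    return [label for key, label in REQUIRED_STEPS.items() if key not in present]

example_log = {"cleaning_steps": [
    {"schema_snapshot": {"id": "int64"}},
    {"imputation": {"age": "median_fill"}},
]}
for item in missing_checklist_items(example_log):
    print(f"- [ ] {item}")
```

A check like this only confirms that a section exists; whether its contents are accurate still needs human review.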
Tip
"If your cleaning logic isn't logged, your dataset isn't trustworthy."