Skip to content

Cleaning Log Format


🎯 Purpose

This QuickRef helps you design and maintain structured logs for your cleaning and wrangling steps. Use these logs to track changes, support audits, or build reproducible pipelines.


πŸ—‚ 1. What to Log

Item Example
Schema snapshot df.dtypes.to_dict() or YAML structure
Missing value logic "filled 'age' with median + created flag"
Coercions "'price' β†’ float, 'date' β†’ datetime"
Outlier flags "Z-score > 3 β†’ flagged, not dropped"
Column drops "Dropped 'temp_id' due to >60% nulls"
Cross-field rules "start_date < end_date enforced"
Validation fails Export of row-level flags

πŸ“„ 2. YAML Format Template

dataset: customer_q2.csv
date_cleaned: 2025-05-14
cleaning_steps:
  - schema_snapshot:
      id: int64
      age: float64
      signup_date: datetime64[ns]
  - imputation:
      age: median_fill
      gender: filled with 'Missing'
  - dropped:
      - temp_id
  - type_coercion:
      price: string β†’ float
      signup_date: object β†’ datetime
  - outliers:
      z_score_threshold: 3
      flagged_fields: ['income', 'quantity']
  - cross_field_rules:
      - rule: start_date < end_date
        violations_flagged: 82 rows

βœ”οΈ Use this format for Git-tracked datasets or ML pipelines


πŸ“Š 3. Markdown Log Template (Notebook-Friendly)

### 🧼 Cleaning Log β€” customer_q2.csv
**Date:** 2025-05-14  
**Analyst:** Garrett

#### βœ… Schema
- `id`: int β†’ OK  
- `price`: object β†’ float βœ…

#### πŸ”„ Coercions
- `'price'` converted using regex clean β†’ float
- `'signup_date'` parsed as datetime

#### ❓ Missingness
- `age`: 12% null β†’ filled with median + `age_imputed` flag
- `gender`: 6% null β†’ filled with 'Missing'

#### 🚧 Outliers
- `income`, `duration` β†’ Z-score > 3 flagged

#### πŸ§ͺ Cross-Field Rules
- `start_date < end_date`: 82 violations β†’ flagged in `date_flag`

#### ❌ Dropped Columns
- `temp_id` dropped (>60% nulls)

πŸ’Ύ 4. Export + Versioning

Format Use Case
.yaml Structured logs in pipeline repos
.json API or automation integrations
.md Notebook reports / versioned snapshots
.csv Row-level QA exports

βœ… Logging Checklist

  • [ ] Schema snapshot saved
  • [ ] All coercions documented
  • [ ] Null fill methods + flags recorded
  • [ ] Outliers flagged and noted (not just removed)
  • [ ] Cross-field rules logged
  • [ ] Export format versioned and saved

πŸ’‘ Tip

β€œIf your cleaning logic isn’t logged, your dataset isn’t trustworthy.”