Wrangling QA
π― Purpose¶
This QuickRef provides a compact checklist and syntax guide for identifying structural and content issues during early data wrangling. It is ideal for validating schema alignment, type consistency, and common integrity flags before cleaning or transformation.
π¦ 1. Schema Snapshot & Type Review¶
# Compare expected vs actual columns
expected_cols = ['id', 'age', 'date', 'gender']
assert set(df.columns) == set(expected_cols)
# Check types
schema = df.dtypes.to_dict()
βοΈ Export .dtypes
or save as YAML/JSON for schema audit
βοΈ Use column order enforcement only if required
π’ 2. Type Coercion Checks¶
# Numeric type coercion with fallback
pd.to_numeric(df['price'], errors='coerce')
# Convert to datetime
pd.to_datetime(df['start_date'], errors='coerce')
# Boolean flags
(df['flag_col'].astype('bool'))
βοΈ Avoid silent failures β check for nulls after coercion βοΈ Always log column conversions for reproducibility
β 3. Unexpected Values & Format Checks¶
# Unique and frequent values
for col in df.columns:
print(df[col].value_counts(dropna=False).head())
# Regex pattern match
import re
df['zipcode'].str.match(r'^\d{5}$')
βοΈ Use this to detect rogue entries, typos, and mixed formats (e.g. '12.5%', 'N/A', '-')
π§ͺ 4. Logical Rule Violations (Cross-field)¶
# Start date should be before end date
invalid_date_rows = df[df['start_date'] > df['end_date']]
# Quantity must be > 0 if order status = 'active'
df['invalid_qty'] = (df['status'] == 'active') & (df['qty'] <= 0)
βοΈ Always create a Boolean flag for downstream QA or filtering βοΈ Use to catch high-risk row-level inconsistencies
π§ 5. Dirty Data Red Flags¶
- [ ] Unexpected nulls in primary keys or required fields
- [ ] Placeholder values like
-999
,'N/A'
,'?'
- [ ] Columns with only 1 unique value (constant)
- [ ] Columns with unique values for every row (likely IDs)
- [ ] Duplicate rows or near-duplicates
π§Ύ 6. Output & Logging Strategy¶
# Save schema snapshot
schema = df.dtypes.to_dict()
# Save validation results
violations = df[['start_date', 'end_date', 'invalid_qty']]
violations.to_csv("validation_violations.csv", index=False)
βοΈ Use structured logs for reproducibility + deployment readiness
β Mini Wrangling QA Checklist¶
- [ ] Schema matches expected columns and dtypes
- [ ] All coercions reviewed and logged
- [ ] Outlier values or typos detected
- [ ] Format rules enforced (e.g., date, ID, pattern)
- [ ] Logical rules applied (cross-field)
- [ ] Cleaning/export logs updated or saved
π‘ Tip¶
βWrangling is not just about fixing β itβs about proving the data is ready to be trusted.β