
Data Wrangling and Validation


🎯 Purpose

This guide consolidates all critical tasks for cleaning, validating, and preparing structured and semi-structured data before analysis or modeling. It supports pipeline-ready workflows and schema-compliant data management across EDA, ML, and deployment.


📦 1. Initial Structural Review

🔹 Shape and Schema Alignment

df.shape  # (rows, columns)
df.columns
df.info()
  • Confirm row/column counts against expectations
  • Check column presence, order, and spelling
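
A minimal expectation check, as a sketch; expected_columns is a hypothetical, hand-maintained list:

expected_columns = ['id', 'date', 'state', 'cost']  # hypothetical schema contract
missing = set(expected_columns) - set(df.columns)
unexpected = set(df.columns) - set(expected_columns)
print(f"Missing columns: {missing or 'none'}; unexpected: {unexpected or 'none'}")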

🔹 Basic Summary Overview

df.describe(include='all').T
  • Detect constant fields, unexpected types, empty categories
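
Constant fields can also be caught programmatically; a small sketch (dropna=False counts NaN as a distinct value):

constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
print(f"Constant columns: {constant_cols}")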

βœ”οΈ Save schema snapshot or comparison log for pipeline consistency


🔒 2. Data Type and Conversion Review

| Type | Action Needed |
| --- | --- |
| object | Detect true type (category, date, ID) |
| float64 | Check for rounding issues, missingness |
| int64 | Watch for encoding, special flags (e.g., -999) |
| bool | Confirm logic vs stringified booleans |
| datetime64 | Parse formats, detect tz-aware timestamps |

Snippet:

df['date'] = pd.to_datetime(df['date'], errors='coerce')      # unparseable dates become NaT
df['flag'] = df['flag'].map({'True': True, 'False': False})   # astype('bool') would turn any non-empty string into True

βœ”οΈ Log all casting and dtype overrides


🧹 3. Value Normalization and Cleaning

🔹 Standard Cleaning Tasks

  • Strip whitespace
  • Lowercase strings / apply title case
  • Remove symbols ($, %, #, etc.)
  • Parse mixed-format numerics (e.g., $1,200.00 → 1200.0)

df['name'] = df['name'].str.strip().str.lower()
df['cost'] = df['cost'].replace(r'[\$,]', '', regex=True).astype(float)  # raw string avoids an invalid-escape warning

βœ”οΈ Ensure consistent formatting across columns and rows


🧪 4. Missing Data Review

🔹 Diagnostics

df.isnull().sum()
df.isnull().mean().sort_values(ascending=False)

🔹 Strategy Options

| Case | Suggested Strategy |
| --- | --- |
| < 5% missing | Drop or fill with median/mode |
| > 30% missing | Consider column drop |
| Patterned missingness (MNAR, MAR) | Flag + model-based imputation |
| Isolated categorical gaps | Fill with "Missing" or new label |

βœ”οΈ Always report imputation logic in data dictionary or logs


πŸ“ 5. Validation Rule Checks

| Rule Type | Example |
| --- | --- |
| Allowed values | state ∈ ['CA', 'NY', 'TX'] |
| Value range bounds | age ∈ [0, 120], revenue >= 0 |
| Format constraints | Regex for emails, phone, postal codes |
| Cross-field logic | start_date < end_date, qty > 0 if ordered = 1 |

Snippet:

valid_states = ['CA', 'NY', 'TX']  # allowed values from the rule table above
df['valid_age'] = df['age'].between(0, 120)
df['flag_invalid_state'] = ~df['state'].isin(valid_states)

βœ”οΈ Build reusable validation functions and triggers for logging


πŸ” 6. Categorical Review and Encoding Prep

🔹 Cardinality and Balance

df['category'].value_counts(normalize=True)
df['category'].nunique()

βœ”οΈ Group rare levels as "Other" when frequency < 1–5% βœ”οΈ Use ordinal vs one-hot vs target encoding as appropriate


πŸ“ 7. Outlier Handling and Flagging

🔹 Visual and Statistical Detection

  • Boxplots, histograms, scatterplots
  • Z-score, IQR range check, MAD (robust)

import numpy as np
from scipy.stats import zscore

outliers = np.abs(zscore(df['feature'], nan_policy='omit')) > 3  # |z| > 3 is a common cutoff; 'omit' skips NaNs

βœ”οΈ Create outlier flag column for traceability βœ”οΈ Avoid automatic deletion without business logic


🧾 8. Schema and Audit Logging

| Task | Tool / Method |
| --- | --- |
| Capture schema snapshot | df.dtypes.to_dict() |
| Save pre/post cleaning diff | Use deepdiff, pandas-diff |
| Add processing metadata | Log source, date, imputer, notes |

βœ”οΈ Use structured metadata tracking per versioned dataset


📋 Master Wrangling + Validation Checklist

  • [ ] Column names + schema confirmed
  • [ ] Dtypes explicitly validated and cast
  • [ ] Missingness reviewed + handling strategy applied
  • [ ] Strings/numbers parsed + normalized
  • [ ] Outliers flagged and/or handled
  • [ ] Categorical levels grouped and encoded
  • [ ] Validation rules checked (range, format, logic)
  • [ ] Schema snapshot or cleaning log saved

💡 Final Tip

“Wrangling is not just cleanup — it’s structure, strategy, and traceability. Build once, validate always.”

Use before: Feature engineering, EDA, ML pipelines, or schema release.