# Data Wrangling and Validation

## 🎯 Purpose
This guide consolidates all critical tasks for cleaning, validating, and preparing structured and semi-structured data before analysis or modeling. It supports pipeline-ready workflows and schema-compliant data management across EDA, ML, and deployment.
## 📦 1. Initial Structural Review

### 🔹 Shape and Schema Alignment

```python
df.shape    # (rows, columns)
df.columns  # column names, order, spelling
df.info()   # dtypes, non-null counts, memory usage
```
- Confirm row/column counts against expectations
- Check column presence, order, and spelling
### 🔹 Basic Summary Overview

```python
df.describe(include='all').T
```
- Detect constant fields, unexpected types, empty categories
✔️ Save a schema snapshot or comparison log for pipeline consistency
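A minimal sketch of such a check against an expected snapshot (the `EXPECTED_SCHEMA` dictionary and column names here are illustrative, not from a real pipeline):

```python
import pandas as pd

# Hypothetical expected schema: column name -> dtype string
EXPECTED_SCHEMA = {'id': 'int64', 'date': 'object', 'cost': 'object'}

def check_schema(df: pd.DataFrame, expected: dict) -> list:
    """Return a list of human-readable schema discrepancies."""
    issues = []
    missing = set(expected) - set(df.columns)
    extra = set(df.columns) - set(expected)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
    if extra:
        issues.append(f"unexpected columns: {sorted(extra)}")
    for col, dtype in expected.items():
        if col in df.columns and str(df[col].dtype) != dtype:
            issues.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return issues
```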
## 🔢 2. Data Type and Conversion Review
| Type | Action Needed |
|---|---|
| `object` | Detect true type (category, date, ID) |
| `float64` | Check for rounding issues, missingness |
| `int64` | Watch for encoding, special flags (e.g., `-999`) |
| `bool` | Confirm logic vs. stringified booleans |
| `datetime64` | Parse formats, detect tz-aware timestamps |
Snippet:

```python
import pandas as pd

df['date'] = pd.to_datetime(df['date'], errors='coerce')  # unparseable dates become NaT
df['flag'] = df['flag'].astype('bool')  # caution: any non-empty string casts to True
```
✔️ Log all casting and dtype overrides
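One lightweight way to keep that log is a small casting helper (the function name and logging setup are illustrative):

```python
import logging

logging.basicConfig(level=logging.INFO)

def cast_and_log(df, col, dtype):
    """Cast a column and record the dtype change for the audit trail."""
    before = str(df[col].dtype)
    df[col] = df[col].astype(dtype)
    logging.info("cast %s: %s -> %s", col, before, dtype)
    return df

df = cast_and_log(df, 'flag', 'bool')
```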
## 🧹 3. Value Normalization and Cleaning

### 🔹 Standard Cleaning Tasks
- Strip whitespace
- Lowercase strings / apply title case
- Remove symbols (`$`, `%`, `#`, etc.)
- Parse mixed-format numerics (e.g., `$1,200.00` → `1200.0`)
```python
df['name'] = df['name'].str.strip().str.lower()
df['cost'] = df['cost'].replace(r'[\$,]', '', regex=True).astype(float)  # "$1,200.00" -> 1200.0
```
✔️ Ensure consistent formatting across columns and rows
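One way to enforce that consistency is to apply the same normalization to every string column in one pass (a sketch; it assumes all `object` columns hold text):

```python
# Normalize every object (string) column the same way
str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].apply(lambda s: s.str.strip().str.lower())
```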
## 🧪 4. Missing Data Review

### 🔹 Diagnostics

```python
df.isnull().sum()                                # missing count per column
df.isnull().mean().sort_values(ascending=False)  # missing share per column
```
### 🔹 Strategy Options
| Case | Suggested Strategy |
|---|---|
| < 5% missing | Drop or fill with median/mode |
| > 30% missing | Consider dropping the column |
| Patterned missingness (MNAR, MAR) | Flag + model-based imputation |
| Isolated categorical gaps | Fill with "Missing" or a new label |
✔️ Always report imputation logic in the data dictionary or logs
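A hedged example of the low-missingness and categorical strategies, keeping a flag column for traceability (column names are illustrative):

```python
# Numeric column: median fill plus a missingness flag
df['age_was_missing'] = df['age'].isnull()
df['age'] = df['age'].fillna(df['age'].median())

# Categorical column: make the gap an explicit level
df['segment'] = df['segment'].fillna('Missing')
```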
## 📏 5. Validation Rule Checks
| Rule Type | Example |
|---|---|
| Allowed values | `state ∈ ['CA', 'NY', 'TX']` |
| Value range bounds | `age ∈ [0, 120]`, `revenue >= 0` |
| Format constraints | Regex for emails, phones, postal codes |
| Cross-field logic | `start_date < end_date`; `qty > 0` if `ordered == 1` |
Snippet:

```python
valid_states = ['CA', 'NY', 'TX']  # example allowed set
df['valid_age'] = df['age'].between(0, 120)
df['flag_invalid_state'] = ~df['state'].isin(valid_states)
```
✔️ Build reusable validation functions and log rule violations
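A sketch of one such reusable check (the rule set below is hypothetical and mirrors the table above):

```python
import pandas as pd

def validate(df: pd.DataFrame, rules: dict) -> pd.Series:
    """Apply named boolean rules; return the violation count per rule."""
    return pd.Series(
        {name: int((~rule(df)).sum()) for name, rule in rules.items()},
        name='violations',
    )

rules = {
    'age_in_range':  lambda d: d['age'].between(0, 120),
    'state_allowed': lambda d: d['state'].isin(['CA', 'NY', 'TX']),
    'dates_ordered': lambda d: d['start_date'] < d['end_date'],
}
print(validate(df, rules))
```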
## 🔠 6. Categorical Review and Encoding Prep

### 🔹 Cardinality and Balance

```python
df['category'].value_counts(normalize=True)  # level frequencies
df['category'].nunique()                     # cardinality
```
✔️ Group rare levels as "Other" when frequency < 1–5%
✔️ Use ordinal vs. one-hot vs. target encoding as appropriate
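A minimal sketch of the rare-level grouping (the 1% cutoff is an assumption at the low end of that range):

```python
# Collapse levels seen in fewer than 1% of rows into "Other"
freq = df['category'].value_counts(normalize=True)
rare = freq[freq < 0.01].index
df['category'] = df['category'].where(~df['category'].isin(rare), 'Other')
```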
## 📈 7. Outlier Handling and Flagging

### 🔹 Visual and Statistical Detection
- Boxplots, histograms, scatterplots
- Z-score, IQR range check, MAD (robust)
```python
import numpy as np
from scipy.stats import zscore

outliers = np.abs(zscore(df['feature'])) > 3  # |z| > 3 flags potential outliers
```
✔️ Create an outlier flag column for traceability
✔️ Avoid automatic deletion without business logic
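A hedged sketch of the flag-column approach using the IQR rule (the column name and the conventional 1.5× fence are assumptions):

```python
# IQR fences: a robust alternative to z-scores for skewed features
q1, q3 = df['feature'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df['feature_outlier'] = ~df['feature'].between(lower, upper)
```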
## 🧾 8. Schema and Audit Logging
| Task | Tool / Method |
|---|---|
| Capture schema snapshot | `df.dtypes.to_dict()` |
| Save pre/post cleaning diff | Use `deepdiff`, `pandas-diff` |
| Add processing metadata | Log source, date, imputer, notes |
✔️ Use structured metadata tracking per versioned dataset
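One possible shape for that metadata record (the paths, keys, and values below are illustrative):

```python
import json
from datetime import datetime, timezone

metadata = {
    'source': 'raw/orders_2024.csv',  # hypothetical input path
    'processed_at': datetime.now(timezone.utc).isoformat(),
    'schema': {col: str(dtype) for col, dtype in df.dtypes.items()},
    'imputation': {'age': 'median', 'segment': 'constant "Missing"'},
    'notes': 'v1 cleaning pass',
}
with open('cleaning_log.json', 'w') as f:
    json.dump(metadata, f, indent=2)
```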
## 📋 Master Wrangling + Validation Checklist
- [ ] Column names + schema confirmed
- [ ] Dtypes explicitly validated and cast
- [ ] Missingness reviewed + handling strategy applied
- [ ] Strings/numbers parsed + normalized
- [ ] Outliers flagged and/or handled
- [ ] Categorical levels grouped and encoded
- [ ] Validation rules checked (range, format, logic)
- [ ] Schema snapshot or cleaning log saved
## 💡 Final Tip

"Wrangling is not just cleanup; it's structure, strategy, and traceability. Build once, validate always."
Use before: feature engineering, EDA, ML pipelines, or schema release.