Validation Logic
🎯 Purpose
This QuickRef outlines reusable logic for validating incoming datasets before modeling. It covers type checks, schema alignment, range tests, and conditional rules, and is suitable for notebooks or production pipelines.
🚦 1. When to Run Validation
| Trigger | Use Case |
|---|---|
| After data load | Ensure input structure and column names are as expected |
| After cleaning | Confirm transformations didn't break expectations |
| Before modeling | Catch missingness, bad encodings, or type issues |
| Before deployment | Validate stability of incoming data feeds |
🧪 2. Schema Validation
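import logging
import pandas as pd

logger = logging.getLogger(__name__)

# Explicit widths avoid the platform-dependent defaults of 'int' and 'float'.
expected_schema = {
    'age': 'int64',
    'gender': 'category',
    'income': 'float64',
    'signup_date': 'datetime64[ns]',
}
# df is the DataFrame loaded earlier in the pipeline.
for col, expected_type in expected_schema.items():
    if col not in df.columns:
        logger.warning(f"Missing column: {col}")
    elif not pd.api.types.is_dtype_equal(df[col].dtype, expected_type):
        logger.warning(f"Type mismatch: {col} is {df[col].dtype}, expected {expected_type}")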
✔️ Log mismatches and optionally halt based on criticality
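One way to act on criticality is to collect every mismatch first and raise only when a column marked as critical is affected. The sketch below is illustrative, not a fixed API: the `check_schema` helper, its `critical` parameter, and the sample data are all assumptions.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def check_schema(df, expected_schema, critical=()):
    """Return a list of (column, issue) pairs; raise if a critical column is affected."""
    problems = []
    for col, expected_type in expected_schema.items():
        if col not in df.columns:
            problems.append((col, "missing"))
        elif not pd.api.types.is_dtype_equal(df[col].dtype, expected_type):
            problems.append((col, f"expected {expected_type}, got {df[col].dtype}"))

    for col, issue in problems:
        logger.warning("Schema problem in %s: %s", col, issue)

    # Halt only when a column marked critical is missing or mistyped.
    critical_hits = [col for col, _ in problems if col in critical]
    if critical_hits:
        raise ValueError(f"Critical schema failures: {critical_hits}")
    return problems


# Hypothetical sample data for illustration only.
sample = pd.DataFrame({
    "age": pd.Series([34, 51], dtype="int64"),
    "income": ["52000", "61000"],  # strings where floats are expected
})
schema = {"age": "int64", "income": "float64", "signup_date": "datetime64[ns]"}
problems = check_schema(sample, schema, critical=("age",))
```

Here `age` is the only critical column and it passes, so the call returns the two non-critical problems instead of raising.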
3. Range Checks & Allowable Values
# Numeric ranges; NaN never satisfies between(), so missing ages also fail
assert df['age'].between(0, 120).all(), "Age out of bounds"
# Categorical sets
valid_genders = {'male', 'female', 'other'}
assert set(df['gender'].dropna().unique()).issubset(valid_genders), "Unexpected gender value"
✔️ Flag out-of-domain entries for audit or cleaning rerun
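An assertion stops at the first failure and does not say which rows are at fault. For audit purposes, a boolean mask per rule can pull out the offending rows instead. This is a minimal sketch on made-up data; the `reason` column and the sample values are assumptions.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "age": [25, 140, 37, -3],
    "gender": ["female", "male", "unknown", None],
})
valid_genders = {"male", "female", "other"}

# Boolean masks marking rows that violate each rule.
bad_age = ~df["age"].between(0, 120)
# Nulls are left to the null check, so only non-null unexpected values are flagged.
bad_gender = df["gender"].notna() & ~df["gender"].isin(valid_genders)

# Keep offending rows with a reason column for later audit.
flagged = pd.concat([
    df[bad_age].assign(reason="age out of bounds"),
    df[bad_gender].assign(reason="unexpected gender value"),
])
```

The `flagged` frame keeps the original index, so rows can be traced back to the source data or passed to a cleaning step.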
⚠️ 4. Null + Duplication Checks
null_report = df.isnull().sum()
if null_report.any():
    # Log only the columns that contain missing values
    logger.info("Missing values detected:\n%s", null_report[null_report > 0].to_string())
# Duplicate row check
duplicates = df.duplicated().sum()
if duplicates:
    logger.warning(f"Duplicate rows: {duplicates}")
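Raw counts are hard to judge on their own, so it can help to compare null rates against a tolerance and to look for duplicates on a key column rather than on whole rows. The sketch below assumes a `user_id` key and a `max_null_rate` threshold, neither of which comes from the checks above.

```python
import pandas as pd

# Hypothetical sample data and threshold for illustration only.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 4],
    "income": [52000.0, None, None, 48000.0],
    "age": [34, 51, 51, 29],
})
max_null_rate = 0.25  # assumed tolerance; tune per column in practice

# Fraction of missing values per column, keeping only columns over the tolerance.
null_rate = df.isnull().mean()
too_sparse = null_rate[null_rate > max_null_rate]

# keep=False marks every row of a duplicated key, not just the repeats.
dup_keys = df[df.duplicated(subset=["user_id"], keep=False)]
```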
5. Cross-field Rules
# Example: signup_date must be on or before last_activity
assert (df['signup_date'] <= df['last_activity']).all(), "signup_date is after last_activity"
✔️ Add business logic rules per field pair or group
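As rules accumulate, a small registry keeps them in one place and reports all failures in a single pass. The following is a sketch under assumptions: the `plan` and `monthly_fee` columns and the `paid_plan_has_fee` rule are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data; plan and monthly_fee are assumed columns.
df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10"]),
    "last_activity": pd.to_datetime(["2024-02-01", "2024-02-15", "2024-02-10"]),
    "plan": ["free", "paid", "paid"],
    "monthly_fee": [0.0, 9.99, 0.0],
})

# Each rule maps a name to a function returning True for rows that pass.
rules = {
    "signup_before_last_activity": lambda d: d["signup_date"] <= d["last_activity"],
    "paid_plan_has_fee": lambda d: (d["plan"] != "paid") | (d["monthly_fee"] > 0),
}

# Count failing rows per rule instead of stopping at the first failure.
violations = {name: int((~rule(df)).sum()) for name, rule in rules.items()}
```

Adding a rule is then a one-line change to the `rules` dictionary, and the loop that evaluates them stays untouched.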
Validation Checklist
- [ ] Schema and data types verified
- [ ] Required columns present
- [ ] Value ranges and categories confirmed
- [ ] Nulls and duplicates reviewed
- [ ] Cross-field constraints validated
- [ ] Logger or report generated for each run
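For the last checklist item, a per-run report can be as simple as a timestamped dictionary serialized to JSON. This is a minimal sketch; the `build_report` helper and the check names are hypothetical, and the outcomes are hard-coded only to show the shape of the output.

```python
import json
from datetime import datetime, timezone


def build_report(check_results):
    """Summarize named check outcomes (True means passed) for one validation run."""
    failed = sorted(name for name, passed in check_results.items() if not passed)
    return {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "checks": dict(check_results),
        "failed": failed,
        "passed": not failed,
    }


# Hypothetical outcomes keyed by checklist item.
report = build_report({
    "schema": True,
    "ranges": False,
    "nulls_duplicates": True,
    "cross_field": True,
})
report_json = json.dumps(report, indent=2)
```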
💡 Tip
“Validation is where silent bugs go to die, before they poison your models.”