Dirty Column

🎯 Purpose¶

Use this decision card to determine whether a column should be kept, cleaned, dropped, or flagged — based on its uniqueness, value patterns, format consistency, and downstream importance.

🔍 1. Uniqueness Checks¶

Column Type	Action
Only 1 unique value	Drop — provides no variance
Mostly unique values (e.g. 1 per row)	Likely an ID — exclude from modeling, retain as reference
Very few unique values (<10)	Review for categorical encoding

df.nunique().sort_values()

🔍 2. Format Problems¶

Symptom	Action
Contains mix of `%`, `$`, or strings	Normalize to numeric (use regex or `.replace`)
Long tail of weird strings (e.g. ‘N/A’, ‘–’, empty)	Coerce to `NaN`, flag for cleaning
Case and whitespace mismatches	`.str.lower().str.strip()` fixable

⚠️ 3. Dirty Data Patterns¶

Pattern	Suggestion
Placeholders like `-999`, `'missing'`, `'na'`	Convert to `np.nan` and document in log
Repeating junk (e.g. `'unknown', 'error', '—'`)	Clean or tag; consider drop if >50%
Broken date formats	Try coercion, else flag or drop

# Coerce and flag
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['bad_date_flag'] = df['date'].isnull()

⚖️ 4. Drop vs Clean Decision Rules¶

If...	Then...
Column has no variance	Drop it
>50% missing AND not critical	Drop OR document removal
Values are repairable and used in model	Clean + log
Mixed types with no clear pattern	Drop, or flag for SME review
Derived from another field (e.g. duplicate of `price_per_unit`)	Drop or document as proxy

✅ Column Review Checklist¶

[ ] Uniqueness profile reviewed
[ ] Placeholders and bad entries standardized
[ ] Mixed types normalized or removed
[ ] Date/ID fields coerced or excluded from model
[ ] Low-value or noisy columns dropped or documented

💡 Tip¶

“A dirty column might be an error — or the most important red flag in the dataset. Always check before you drop.”