Basic Cleaning Utilities
π― Purpose¶
This QuickRef provides fast, reusable Pandas patterns for common cleaning operations β including whitespace fixes, missing value handling, value replacement, type coercion, and row filtering.
π§Ή 1. Column Cleaning¶
# Strip whitespace from strings
df['name'] = df['name'].str.strip()
# Standardize case
df['email'] = df['email'].str.lower()
# Remove special characters
df['id_clean'] = df['id'].str.replace(r'[^A-Za-z0-9]', '', regex=True)
# Rename columns to lowercase
cols = [c.lower().strip().replace(' ', '_') for c in df.columns]
df.columns = cols
π 2. Value Replacement¶
# Replace known bad entries
df['gender'] = df['gender'].replace({'M': 'male', 'F': 'female', '': np.nan})
# Replace multiple values in one go
df.replace(['?', 'NA', 'n/a'], np.nan, inplace=True)
# Map values
df['state_code'] = df['state'].map({'IL': 1, 'WI': 2, 'MI': 3})
π§ͺ 3. Type Coercion¶
# Convert to datetime
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
# Force numeric (invalid = NaN)
df['income'] = pd.to_numeric(df['income'], errors='coerce')
# Change categorical dtype
df['category'] = df['category'].astype('category')
π§Ό 4. Missing Value Handling¶
# Count nulls
df.isnull().sum()
# Drop rows with nulls in any column
df.dropna()
# Drop rows with nulls in specific columns
df.dropna(subset=['email', 'age'])
# Fill nulls with median or constant
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna('unknown')
π 5. Row Filtering + Outlier Removal¶
# Remove negative ages
df = df[df['age'] >= 0]
# Remove duplicates
df.drop_duplicates(inplace=True)
# Clip values to range
df['score'] = df['score'].clip(lower=0, upper=100)
β Cleaning Checklist¶
- [ ] Whitespace + case normalized
- [ ] Known bad strings replaced or mapped
- [ ] Types coerced safely with fallback
- [ ] Nulls filled, flagged, or dropped
- [ ] Outliers, negatives, and duplicates removed
π‘ Tip¶
βStart with a strip, replace, map, fillna, and youβll already be ahead of 80% of dirty datasets.β