
Part 1 - Foundational


🎯 Purpose

This guide covers the essential, reusable tasks for turning raw or semi-structured data into an analysis-ready form. It is the first step in building clean, reliable datasets for exploratory analysis, early-stage modeling, and pipeline integration.


📦 1. Schema + Structure Alignment

🔹 Confirm dataset structure

df.shape  # (rows, columns)
df.columns
  • Validate column count, presence, and order
  • Compare against expected schema or dictionary
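
The checks above can be scripted. A minimal sketch, assuming a hypothetical `expected` column list standing in for your schema or data dictionary:

```python
import pandas as pd

# Toy frame standing in for the raw load
df = pd.DataFrame({'id': [1, 2], 'price': [9.5, 3.2], 'city': ['NY', 'LA']})

# Hypothetical expected schema; replace with your data dictionary
expected = ['id', 'price', 'city']

missing = set(expected) - set(df.columns)   # columns the load is lacking
extra = set(df.columns) - set(expected)     # columns not in the schema
in_order = list(df.columns) == expected     # order matters for some loaders

print(f"rows={df.shape[0]}, cols={df.shape[1]}")
print(f"missing={missing}, extra={extra}, order ok={in_order}")
```

Logging the three results separately makes it obvious whether a mismatch is a missing field, an unexpected one, or just reordering.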

🔹 Data types check

df.dtypes
df.info()

βœ”οΈ Log non-numeric fields or object types that require conversion


🔒 2. Data Type Enforcement

Common fixes:

  • object → string / category
  • object → datetime
  • Coerce numerics from symbols ($, %, etc.)

Example:

df['price'] = pd.to_numeric(df['price'].str.replace(r'[$,]', '', regex=True), errors='coerce')
df['date'] = pd.to_datetime(df['date'], errors='coerce')

βœ”οΈ Explicitly cast booleans, integers, and categories


🧹 3. Value Normalization

Text Cleanup

  • Strip whitespace, unify case, replace typos/symbols
df['city'] = df['city'].str.strip().str.lower()

Numeric Fixes

  • Coerce currency, percentages, bad encodings
  • Handle placeholder outliers (e.g. -999)

Common Tools:

.str.strip(), .str.replace(), pd.to_numeric(), .astype()
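
Putting those tools together, a hedged sketch; the column names, the `%` symbol, and the `-999` placeholder are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['  NYC ', 'nyc', 'LA'],
                   'discount': ['10%', '5%', '-999']})

# Text cleanup: strip whitespace and unify case
df['city'] = df['city'].str.strip().str.lower()

# Numeric fixes: drop the % symbol, coerce, then null out the placeholder
df['discount'] = pd.to_numeric(df['discount'].str.rstrip('%'), errors='coerce')
df['discount'] = df['discount'].replace(-999, np.nan)
```

Converting placeholders like -999 to NaN up front lets the missing-data step below handle them consistently with genuinely absent values.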

🧾 4. Missing Data Handling

🔹 Diagnostics

df.isnull().sum()
df.isnull().mean().sort_values(ascending=False)

🔹 Imputation Strategies

Type          Strategy
Numeric       Mean / Median fill
Categorical   Mode or "Missing" token
Timestamp     Use time reference or flag
df['age'] = df['age'].fillna(df['age'].median())
df['gender'] = df['gender'].fillna('Missing')

βœ”οΈ Track imputed fields with binary flags if needed


πŸ“ 5. Outlier Detection (Light)

🔹 Z-score or IQR-based detection

import numpy as np
from scipy.stats import zscore
z = np.abs(zscore(df.select_dtypes(include='number'), nan_policy='omit'))

βœ”οΈ Flag (not drop) unless critical


📊 6. Categorical Grouping

🔹 Group rare categories

freq = df['industry'].value_counts(normalize=True)
df['industry'] = df['industry'].where(df['industry'].map(freq) > 0.01, 'Other')

βœ”οΈ Use before encoding to reduce cardinality βœ”οΈ Review business logic to preserve interpretability


📋 Analyst Checklist — Part 1

  • [ ] Column names and schema matched
  • [ ] Dtypes checked and enforced
  • [ ] Currency / text normalized
  • [ ] Missingness reviewed and imputed
  • [ ] Z-score or IQR outliers flagged
  • [ ] Categorical levels grouped if needed
  • [ ] Dataset exported or copied to validated layer
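
For the final export item, a minimal sketch; the round-trip below uses an in-memory buffer, and in practice you would write to whatever validated-layer path and format your pipeline expects:

```python
import io
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'price': [9.5, 3.2]})

# Write the cleaned frame, then read it back to confirm it survives intact
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)
```

Re-reading the export and comparing it to the in-memory frame is a cheap last check that nothing was lost at the boundary.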

💡 Final Tip

“Part 1 cleaning is your analysis launchpad — focus on clarity, consistency, and low-risk fixes.”

Use this before: EDA, feature engineering, or modeling prep. Pair with: Validation Guide, Transformation Guide, and Cleaning Guide β€” Part 2.