Part 1 - Foundational
Purpose
This guide covers the essential, reusable tasks for turning raw or semi-structured data into an analysis-ready form. It is the first step in building clean, reliable datasets for exploratory analysis, early-stage modeling, and pipeline integration.
1. Schema + Structure Alignment
Confirm dataset structure
df.shape # (rows, columns)
df.columns
- Validate column count, presence, and order
- Compare against expected schema or dictionary
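The schema comparison described above could be sketched roughly as follows. This is a minimal illustration, not a prescribed method: the `EXPECTED_COLUMNS` list, the `compare_schema` helper, and the sample frame are all hypothetical names and made-up data.

```python
import pandas as pd

# Hypothetical expected schema: column names in their expected order.
EXPECTED_COLUMNS = ["id", "price", "date", "city"]


def compare_schema(df: pd.DataFrame, expected: list) -> dict:
    """Report missing, unexpected, and out-of-order columns."""
    actual = list(df.columns)
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    # Order is only compared over the columns both sides share.
    shared_actual = [c for c in actual if c in expected]
    shared_expected = [c for c in expected if c in actual]
    return {
        "missing": missing,
        "unexpected": unexpected,
        "order_matches": shared_actual == shared_expected,
    }


# Small illustrative frame (made-up data).
sample_df = pd.DataFrame(
    {"id": [1, 2], "city": ["Oslo", "Lima"], "price": ["$1", "$2"], "notes": ["a", "b"]}
)
report = compare_schema(sample_df, EXPECTED_COLUMNS)
print(report)
```

Returning a report instead of raising lets the analyst decide which mismatches are blocking and which can simply be logged.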
Data types check
df.dtypes
df.info()
✔️ Log non-numeric fields or object types that require conversion
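One way to produce that log, assuming a plain printout is enough, is to list every column that pandas does not already treat as numeric. The sample frame and its column names are made up for illustration.

```python
import pandas as pd

# Made-up sample data: two text columns that should not stay as text.
sample_df = pd.DataFrame({
    "price": ["$10", "$12.50"],
    "qty": [1, 2],
    "signup": ["2024-01-05", "2024-02-11"],
})

# Columns whose dtype is not numeric are candidates for conversion.
non_numeric = sample_df.select_dtypes(exclude="number").columns.tolist()
for col in non_numeric:
    print(f"{col}: dtype={sample_df[col].dtype}, example={sample_df[col].iloc[0]!r}")
```

Printing one example value per column usually makes the needed conversion (currency, date, category) obvious at a glance.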
2. Data Type Enforcement
Common fixes:
- object → string / category
- object → datetime
- Coerce numerics from symbols ($, %, etc.)
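Example:
import pandas as pd
# Raw string avoids the invalid '\$' escape; inside [] the $ is a literal character
df['price'] = pd.to_numeric(df['price'].replace(r'[$,]', '', regex=True))
df['date'] = pd.to_datetime(df['date'], errors='coerce')
✔️ Explicitly cast booleans, integers, and categories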
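The explicit casts mentioned above might look like the sketch below. It assumes pandas' nullable `boolean` and `Int64` dtypes are acceptable for the dataset; the column names and the yes/no mapping are hypothetical.

```python
import pandas as pd

# Made-up sample data.
sample_df = pd.DataFrame({
    "is_active": ["yes", "no", "yes"],
    "visits": [3.0, None, 7.0],
    "segment": ["a", "b", "a"],
})

# Map text labels to booleans explicitly rather than relying on truthiness
# (any non-empty string, including "no", is truthy in Python).
sample_df["is_active"] = (
    sample_df["is_active"].map({"yes": True, "no": False}).astype("boolean")
)

# Nullable Int64 keeps missing values as <NA> instead of forcing floats.
sample_df["visits"] = sample_df["visits"].astype("Int64")

# Category dtype for low-cardinality text.
sample_df["segment"] = sample_df["segment"].astype("category")

print(sample_df.dtypes)
```

Labels absent from the mapping become missing values, which is usually easier to spot than a silently wrong boolean.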
3. Value Normalization
Text Cleanup
- Strip whitespace, unify case, replace typos/symbols
df['city'] = df['city'].str.strip().str.lower()
Numeric Fixes
- Coerce currency, percentages, bad encodings
- Replace sentinel placeholders (e.g. -999) with NaN so they are treated as missing rather than as real values
Common Tools:
.str.strip(), .str.replace(), pd.to_numeric(), .astype()
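As a rough illustration of the numeric fixes listed above, the sketch below converts percentage strings to fractions and turns a -999 sentinel into a missing value. The column names and sample values are made up, and dividing by 100 assumes the target representation is a fraction.

```python
import numpy as np
import pandas as pd

# Made-up sample data.
sample_df = pd.DataFrame({
    "discount": ["15%", " 7.5% ", "n/a"],
    "score": [82, -999, 91],
})

# Strip whitespace and the percent sign, then coerce; unparseable text becomes NaN.
cleaned = sample_df["discount"].str.strip().str.replace("%", "", regex=False)
sample_df["discount"] = pd.to_numeric(cleaned, errors="coerce") / 100

# Treat the sentinel as missing rather than as a real measurement.
sample_df["score"] = sample_df["score"].replace(-999, np.nan)

print(sample_df)
```

Converting sentinels before computing means or z-scores keeps the placeholder from distorting those statistics.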
4. Missing Data Handling
Diagnostics
df.isnull().sum()
df.isnull().mean().sort_values(ascending=False)
Imputation Strategies
| Type | Strategy |
|---|---|
| Numeric | Mean / Median fill |
| Categorical | Mode or "Missing" token |
| Timestamp | Use time reference or flag |
df['age'] = df['age'].fillna(df['age'].median())
df['gender'] = df['gender'].fillna('Missing')
✔️ Track imputed fields with binary flags if needed
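A binary flag of this kind has to be computed before the fill, otherwise the information about which rows were missing is lost. A minimal sketch, with a hypothetical `age_was_imputed` column name and made-up data:

```python
import pandas as pd

# Made-up sample data with two missing ages.
sample_df = pd.DataFrame({"age": [34.0, None, 52.0, None]})

# Record which rows were missing before filling, so the flag survives imputation.
sample_df["age_was_imputed"] = sample_df["age"].isna().astype(int)
sample_df["age"] = sample_df["age"].fillna(sample_df["age"].median())

print(sample_df)
```

Downstream models can then use the flag as a feature, or analysts can filter on it to check how sensitive a result is to the imputation.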
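5. Outlier Detection (Light)
Z-score or IQR-based detection
import numpy as np
from scipy.stats import zscore
# nan_policy='omit' keeps missing values from turning a whole column's scores into NaN
z = np.abs(zscore(df.select_dtypes(include='number'), nan_policy='omit'))
✔️ Flag outliers rather than dropping them, unless removal is critical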
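The heading above also mentions IQR-based detection. A minimal sketch of that variant follows, assuming the conventional 1.5 × IQR fences; the `income` column, the `income_outlier` flag name, and the data are made up.

```python
import pandas as pd

# Made-up sample data with one obviously extreme value.
sample_df = pd.DataFrame({"income": [40, 42, 45, 47, 50, 52, 400]})

q1 = sample_df["income"].quantile(0.25)
q3 = sample_df["income"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag rather than drop: the new column records which rows fall outside the fences.
sample_df["income_outlier"] = ~sample_df["income"].between(lower, upper)

print(sample_df)
```

Note that `between` returns False for missing values, so rows with NaN would also be flagged by this negation. Handling missingness first avoids that.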
6. Categorical Grouping
Group rare categories
freq = df['industry'].value_counts(normalize=True)
# Map each row to its category's share so the mask aligns with the rows of df
df['industry'] = df['industry'].where(df['industry'].map(freq) > 0.01, 'Other')
✔️ Use before encoding to reduce cardinality
✔️ Review business logic to preserve interpretability
Analyst Checklist - Part 1
- [ ] Column names and schema matched
- [ ] Dtypes checked and enforced
- [ ] Currency / text normalized
- [ ] Missingness reviewed and imputed
- [ ] Z-score or IQR outliers flagged
- [ ] Categorical levels grouped if needed
- [ ] Dataset exported or copied to validated layer
Final Tip
“Part 1 cleaning is your analysis launchpad: focus on clarity, consistency, and low-risk fixes.”
Use this before: EDA, feature engineering, or modeling prep. Pair with: Validation Guide, Transformation Guide, and Cleaning Guide - Part 2.