EDA Startup

🎯 Purpose

This QuickRef provides a concise, repeatable checklist and syntax starter for exploratory data analysis. It includes structural diagnostics, missingness checks, distribution metrics, and common statistical diagnostics, all optimized for notebook use.


📦 1. Data Snapshot & Structure

# Shape and preview
print(df.shape)
df.head()

# Info and column types
df.info()
df.dtypes

# Unique values per column
df.nunique()

✔️ Confirm row/column count matches expectations
✔️ Identify object-type columns needing conversion
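
One way to find object-type columns that need conversion is to measure how many of their values parse as numbers. The sketch below is illustrative only: the toy frame, the helper name `numeric_parse_rate`, and the 0.5 cutoff are assumptions, not part of the original checklist.

```python
import pandas as pd

# Toy frame (made-up data) standing in for `df`
df = pd.DataFrame({
    "price": ["10.5", "20", "bad"],    # mostly numeric strings
    "city": ["Oslo", "Lima", "Pune"],  # genuinely text
    "qty": [1, 2, 3],                  # already numeric, ignored
})

def numeric_parse_rate(frame):
    """Share of non-null values in each text-like column that parse as numbers."""
    rates = {}
    for col in frame.columns:
        s = frame[col]
        is_text = pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s)
        non_null = s.dropna()
        if not is_text or non_null.empty:
            continue
        # errors="coerce" turns unparseable entries into NaN instead of raising
        parsed = pd.to_numeric(non_null, errors="coerce")
        rates[col] = parsed.notna().mean()
    return pd.Series(rates, dtype=float)

rates = numeric_parse_rate(df)
# Assumed cutoff: at least half of the values look numeric
candidates = rates[rates >= 0.5].index.tolist()
print(rates)
print(candidates)
```

Columns in `candidates` are worth inspecting by hand before converting, since the unparseable values may be typos, sentinel strings, or units embedded in the text.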


❓ 2. Missingness Diagnostics

# Count and percentage
missing = df.isnull().sum()
missing_pct = df.isnull().mean() * 100

# Visual (if needed)
import missingno as msno
msno.matrix(df)

✔️ Flag columns >30% null for review
✔️ Consider imputation or removal if MCAR (Missing Completely At Random)
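
The 30% rule can be applied directly to the missing-percentage series. This is a minimal sketch on made-up data; the names `NULL_THRESHOLD_PCT`, `missing_report`, and `flagged` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up data) standing in for `df`
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],  # 50% missing
    "b": [1.0, 2.0, 3.0, np.nan],     # 25% missing
    "c": ["x", "y", "z", "w"],        # 0% missing
})

NULL_THRESHOLD_PCT = 30  # review threshold from the checklist

# One row per column, worst offenders first
missing_report = pd.DataFrame({
    "missing": df.isnull().sum(),
    "missing_pct": df.isnull().mean() * 100,
}).sort_values("missing_pct", ascending=False)

# Columns above the threshold are flagged for review, not dropped automatically
flagged = missing_report.index[
    missing_report["missing_pct"] > NULL_THRESHOLD_PCT
].tolist()
print(missing_report)
print(flagged)
```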


📊 3. Descriptive Stats Summary

# Core summary
summary = df.describe(include='all').T

# Shape + outlier preview
summary[['mean', 'std', 'min', 'max']]

✔️ Review for unit mismatches, implausible values, and flat distributions
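
Flat distributions can be hard to spot in a `describe()` table. One possible check, sketched here on made-up data, is the share of rows taken by each column's most frequent value. The 0.9 cutoff and the names `DOMINANCE_THRESHOLD`, `top_share`, and `flat_cols` are assumptions for illustration.

```python
import pandas as pd

# Toy frame (made-up data) standing in for `df`
df = pd.DataFrame({
    "constant": [7] * 20,             # a single value everywhere
    "mostly_one": ["a"] * 19 + ["b"], # 95% one value
    "varied": list(range(20)),        # all values distinct
})

DOMINANCE_THRESHOLD = 0.9  # assumed cutoff for "flat"

# value_counts sorts by frequency, so the first entry is the dominant value's share
top_share = df.apply(
    lambda s: s.value_counts(normalize=True, dropna=False).iloc[0]
)
flat_cols = top_share[top_share >= DOMINANCE_THRESHOLD].index.tolist()
print(top_share)
print(flat_cols)
```

Because `dropna=False` is set, a column that is mostly null is also reported as flat, which ties this check back to the missingness step.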


📈 4. Distribution & Skew Checks

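from scipy.stats import skew, kurtosis

# Skew and kurtosis for all numeric columns (NaNs dropped per column)
num = df.select_dtypes('number')
skews = num.apply(lambda s: skew(s.dropna()))
# fisher=False returns Pearson kurtosis (normal = 3), matching the threshold below;
# scipy's default (fisher=True) returns excess kurtosis (normal = 0)
kurtoses = num.apply(lambda s: kurtosis(s.dropna(), fisher=False))

Metric          Interpretation
Skew > 1        Highly right-skewed
Skew < -1       Highly left-skewed
Kurtosis > 3    Heavy tails / outlier prone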

✔️ Columns with high skew are candidates for a log or Yeo-Johnson transform
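
To judge whether a transform helps, skew can be compared before and after. The sketch below uses synthetic lognormal data and the hypothetical column name `income`. `np.log1p` requires values greater than -1, while `scipy.stats.yeojohnson` also accepts zero and negative values and fits its lambda by maximum likelihood when none is given.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew, yeojohnson

# Synthetic right-skewed data (made up for illustration)
rng = np.random.default_rng(0)
s = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000), name="income")

# Log transform: log(1 + x), defined only for x > -1
log_s = np.log1p(s)

# Yeo-Johnson: returns the transformed values and the fitted lambda
yj_values, yj_lambda = yeojohnson(s.to_numpy())

skew_report = pd.Series({
    "raw": skew(s),
    "log1p": skew(log_s),
    "yeo_johnson": skew(yj_values),
})
print(skew_report.round(2))
print(f"fitted lambda: {yj_lambda:.3f}")
```

If the fitted lambda will be reused on new data, it should be stored and passed back via the `lmbda` argument rather than refitted.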


📐 5. Correlation Matrix

# Pearson (linear)
df.corr(numeric_only=True).round(2)

# Spearman (rank-based: monotonic or ordinal relationships)
df.corr(method='spearman', numeric_only=True)

✔️ Identify highly correlated (|r| > 0.85) field pairs to review for redundancy
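
Scanning a large matrix by eye is error-prone, so the highly correlated pairs can be pulled out as a list instead. This sketch uses synthetic data; the column names and the variables `CORR_THRESHOLD` and `high_pairs` are illustrative. It keeps only the upper triangle so that each pair appears once and the diagonal is excluded.

```python
import numpy as np
import pandas as pd

# Synthetic frame (made-up data): x_scaled is nearly a linear copy of x
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_scaled": 2 * x + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})

CORR_THRESHOLD = 0.85  # threshold from the checklist

# Absolute values so strong negative correlations are caught too
corr = df.corr(numeric_only=True).abs()

# Boolean mask for the upper triangle; k=1 skips the diagonal of 1.0s
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# NaN entries (masked cells) compare as False and drop out here
high_pairs = pairs[pairs > CORR_THRESHOLD].sort_values(ascending=False)
print(high_pairs)
```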


🧪 6. Outlier Detection (Stats-Based)

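from scipy.stats import zscore

# nan_policy='omit' keeps a single NaN from turning a whole column's z-scores into NaN
z = df.select_dtypes('number').apply(zscore, nan_policy='omit')
outliers = (z.abs() > 3).sum()  # count of |z| > 3 per column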

✔️ Flag for review, but don't remove without domain justification
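
The boxplot rule is a common complement to z-scores: it flags values more than 1.5 × IQR beyond the quartiles. The sketch below applies it to made-up data; the variable names are illustrative.

```python
import pandas as pd

# Toy frame (made-up data) standing in for `df`
df = pd.DataFrame({
    "value": [10, 11, 12, 11, 10, 12, 11, 100],  # one obvious extreme
    "other": [1, 2, 3, 4, 5, 6, 7, 8],
})

num = df.select_dtypes("number")
q1 = num.quantile(0.25)
q3 = num.quantile(0.75)
iqr = q3 - q1

# Tukey fences: the same limits a boxplot uses for its whiskers
outside = (num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)

iqr_outliers = outside.sum()             # count per column
flagged_rows = df[outside.any(axis=1)]   # rows to review, not to delete
print(iqr_outliers)
print(flagged_rows)
```

In this toy column the value 100 has a z-score of only about 2.6, because the extreme value inflates the standard deviation, so a |z| > 3 rule would miss it while the IQR rule flags it. Small samples are where the two methods are most likely to disagree.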


✅ Mini Summary Checklist

  • [ ] .info() and .dtypes reviewed
  • [ ] Missing data % calculated and visualized
  • [ ] Descriptive stats exported
  • [ ] Skew/kurtosis reviewed (high = transform candidate)
  • [ ] Correlation matrix reviewed
  • [ ] Outliers flagged via z-score or boxplot

💡 Tip

"EDA isn't about beautifying data; it's about uncovering what you don't know yet."