EDA Startup
Purpose
This QuickRef provides a concise, repeatable checklist and syntax starter for exploratory data analysis. It includes structural diagnostics, missingness checks, distribution metrics, and common statistical diagnostics, all optimized for notebook use.
1. Data Snapshot & Structure
# Shape and preview
print(df.shape)
df.head()
# Info and column types
df.info()
df.dtypes
# Unique values per column
df.nunique()
✔️ Confirm row/column count matches expectations
✔️ Identify object-type columns needing conversion
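One way to spot object-type columns that need conversion is to test how many of their values parse as numbers. The sketch below is only an illustration: the toy frame, the helper `numeric_like_share`, and the 0.7 cutoff are assumptions, not part of the original checklist.

```python
import pandas as pd

# Toy frame standing in for df (made-up values, for illustration only)
df = pd.DataFrame({
    "price": ["10.5", "11", "n/a", "12.25"],   # numbers stored as text
    "city": ["Oslo", "Lima", "Pune", "Oslo"],  # genuinely categorical
})

def numeric_like_share(s: pd.Series) -> float:
    """Share of non-null values that parse as numbers (hypothetical helper)."""
    non_null = s.dropna()
    if non_null.empty:
        return 0.0
    return float(pd.to_numeric(non_null, errors="coerce").notna().mean())

# Text-like columns: object dtype or a dedicated string dtype
text_cols = [
    col for col in df.columns
    if pd.api.types.is_object_dtype(df[col]) or pd.api.types.is_string_dtype(df[col])
]
shares = {col: numeric_like_share(df[col]) for col in text_cols}

# 0.7 is an assumed cutoff; tune it to the dataset
candidates = [col for col, share in shares.items() if share >= 0.7]
print(shares)
print(candidates)
```

Columns that land in `candidates` can then be converted with `pd.to_numeric(df[col], errors="coerce")` once the unparseable values have been inspected.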
2. Missingness Diagnostics
# Count and percentage
missing = df.isnull().sum()
missing_pct = df.isnull().mean() * 100
# Visual (if needed)
import missingno as msno
msno.matrix(df)
✔️ Flag columns >30% null for review
✔️ Consider imputation or removal if MCAR (Missing Completely At Random)
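The 30% rule can be applied in one pass by combining the count and percentage into a single report. This is a minimal sketch on a made-up frame; the names `missing_report`, `THRESHOLD_PCT`, and `flagged` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (made-up values, for illustration only)
df = pd.DataFrame({
    "age": [34, np.nan, 29, 41, np.nan],
    "income": [52000, 61000, np.nan, 58000, 49000],
    "id": [1, 2, 3, 4, 5],
})

# One row per column: missing count and percentage, worst first
missing_report = pd.DataFrame({
    "n_missing": df.isnull().sum(),
    "pct_missing": df.isnull().mean() * 100,
}).sort_values("pct_missing", ascending=False)

THRESHOLD_PCT = 30  # review threshold from the checklist above
missing_report["review"] = missing_report["pct_missing"] > THRESHOLD_PCT
flagged = missing_report.index[missing_report["review"]].tolist()
print(missing_report)
```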
3. Descriptive Stats Summary
# Core summary
summary = df.describe(include='all').T
# Shape + outlier preview
summary[['mean', 'std', 'min', 'max']]
✔️ Review for unit mismatches, implausible values, and flat distributions
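Flat distributions are easy to miss in a `describe()` table. A rough way to surface them is to measure how much of each column is taken up by its most frequent value. The frame and the 0.8 cutoff below are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for df (made-up values, for illustration only)
df = pd.DataFrame({
    "status": ["ok"] * 6,
    "flag": [0, 0, 0, 0, 0, 1],
    "height_cm": [172.0, 165.5, 180.2, 158.9, 169.4, 175.1],
})

# Share of rows taken by the single most frequent value in each column
top_share = df.apply(
    lambda s: s.value_counts(normalize=True, dropna=False).iloc[0]
)

# Columns with one distinct value carry no information
constant_cols = df.columns[df.nunique(dropna=False) <= 1].tolist()

NEAR_CONSTANT = 0.8  # assumed cutoff; adjust to the dataset
near_constant_cols = top_share[top_share >= NEAR_CONSTANT].index.tolist()
print(constant_cols, near_constant_cols)
```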
4. Distribution & Skew Checks
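from scipy.stats import skew, kurtosis
# Example: for all numeric columns (NaNs dropped per column so results are not NaN)
num = df.select_dtypes('number')
skews = num.apply(lambda s: skew(s.dropna()))
# fisher=False returns Pearson kurtosis (normal = 3), matching the threshold below
kurtoses = num.apply(lambda s: kurtosis(s.dropna(), fisher=False))
Metric | Interpretation |
---|---|
Skew > 1 | Highly right-skewed |
Skew < -1 | Highly left-skewed |
Kurtosis > 3 | Heavier tails than a normal distribution; outlier prone |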
✔️ High skew suggests candidates for a log or Yeo-Johnson transform
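To judge whether a transform helps, compare skew before and after. The sketch below uses a synthetic right-skewed series (not real data) with `np.log1p` and `scipy.stats.yeojohnson`; the series name and variables are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew, yeojohnson

# Synthetic right-skewed data, for illustration only
rng = np.random.default_rng(0)
s = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000), name="amount")

# log1p needs values greater than -1
log_s = np.log1p(s)

# Yeo-Johnson also accepts zero and negative values; lambda is fitted by MLE
yj_values, lmbda = yeojohnson(s.to_numpy())
yj_s = pd.Series(yj_values, index=s.index, name="amount_yj")

comparison = pd.Series({
    "raw": skew(s),
    "log1p": skew(log_s),
    "yeo_johnson": skew(yj_s),
})
print(comparison.round(2))
```

Keep the fitted `lmbda` if the same transform must be applied later to new data.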
5. Correlation Matrix
# Pearson (linear)
df.corr(numeric_only=True).round(2)
# Spearman (rank-based: monotonic or ordinal relationships)
df.corr(method='spearman', numeric_only=True)
✔️ Identify highly correlated (|r| > 0.85) fields to review for redundancy
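Scanning a full matrix by eye gets tedious with many columns. One possible approach is to keep only the upper triangle and list the pairs above the threshold. The data here is synthetic and the column names are made up.

```python
import numpy as np
import pandas as pd

# Synthetic frame: x_scaled is built from x, noise is independent
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_scaled": 2 * x + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})

# Absolute correlations, so strong negative pairs are caught too
corr = df.corr(numeric_only=True).abs()

# Keep the strict upper triangle: each pair once, no self-correlations
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

pairs = upper.stack()
high_pairs = pairs[pairs > 0.85].sort_values(ascending=False)
print(high_pairs)
```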
6. Outlier Detection (Stats-Based)
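from scipy.stats import zscore
# nan_policy='omit' ignores NaNs; the default would turn a whole column into NaN
z = df.select_dtypes('number').apply(zscore, nan_policy='omit')
# Count of values more than 3 standard deviations from the column mean
outliers = (z.abs() > 3).sum()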
✔️ Flag for review, but don't remove without domain justification
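The IQR rule that boxplots use is a common complement to z-scores, since a single extreme value inflates the standard deviation and can hide itself in small samples. A minimal sketch on made-up data, using the conventional 1.5 × IQR fences:

```python
import pandas as pd

# Toy frame standing in for df (made-up values, for illustration only)
df = pd.DataFrame({
    "value": [10, 12, 11, 13, 12, 11, 10, 95],
    "other": [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1],
})

num = df.select_dtypes("number")
q1 = num.quantile(0.25)
q3 = num.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles, the same rule boxplot whiskers use
outside = (num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)
iqr_outliers = outside.sum()
print(iqr_outliers)
```

Use `df[outside.any(axis=1)]` to inspect the flagged rows rather than dropping them.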
Mini Summary Checklist
- [ ] `.info()` and `.dtypes` reviewed
- [ ] Missing data % calculated and visualized
- [ ] Descriptive stats exported
- [ ] Skew/kurtosis reviewed (high = transform candidate)
- [ ] Correlation matrix reviewed
- [ ] Outliers flagged via z-score or boxplot
Tip
"EDA isn't about beautifying data; it's about uncovering what you don't know yet."