General Guidebook
Purpose
This guidebook extends your Exploratory Data Analysis (EDA) skills with advanced techniques to deepen your understanding of complex datasets. It focuses on practical methods to detect anomalies, understand feature interactions, and prepare data for sophisticated modeling.
1. Deepen Data Understanding
Review Basic Checks
Start with a quick recap of dataset structure and quality:
df.info()                     # dtypes, non-null counts, memory usage
df.describe(include='all')    # summary statistics for numeric and categorical columns
df.isnull().sum()             # missing-value count per column
2. Feature Engineering & Transformation
Create New Features
Generate features that capture important patterns:
- Date/time components (year, month, day)
- Interaction terms (e.g., the product of two variables)
- Aggregated features (group means, counts)
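A minimal sketch of each idea, assuming hypothetical columns order_date (a datetime), price, quantity, and store:
df['order_year'] = df['order_date'].dt.year                               # date/time component
df['order_month'] = df['order_date'].dt.month
df['price_x_qty'] = df['price'] * df['quantity']                          # interaction term
df['store_mean_price'] = df.groupby('store')['price'].transform('mean')  # aggregated feature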
Transform Variables
Apply transformations to improve model performance:
import numpy as np
import pandas as pd
df['log_feature'] = np.log1p(df['feature'])         # log(1 + x); safe for zero values
df['binned_feature'] = pd.qcut(df['feature'], q=4)  # quartile bins with roughly equal counts
3. Advanced Distribution & Relationship Checks
Multivariate Relationships
Explore how variables interact beyond simple correlations:
import seaborn as sns
sns.pairplot(df, hue='target')                       # pairwise scatter plots colored by the target
sns.heatmap(df.corr(numeric_only=True), annot=True)  # numeric_only avoids errors on non-numeric columns
Feature Interactions
Use group statistics and pivot tables to understand combined effects:
df.groupby(['cat1', 'cat2'])['num'].mean().unstack()  # mean of num for every cat1/cat2 combination
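The pivot-table form of the same idea; a minimal sketch using the hypothetical cat1, cat2, and num columns from above:
df.pivot_table(values='num', index='cat1', columns='cat2', aggfunc='mean')  # same table as the groupby/unstack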
4. Detecting Anomalies & Outliers
Practical Outlier Detection
- Use IQR or Z-score for initial checks.
- Apply Isolation Forest or Local Outlier Factor for complex patterns:
import numpy as np
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.05, random_state=42)  # expect roughly 5% of rows to be anomalous
df['anomaly'] = iso.fit_predict(df.select_dtypes(include=[np.number]))  # -1 = outlier, 1 = inlier
Flag and review outliers carefully before deciding on removal or correction.
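For the initial IQR check from the first bullet, a minimal sketch assuming a numeric column named feature:
q1, q3 = df['feature'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr                     # the conventional 1.5 * IQR fences
outliers = df[(df['feature'] < lower) | (df['feature'] > upper)]  # rows to review, not delete blindly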
When to Use Advanced Methods
- Use Mahalanobis distance to detect multivariate outliers when variables are correlated.
- Isolation Forest works well on high-dimensional data without assuming any particular distribution.
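A minimal sketch of the Mahalanobis approach, assuming the numeric columns contain no missing values; the chi-squared cutoff shown is one common convention:
import numpy as np
from scipy.stats import chi2
X = df.select_dtypes(include=[np.number]).to_numpy()
diff = X - X.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))    # pseudo-inverse tolerates near-singular covariance
md2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # squared Mahalanobis distance per row
df['multivariate_outlier'] = md2 > chi2.ppf(0.975, df=X.shape[1])  # squared distances ~ chi-squared under normality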
5. Handling Missing Data
Visualize Patterns
Identify missingness structure:
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')  # bright cells mark missing values
Imputation Strategies
- Simple: fill with mean, median, or mode.
- Advanced: use KNN or model-based imputation when missingness is non-random.
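As a sketch of the advanced option, scikit-learn's KNNImputer fills each gap from similar rows (shown here on the numeric columns only):
import numpy as np
from sklearn.impute import KNNImputer
num_cols = df.select_dtypes(include=[np.number]).columns
imputer = KNNImputer(n_neighbors=5)                 # impute from the 5 nearest rows in feature space
df[num_cols] = imputer.fit_transform(df[num_cols])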
6. Scaling, Encoding & Dimensionality Reduction
Scaling
Standardize or normalize features when models are sensitive to scale:
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df.select_dtypes(include=[np.number]))  # NumPy array of z-scored columns
Encoding
Convert categorical variables to numeric:
- One-hot encoding for nominal categories.
- Ordinal encoding for ordered categories.
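A minimal sketch of both encodings, assuming hypothetical columns color (nominal) and size (ordered small/medium/large):
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df = pd.get_dummies(df, columns=['color'], drop_first=True)          # one-hot; drop_first avoids a redundant column
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])  # explicit order for the ordinal column
df['size_encoded'] = encoder.fit_transform(df[['size']]).ravel()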
Dimensionality Reduction
Use PCA or t-SNE to simplify data and visualize patterns:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
components = pca.fit_transform(df_scaled)  # project the scaled array onto two components
print(pca.explained_variance_ratio_)       # share of variance each component captures
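t-SNE follows the same pattern on the scaled array; a minimal sketch (the embedding is for visualization, not for modeling):
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embedding = tsne.fit_transform(df_scaled)  # 2-D coordinates for plotting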
Summary Checklist
| Task | Tool / Method | Notes |
|---|---|---|
| Data review | info(), describe() | Confirm types, missing data |
| Feature engineering | Interaction terms, binning | Capture new patterns |
| Relationships | Pairplot, heatmap, group stats | Explore variable interactions |
| Outlier detection | IQR, Z-score, Isolation Forest | Detect anomalies, review carefully |
| Missing data | Heatmap, imputation | Understand and handle missingness |
| Scaling & encoding | StandardScaler, one-hot | Prepare data for modeling |
| Dimensionality reduction | PCA, t-SNE | Simplify and visualize complex data |