Advanced Guidebook

🌟 Purpose

This guidebook expands on the standard ML & Classifier Models Guidebook, focusing on advanced strategies for building, optimizing, and validating classification models within the Analyst Toolkit Vault. It is intended to deepen analyst proficiency and improve decision-making in real-world modeling scenarios.


📊 1. Model Selection Strategy

✅ Core Considerations:

  • Problem framing: binary, multiclass, multilabel
  • Data size and shape: small tabular vs large/high-dimensional
  • Model interpretability: white-box vs black-box preference
  • Computation budget: lightweight models vs ensemble stacks
  • Downstream usage: automation, risk scoring, reporting

🎯 Flowchart: Choosing a Classifier

  • [ ] (Add visual: binary vs multiclass ➔ interpretability needed? ➔ ensemble / kernel / neural?)
  • Examples:
      • Binary, interpretable → Logistic Regression
      • Multiclass, nonlinear, scalable → Gradient Boosting
      • Text data, categorical → Naive Bayes / Random Forest

📉 2. Handling Class Imbalance

Techniques:

  • Resampling Methods:
      • SMOTE / Borderline-SMOTE / ADASYN
      • Random Oversampling / Undersampling
      • SMOTE + Tomek Links (combine oversampling and cleaning)
  • Model-Side Fixes (see the sketch after this list):
      • Class weights (class_weight='balanced')
      • Custom loss weighting (in neural nets, XGBoost, LightGBM)
      • Adjusting decision thresholds (post-model tuning)
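
A minimal sketch of the two approaches, assuming the imbalanced-learn package is available and X_train / y_train are predefined training arrays:

from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Option A: oversample the minority class (training split only, never the test set)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Option B: keep the data as-is and reweight the loss instead
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)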

Diagnostics:

  • Precision-Recall Curve (more informative than ROC when classes are skewed; see the sketch after this list)
  • PR-AUC alongside ROC-AUC (a high ROC-AUC paired with a low PR-AUC flags weak minority-class performance)
  • Confusion matrix heatmap with per-class accuracy
  • Sensitivity/specificity matrix for medical/risk contexts
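
To produce the diagnostics above, a short sketch assuming a fitted classifier clf with predict_proba and a held-out X_test / y_test:

from sklearn.metrics import precision_recall_curve, average_precision_score

proba = clf.predict_proba(X_test)[:, 1]          # positive-class probabilities
precision, recall, thresholds = precision_recall_curve(y_test, proba)
pr_auc = average_precision_score(y_test, proba)  # PR-AUC summary statistic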

⚖️ 3. Cross-Validation & Evaluation Strategy

Common CV Methods:

  • K-Fold CV (standard): evenly splits data, assumes i.i.d.
  • Stratified K-Fold: preserves class ratios in classification
  • Repeated Stratified K-Fold: adds robustness
  • TimeSeriesSplit: prevents leakage in temporal datasets (see the sketch after this list)
  • Leave-One-Out CV (LOOCV): high variance, slow but thorough
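
For the temporal case, a minimal TimeSeriesSplit sketch (assumes rows are already sorted by time and model, X, y are defined):

from sklearn.model_selection import TimeSeriesSplit, cross_val_score

tscv = TimeSeriesSplit(n_splits=5)   # each fold trains on the past, validates on the next block
scores = cross_val_score(model, X, y, cv=tscv, scoring='f1_macro')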

Nested CV

Use nested CV when hyperparameters are tuned and an unbiased performance estimate is still needed: the inner loop selects hyperparameters, the outer loop scores the tuned model.

from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
inner_cv = StratifiedKFold(n_splits=3)   # inner loop tunes hyperparameters
outer_cv = StratifiedKFold(n_splits=5)   # outer loop estimates performance
search = GridSearchCV(model, param_grid, cv=inner_cv, scoring='f1_macro')  # param_grid assumed
scores = cross_val_score(search, X, y, cv=outer_cv, scoring='f1_macro')

Metric Strategy:

Use multiple metrics rather than a single score (see the cross_validate sketch after this list):

  • f1_macro alongside accuracy (accuracy alone misleads on imbalanced data)
  • roc_auc_ovr for multiclass problems
  • log_loss for probability-calibrated models
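
One way to score several metrics in a single pass is cross_validate; a sketch assuming model, X, y are defined (neg_log_loss requires predict_proba; add roc_auc_ovr for multiclass):

from sklearn.model_selection import cross_validate

results = cross_validate(model, X, y, cv=5,
                         scoring=['accuracy', 'f1_macro', 'neg_log_loss'])
print(results['test_f1_macro'].mean())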

⚙️ 4. Hyperparameter Tuning

Search Tools:

  • GridSearchCV — exhaustive, slow but thorough
  • RandomizedSearchCV — efficient sampling, good baseline (see the sketch after this list)
  • optuna, skopt, bayes_opt — modern Bayesian optimization
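
A minimal RandomizedSearchCV sketch with a hypothetical search space (X_train, y_train assumed defined):

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {'n_estimators': randint(100, 500), 'max_depth': randint(3, 12)}
search = RandomizedSearchCV(RandomForestClassifier(), param_dist,
                            n_iter=20, cv=5, scoring='f1_macro', random_state=42)
search.fit(X_train, y_train)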

Ensemble-Aware Parameters:

| Model Type          | Important Params                                                    |
|---------------------|---------------------------------------------------------------------|
| Logistic Regression | C, penalty, solver                                                  |
| Random Forest       | max_depth, n_estimators, min_samples_leaf                           |
| Gradient Boosting   | learning_rate, n_estimators, max_depth, subsample, colsample_bytree |
| SVM                 | C, kernel, gamma, degree                                            |
| KNN                 | n_neighbors, weights, metric                                        |

Example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

params = {'max_depth': [3, 5, 7], 'min_samples_leaf': [1, 5]}
search = GridSearchCV(RandomForestClassifier(), params, cv=5, scoring='f1_macro')
search.fit(X_train, y_train)

🔢 5. Feature Selection & Multicollinearity

Manual Techniques:

  • Correlation heatmaps / pairwise inspection
  • Drop high-VIF columns (VIF > 5 or VIF > 10; see the sketch after this list)
  • Domain knowledge pruning
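
A minimal VIF sketch, assuming statsmodels is installed and X is a numeric DataFrame:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif[vif > 10])   # candidates to drop or combine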

Automated Feature Selection:

  • SelectKBest, SelectPercentile
  • f_classif (ANOVA F-test), chi2, mutual information
  • RFE, RFECV (recursive feature elimination)
  • L1-penalized models (Logistic with penalty='l1'; see the SelectFromModel sketch below)

Sample:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

selector = SelectKBest(mutual_info_classif, k=20)   # keep the 20 most informative features
X_selected = selector.fit_transform(X, y)
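
For the L1 route listed above, a sketch using SelectFromModel; C=0.1 is a hypothetical strength (smaller C zeroes out more coefficients):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

l1_logit = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(l1_logit).fit(X, y)   # keeps features with nonzero coefficients
X_l1 = selector.transform(X)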

👁️ 6. Interpreting Black-Box Classifiers

SHAP — Shapley Additive Explanations:

  • Explains both global feature importance and local, per-prediction impact
  • TreeExplainer covers tree ensembles efficiently; KernelExplainer is a slower, model-agnostic fallback for other model types

import shap

explainer = shap.TreeExplainer(model)        # fast, exact attributions for tree ensembles
shap_values = explainer.shap_values(X_test)  # per-sample, per-feature contributions
shap.summary_plot(shap_values, X_test)       # global view of importance and direction

LIME — Local Interpretable Model-agnostic Explanations:

  • Explains a single prediction by fitting an interpretable surrogate model locally around that instance

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(X_train.values, feature_names=list(X.columns),
                                 mode='classification')
exp = explainer.explain_instance(X_test.iloc[0].values, model.predict_proba)
exp.show_in_notebook()

Other:

  • Permutation importance (see the sketch after this list)
  • Feature impact over time (model monitoring)
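
A permutation importance sketch with scikit-learn's built-in utility, assuming a fitted model and a held-out X_test / y_test DataFrame:

import pandas as pd
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=42, scoring='f1_macro')
importances = pd.Series(result.importances_mean, index=X_test.columns)
print(importances.sort_values(ascending=False).head(10))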

🏛️ 7. Production Readiness & Risk Control

Probability Calibration:

Ensure predict_proba outputs match observed event frequencies:

  • Platt scaling (fit a logistic model on outputs)
  • Isotonic regression (non-parametric)

from sklearn.calibration import CalibratedClassifierCV

# isotonic needs ample data; use method='sigmoid' (Platt) for small sets
calibrated = CalibratedClassifierCV(model, method='isotonic', cv=5)
calibrated.fit(X_train, y_train)

Cutoff Selection:

  • Use the ROC or PR curve to choose an operating point
  • Use a cost matrix to weigh FP vs FN losses (see the sketch after this list)
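
A minimal cost-based cutoff sweep; the FP/FN costs are hypothetical placeholders that should come from the business context, and calibrated, X_val, y_val are assumed defined:

import numpy as np

proba = calibrated.predict_proba(X_val)[:, 1]
cost_fp, cost_fn = 1.0, 5.0                      # hypothetical: a miss costs 5x a false alarm
thresholds = np.linspace(0.01, 0.99, 99)
costs = [cost_fp * np.sum((proba >= t) & (y_val == 0)) +
         cost_fn * np.sum((proba < t) & (y_val == 1)) for t in thresholds]
best_t = thresholds[int(np.argmin(costs))]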

Drift Monitoring:

  • Track input distribution drift (PSI or KS test; see the PSI sketch after this list)
  • Track model performance decay (ROC, PR, log-loss)
  • Store snapshots of data profiles during training
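
A minimal PSI sketch for the drift check above; bins are fixed from the training snapshot, the 1e-6 floor avoids log-of-zero, and values outside the training range are ignored by np.histogram:

import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between training and live samples of one feature."""
    edges = np.histogram_bin_edges(expected, bins=bins)   # bin on the training snapshot
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Common rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift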

📚 8. Reference Patterns & Notebook Templates

  • Logistic regression grid search + ROC thresholding
  • Random Forest with SMOTE + SHAP summary
  • Tree boosting pipeline with GridSearchCV + calibration
  • KNN with scaling, voting heatmap, PR curve
  • Model comparison template: ROC, PR, confusion matrix
  • Notebook for SHAP + permutation plots for production audit

📅 TODO:

  • [ ] Add calibration threshold visualizer template
  • [ ] Add classification cost matrix integration examples
  • [ ] Add fast model audit checklist for field deployment
  • [ ] Create flowchart cheat sheet: "Which Classifier Should I Use?"