# ML & Classifier Guidebook
## Purpose
This guidebook documents practical machine learning strategies for classification tasks. It complements the EDA Guidebook series and supports modeling decisions, diagnostics, and deployment-readiness in your Analyst Toolkit Vault.
## 1. Classifier Overview
Model | Use Case | Notes |
---|---|---|
Logistic Regression | Baseline binary classification | Interpretable, fast, good for audit |
Naive Bayes | Text, categorical-rich datasets | Assumes conditional independence |
K-Nearest Neighbors | Small datasets, intuitive clusters | Not scalable, sensitive to scaling |
Decision Tree | Explainable rules | Prone to overfitting |
Random Forest | Robust baseline | Better generalization, less explainable |
Gradient Boosting | Tabular SOTA | Powerful, less interpretable |
SVM | High-dimensional, margin-based separation | Kernels add non-linearity; harder to scale or tune |
Neural Networks | Complex, large-scale patterns | Requires more data, tuning, and infra |
## 2. Model Fit & Evaluation Metrics
### Standard Metrics
- Accuracy
- Precision / Recall
- F1 Score
- ROC-AUC
- Log-Loss
- Confusion Matrix
### Metric Guidelines
Metric | Best for |
---|---|
Precision | Avoiding false positives |
Recall | Avoiding false negatives |
F1 Score | Balancing precision and recall |
ROC-AUC | Ranking power of predictions |
Log-Loss | Accuracy of predicted probabilities |
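A minimal sketch of computing these metrics with scikit-learn, assuming `y_test`, `y_pred`, and `y_proba` come from a fitted binary classifier as in the examples below:

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, log_loss, confusion_matrix,
)

# y_proba[:, 1] is the predicted probability of the positive class
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_proba[:, 1]))
print("Log-Loss :", log_loss(y_test, y_proba))
print(confusion_matrix(y_test, y_pred))
```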
## 3. Classifier Breakdowns
### Logistic Regression
- Assumes a linear relationship between predictors and the log-odds
- Uses the sigmoid function to map the linear score to a probability
```python
from sklearn.linear_model import LogisticRegression

# max_iter raised to avoid convergence warnings on larger datasets
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)          # hard class labels
y_proba = model.predict_proba(X_test)   # class probabilities
```
Key Strengths:
- Interpretable coefficients (odds ratios)
- Fast and stable for binary classification
Watch for:
- Multicollinearity (check VIF; see the sketch below)
- Assumes linearity in the log-odds
- Won't capture non-linear relationships without manual feature engineering
Best for:
- Risk scoring, churn prediction, interpretable pipelines
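A quick multicollinearity check via VIF; a sketch assuming `X_train` is a pandas DataFrame of numeric predictors and `statsmodels` is installed:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF per feature; values above roughly 5-10 flag problematic collinearity
vif = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print(vif.sort_values(ascending=False))
```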
### Naive Bayes Classifier
#### Types of Naive Bayes
- GaussianNB → continuous features (assumes normality)
- MultinomialNB → count data (e.g., word counts)
- BernoulliNB → binary features (e.g., yes/no, present/absent)
```python
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)  # often overconfident; calibrate if needed
```
Assumptions:
- Features are conditionally independent
- Each feature follows a known distribution
Best for:
- Text classification such as spam or sentiment (see the pipeline sketch below)
- Categorical/tabular data with weakly correlated features
Cautions:
- Doesn't model feature interactions
- Probability estimates can be overconfident when the independence assumption is violated
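For the text use case, MultinomialNB is typically paired with a vectorizer; a sketch assuming `docs_train` and `docs_test` are lists of raw strings (hypothetical names) with labels `y_train`:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# The vectorizer turns raw text into word counts; NB classifies the counts
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(docs_train, y_train)
y_pred = text_clf.predict(docs_test)
```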
### K-Nearest Neighbors (KNN)
- Distance-based voting classifier
- Lazy learner: no training phase, all work happens at prediction time
```python
from sklearn.neighbors import KNeighborsClassifier

# k=5 is a common starting point; tune n_neighbors by cross-validation
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```
Best for:
- Clean, small datasets
- Non-linear decision boundaries
Needs:
- Scaled features, since distance metrics assume comparable units (see the pipeline sketch below)
Watch for:
- Irrelevant features and high dimensionality, which dilute the distance signal
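Because KNN is distance-based, fold the scaler into a pipeline so it is fit only on the training data; a minimal sketch:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling happens inside the pipeline, so cross-validation stays leak-free
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```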
### Decision Trees
- Rule-based model that recursively splits data
```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=3)  # depth cap guards against overfitting
model.fit(X_train, y_train)
```
Strengths:
- Fully interpretable
- Handles categorical and numeric data
Watch for:
- Overfitting on small datasets
- Poor generalization without pruning (see the pruning sketch below)
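One way to prune in scikit-learn is cost-complexity pruning via `ccp_alpha`; a sketch where the alpha value is illustrative, to be chosen by cross-validating over `path.ccp_alphas`:

```python
from sklearn.tree import DecisionTreeClassifier

# Candidate alphas for this dataset; larger alpha prunes more aggressively
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)  # 0.01 is illustrative
pruned.fit(X_train, y_train)
```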
### Random Forest
- Ensemble of decision trees (bagging)
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```
Strengths:
- Better generalization than a single tree (bagging averages out variance)
- Robust to noise and outliers
Drawbacks:
- Slower than logistic regression or Naive Bayes
- Less interpretable; lean on feature importances or SHAP (see the sketch below)
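Impurity-based importances come for free on a fitted forest; a sketch assuming `model` is the forest fitted above and `X_train` is a DataFrame (for column names):

```python
import pandas as pd

# Impurity-based importances; consider permutation_importance as a cross-check
importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```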
### Gradient Boosting (XGBoost, LightGBM)
- Builds trees sequentially, each new tree correcting the residual errors of the ensemble so far
```python
from xgboost import XGBClassifier

model = XGBClassifier()  # sensible defaults; tune before trusting
model.fit(X_train, y_train)
```
Pros:
- High accuracy on tabular data
- Handles mixed numeric and categorical features (categoricals usually need encoding; see Section 5)
Cons:
- Less interpretable
- Can overfit if not tuned; early stopping helps (see the sketch below)
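Early stopping against a validation split is the usual guard against overfitting; a sketch using the xgboost >= 1.6 constructor API and assuming a separate `X_val` / `y_val` split (hypothetical names):

```python
from xgboost import XGBClassifier

# Stop adding trees once validation loss fails to improve for 50 rounds
model = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    early_stopping_rounds=50,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
```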
### Support Vector Machine (SVM)
```python
from sklearn.svm import SVC

model = SVC(probability=True)  # probability=True enables predict_proba
model.fit(X_train, y_train)
```
Notes:
- Works well in high-dimensional spaces
- Needs scaled features
- Kernel SVC scales poorly to large datasets (see the linear sketch below)
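When the dataset outgrows kernel SVC, `LinearSVC` scales much better; a sketch that also wraps it in `CalibratedClassifierCV` to recover `predict_proba`, which `LinearSVC` lacks:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Linear SVM trains in roughly linear time; calibration adds probabilities
svm = make_pipeline(
    StandardScaler(),
    CalibratedClassifierCV(LinearSVC()),
)
svm.fit(X_train, y_train)
y_proba = svm.predict_proba(X_test)
```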
### Neural Networks (MLPClassifier)
```python
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)  # one hidden layer
model.fit(X_train, y_train)
```
Best for:
- Complex, non-linear patterns
Needs:
- Plenty of data, careful tuning, and standardized inputs
Not ideal for:
- Interpretability
- Small tabular datasets
## 4. Visual & Diagnostic Tools
Tool | Purpose | Notes |
---|---|---|
ROC Curve | Rank-based performance (y_proba) | Best for binary classifiers |
Confusion Matrix | Classification accuracy by class | Text or heatmap format |
Precision/Recall Curve | Performance under imbalance | Complements ROC in skewed datasets |
Calibration Plot | How well predicted probabilities match reality | Good for risk modeling |
Log-Loss Distribution | Understanding prediction confidence | Lower is better |
SHAP / Feature Importance | Interpretability for black-box models | Best for tree-based models |
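scikit-learn's display helpers cover the first three rows of this table directly; a minimal sketch assuming a fitted classifier `model` and a held-out `X_test` / `y_test`:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay, PrecisionRecallDisplay, RocCurveDisplay,
)

RocCurveDisplay.from_estimator(model, X_test, y_test)            # rank-based performance
PrecisionRecallDisplay.from_estimator(model, X_test, y_test)     # imbalance-aware view
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)     # per-class heatmap
plt.show()
```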
## 5. Preprocessing Needs by Model
Model | Needs Scaling? | Handles Categorical? | Notes |
---|---|---|---|
Logistic | Yes | No | One-hot or otherwise encode categoricals |
Naive Bayes | No | Yes (Multinomial/Bernoulli) | GaussianNB assumes roughly normal features |
KNN | Yes | No | Highly sensitive to feature scale |
Decision Tree | No | Not in sklearn | Encode categoricals; splits are scale-invariant |
Random Forest | No | Not in sklearn | Encode categoricals; splits are scale-invariant |
Gradient Boosting | No | Varies | LightGBM/CatBoost handle categoricals natively; XGBoost needs encoding |
SVM | Yes | No | Requires numeric, scaled input |
Neural Net (MLP) | Yes | No | Standardize inputs for stability |
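A ColumnTransformer bundles the scaling and encoding this table calls for into one pipeline; a sketch assuming hypothetical `numeric_cols` and `categorical_cols` lists:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Scale numerics for the linear model; one-hot encode the categoricals
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
clf = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
```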
## Next Steps
- [ ] Add classifier-specific runner functions (logistic, NB, tree, etc.)
- [ ] Add diagnostic quickrefs per model (conf matrix, AUC, etc.)
- [ ] Add SHAP + interpretability demos for tree models
- [ ] Add flowcharts for choosing a classifier