ML & Classifier Guidebook

🌟 Purpose

This guidebook documents practical machine learning strategies for classification tasks. It complements the EDA Guidebook series and supports modeling decisions, diagnostics, and deployment-readiness in your Analyst Toolkit Vault.


📆 1. Classifier Overview

| Model | Use Case | Notes |
| --- | --- | --- |
| Logistic Regression | Baseline binary classification | Interpretable, fast, good for audit |
| Naive Bayes | Text, categorical-rich datasets | Assumes conditional independence |
| K-Nearest Neighbors | Small datasets, intuitive clusters | Not scalable, sensitive to scaling |
| Decision Tree | Explainable rules | Prone to overfitting |
| Random Forest | Robust baseline | Better generalization, less explainable |
| Gradient Boosting | Tabular SOTA | Powerful, less interpretable |
| SVM | High-dimensional linear separation | Harder to scale or tune |
| Neural Networks | Complex, large-scale patterns | Requires more data, tuning, and infra |

🔍 2. Model Fit & Evaluation Metrics

Standard Metrics

  • Accuracy
  • Precision / Recall
  • F1 Score
  • ROC-AUC
  • Log-Loss
  • Confusion Matrix

Metric Guidelines

| Metric | Best for |
| --- | --- |
| Precision | Avoiding false positives |
| Recall | Avoiding false negatives |
| F1 Score | Balancing precision and recall |
| ROC-AUC | Ranking power of predictions |
| Log-Loss | Accuracy of predicted probabilities |
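
All of the standard metrics above are available in sklearn.metrics. A minimal sketch, assuming a binary problem where y_test, y_pred, and y_proba come from one of the fitted classifiers shown later in this guide:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss, confusion_matrix)

# Hard-label metrics
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))

# Probability-based metrics (y_proba has one column per class)
print("ROC-AUC  :", roc_auc_score(y_test, y_proba[:, 1]))
print("Log-Loss :", log_loss(y_test, y_proba))

# Per-class breakdown
print(confusion_matrix(y_test, y_pred))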

📊 3. Classifier Breakdowns

🔷 Logistic Regression

  • Assumes linear relationship between predictors and log-odds
  • Uses sigmoid function to map values to probabilities
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)          # hard class labels
y_proba = model.predict_proba(X_test)   # class probabilities (needed for ROC-AUC, log-loss, calibration)

Key Strengths:

  • Interpretable coefficients (odds ratios; see the sketch below)
  • Fast and stable for binary classification

Watch for:

  • Multicollinearity (use VIF)
  • Assumes linearity in log-odds
  • Doesn't handle non-linearity well

Best for:

  • Risk scoring, churn prediction, interpretable pipelines
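
Because the coefficients live on the log-odds scale, exponentiating them gives odds ratios per feature. A quick sketch, assuming the model fitted above and that X_train is a pandas DataFrame (the column names are an assumption, not part of the snippet above):

import numpy as np
import pandas as pd

# Each value is the multiplicative change in odds for a one-unit increase in that feature
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X_train.columns)
print(odds_ratios.sort_values(ascending=False))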

🔷 Naive Bayes Classifier

Types of Naive Bayes

  • GaussianNB – continuous features (assumes normality)
  • MultinomialNB – count data (e.g., word counts)
  • BernoulliNB – binary features (e.g., yes/no, present/absent)
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)

Assumptions:

  • Features are conditionally independent
  • Each feature follows a known distribution

Best for:

  • Text classification (spam, sentiment); see the sketch below
  • Categorical/tabular data with low correlation

Cautions:

  • Doesn't model feature interactions
  • Overconfident if assumptions are violated
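
For the text use case, MultinomialNB is typically paired with a bag-of-words vectorizer. A toy sketch with made-up documents and labels (placeholders only, not real data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus: 1 = spam, 0 = ham
docs = ["free prize click now", "meeting moved to friday", "win cash prize now", "lunch on friday?"]
labels = [1, 0, 1, 0]

text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(docs, labels)
print(text_clf.predict(["claim your free cash"]))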

🔷 K-Nearest Neighbors (KNN)

  • Distance-based voting classifier
  • No real training phase; all the work happens at prediction time
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Best for:

  • Clean, small datasets
  • Non-linear boundaries

Needs:

  • Scaled features (distances are only meaningful when features share a scale; see the pipeline sketch below)

Watch for:

  • Sensitivity to irrelevant features and very high-dimensional data
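
A minimal sketch of handling the scaling requirement inside a pipeline, so no single large-valued feature dominates the distance metric (assumes the same X_train / y_train split used in the earlier snippets):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# The scaler is fit on the training data only, then applied consistently at predict time
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipe.fit(X_train, y_train)
y_pred = knn_pipe.predict(X_test)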

🔷 Decision Trees

  • Rule-based model that recursively splits data
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3)  # cap depth to keep the tree readable and curb overfitting
model.fit(X_train, y_train)

Strengths:

  • Fully interpretable
  • Handles categorical and numeric data

Watch for:

  • Overfitting on small datasets
  • Poor generalization without pruning or depth limits (see the sketch below)
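
A sketch that addresses both points: limit complexity (the max_depth and ccp_alpha values here are illustrative, not tuned) and print the learned rules. It assumes X_train is a pandas DataFrame with string column names:

from sklearn.tree import DecisionTreeClassifier, export_text

# Depth cap plus cost-complexity pruning keep the tree small enough to read and generalize
tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01)
tree.fit(X_train, y_train)
print(export_text(tree, feature_names=list(X_train.columns)))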

🔷 Random Forest

  • Ensemble of decision trees (bagging)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

Strengths:

  • Better generalization than single tree
  • More robust to noise and outliers than a single tree

Drawbacks:

  • Slower than logistic or NB
  • Less interpretable (use feature importance plots; see the sketch below)
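
A quick way to recover some interpretability is to rank the impurity-based feature importances of the fitted forest. A sketch, assuming the model fitted above and that X_train is a pandas DataFrame:

import pandas as pd

# Importances sum to 1.0; higher means the feature was used more for informative splits
importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))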

🔷 Gradient Boosting (XGBoost, LightGBM)

  • Builds trees sequentially, each new tree correcting the errors of the previous ones
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)

Pros:

  • High accuracy
  • Works well on mixed numeric + categorical tabular data (categoricals may need encoding, depending on the library)

Cons:

  • Less interpretable
  • Can overfit if not tuned (see the sketch below)
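
A sketch of the usual overfitting controls: a smaller learning rate, shallower trees, and row/column subsampling. The values are illustrative starting points, not tuned results:

from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=300,      # more, weaker trees pair with a lower learning rate
    learning_rate=0.05,
    max_depth=4,           # shallow trees reduce variance
    subsample=0.8,         # row subsampling per tree
    colsample_bytree=0.8,  # column subsampling per tree
)
model.fit(X_train, y_train)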

🔷 Support Vector Machine (SVM)

from sklearn.svm import SVC
model = SVC(probability=True)
model.fit(X_train, y_train)

Notes:

  • Works well in high-dim spaces
  • Needs scaled features (see the pipeline sketch below)
  • Doesn't scale well to large datasets
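
A sketch that bakes the scaling step into a pipeline so it is applied identically at fit and predict time (same X_train / X_test as in the earlier snippets):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# probability=True enables predict_proba but adds fitting cost
svm_pipe = make_pipeline(StandardScaler(), SVC(probability=True))
svm_pipe.fit(X_train, y_train)
y_proba = svm_pipe.predict_proba(X_test)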

🔷 Neural Networks (MLPClassifier)

from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)
model.fit(X_train, y_train)

Best for:

  • Complex, nonlinear patterns

Needs:

  • A reasonably large dataset, scaled inputs, and hyperparameter tuning

Not ideal for:

  • Interpretability
  • Small tabular datasets

🔒 4. Visual & Diagnostic Tools

| Tool | Purpose | Notes |
| --- | --- | --- |
| ROC Curve | Rank-based performance (uses y_proba) | Best for binary classifiers |
| Confusion Matrix | Classification accuracy by class | Text or heatmap format |
| Precision/Recall Curve | Performance under imbalance | Complements ROC on skewed datasets |
| Calibration Plot | How well predicted probabilities match reality | Good for risk modeling |
| Log-Loss Distribution | Understanding prediction confidence | Lower is better |
| SHAP / Feature Importance | Interpretability for black-box models | Best for tree-based models |
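
A minimal sketch of the first two tools using scikit-learn's built-in display helpers (requires scikit-learn >= 1.0; assumes y_test, y_pred, and y_proba from any fitted classifier above):

import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, ConfusionMatrixDisplay

# The ROC curve needs the positive-class probability; the confusion matrix needs hard labels
RocCurveDisplay.from_predictions(y_test, y_proba[:, 1])
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()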

🛠️ 5. Preprocessing Needs by Model

| Model | Needs Scaling? | Handles Categorical? | Notes |
| --- | --- | --- | --- |
| Logistic Regression | Yes | No | Encode categoricals (e.g., one-hot) |
| Naive Bayes | No | Yes (Multinomial/Bernoulli) | GaussianNB assumes roughly normal features; scaling is optional |
| KNN | Yes | No | Highly sensitive to feature scale |
| Decision Tree | No | Not in scikit-learn | Encode categoricals; trees are insensitive to feature scale |
| Random Forest | No | Not in scikit-learn | Ordinal or one-hot encoding both work for sklearn trees |
| Gradient Boosting | No | Library-dependent | LightGBM handles categoricals natively; XGBoost needs encoding (or enable_categorical) |
| SVM | Yes | No | Requires numeric, scaled input |
| Neural Net (MLP) | Yes | No | Standardize inputs for stability |
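
A ColumnTransformer keeps these per-column requirements in one place. A sketch feeding a scale-sensitive model; the column names are placeholders you would replace with your own, and X_train is assumed to be a pandas DataFrame containing them:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

numeric_cols = ["age", "income"]            # hypothetical numeric columns
categorical_cols = ["region", "plan_type"]  # hypothetical categorical columns

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

clf = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)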

🏃‍♂️ Next Steps

  • [ ] Add classifier-specific runner functions (logistic, NB, tree, etc.)
  • [ ] Add diagnostic quickrefs per model (conf matrix, AUC, etc.)
  • [ ] Add SHAP + interpretability demos for tree models
  • [ ] Add flowcharts for choosing a classifier