
Model Guidebook


🎯 Purpose

This guidebook is a complete reference for preparing data for, fitting, and interpreting logistic regression models for binary outcomes. It covers EDA, assumption checks, model building, diagnostics, and interpretation strategies.


🧭 1. Problem Setup

✅ Use Logistic Regression When:

  • Your dependent variable is binary (e.g., 0/1, Yes/No)
  • You want to estimate the probability of an event (success vs. failure)
  • Predictors can be numerical or categorical
  • This setup is also called binary or binomial logistic regression

🔢 Model Formula:

$$ P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \cdots + \beta_kX_k)}} $$
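
Equivalently, the model is linear on the log-odds (logit) scale, which is why coefficients are read as changes in log odds:

$$ \log\frac{P(Y=1)}{1 - P(Y=1)} = \beta_0 + \beta_1X_1 + \cdots + \beta_kX_k $$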

🧮 Variants of Logistic Regression

๐Ÿ” Multinomial Logistic Regression

  • Use when the target has 3+ unordered categories (e.g., 'Red', 'Green', 'Blue')
  • Models class probabilities using the softmax function
  • sklearn.linear_model.LogisticRegression(multi_class='multinomial', solver='lbfgs'), as in the sketch below
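
A minimal sketch of that call on synthetic data (the dataset here is made up for illustration; recent scikit-learn releases may warn that multi_class is deprecated because multinomial is now the default):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem, purely for illustration
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4, random_state=0)

clf = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))  # softmax probabilities, one column per class
```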

📊 Ordinal Logistic Regression (Proportional Odds)

  • Use when the target has ordered categories (e.g., 'Poor', 'Fair', 'Good')
  • Available via statsmodels.miscmodels.ordinal_model.OrderedModel, as in the sketch below
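
A minimal sketch, assuming a DataFrame df with an ordered categorical column rating and numeric predictors x1 and x2 (all hypothetical names):

```python
from statsmodels.miscmodels.ordinal_model import OrderedModel

# df, 'rating', 'x1', 'x2' are hypothetical; 'rating' should be an ordered categorical
mod = OrderedModel(df['rating'], df[['x1', 'x2']], distr='logit')
res = mod.fit(method='bfgs')
print(res.summary())
```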

🧮 Poisson and Negative Binomial Regression

  • Used for count outcomes (0, 1, 2, ...)
  • Poisson assumes the mean equals the variance
  • Negative Binomial handles overdispersion (variance > mean)
  • Libraries: statsmodels.genmod.generalized_linear_model.GLM with a Poisson or NegativeBinomial family, as in the sketch below
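
A minimal sketch of both count models on synthetic data (all values illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))               # intercept + 2 predictors
y = rng.poisson(lam=np.exp(X @ np.array([0.5, 0.3, -0.2])))  # synthetic counts

poisson_res = sm.GLM(y, X, family=sm.families.Poisson()).fit()
negbin_res = sm.GLM(y, X, family=sm.families.NegativeBinomial()).fit()
print(poisson_res.aic, negbin_res.aic)  # compare fits; NB helps when variance > mean
```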

๐Ÿ” 2. Pre-Model EDA Checklist

  • ✅ Class balance (target variable)
  • ✅ Boxplots or KDEs of numeric features by outcome
  • ✅ Crosstabs of categorical features by outcome
  • ✅ Check linearity of the logit (binned plots)
  • ✅ Multicollinearity (correlation heatmap + VIF)

See: [[📊 Advanced Visual EDA Guide for Logistic Regression]], [[🧭 Logistic Regression EDA Quickref]]
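
A quick sketch of two of these checks, assuming a DataFrame df with numeric predictors and a binary target column (hypothetical names):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Class balance of the outcome
print(df['target'].value_counts(normalize=True))

# VIF per numeric predictor (values above ~5-10 suggest multicollinearity)
X_num = df.drop(columns='target').select_dtypes('number')
vif = pd.Series(
    [variance_inflation_factor(X_num.values, i) for i in range(X_num.shape[1])],
    index=X_num.columns,
)
print(vif.sort_values(ascending=False))
```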


๐Ÿ“ 3. Fit the Model

Using statsmodels

```python
import statsmodels.api as sm

X = sm.add_constant(X)        # add an intercept column
model = sm.Logit(y, X).fit()  # fit by maximum likelihood
model.summary()               # coefficient table, p-values, pseudo R-squared
```
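
The fitted result also exposes the goodness-of-fit numbers used in section 6; a quick sketch:

```python
print(model.prsquared)  # McFadden's pseudo R-squared
print(model.aic)        # Akaike information criterion
print(model.llf)        # model log-likelihood
```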

Using sklearn

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()  # applies L2 regularization by default
clf.fit(X_train, y_train)
```

📈 4. Interpret Output

statsmodels

| Element | Meaning |
| --- | --- |
| coef | Change in log odds per unit increase in X |
| exp(coef) | Odds ratio (multiplicative effect on the odds) |
| p-value | Significance of the predictor |
| conf. interval | Range of the expected effect |

For example, an odds ratio of 1.5 means each one-unit increase in X multiplies the odds of the event by 1.5.

```python
np.exp(model.params)  # convert coefficients to odds ratios
```

sklearn

| Attribute | Meaning |
| --- | --- |
| clf.coef_ | Coefficient estimates (log-odds scale) |
| clf.intercept_ | Intercept |
| clf.predict() | Class predictions (0/1) |
| clf.predict_proba() | Predicted probabilities per class |
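
To turn sklearn coefficients into odds ratios (feature_names below is a hypothetical list matching the columns of X_train):

```python
import numpy as np
import pandas as pd

# feature_names is hypothetical; substitute your actual column names
odds_ratios = pd.Series(np.exp(clf.coef_[0]), index=feature_names)
print(odds_ratios.sort_values(ascending=False))
```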

📊 5. Model Evaluation and Classification Metrics

Evaluating a classifier means looking beyond overall accuracy at the balance of true positives, false positives, and false negatives, and at the quality of the probability estimates themselves. Below are the most commonly used metrics and how to interpret them in practice.

Confusion Matrix

Used to visualize the actual vs. predicted class breakdown: how many true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) your model produced.

```python
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')  # annotated counts per cell
```

  • Useful for diagnosing imbalance problems and the types of misclassification
  • Should always be paired with class labels if classes are imbalanced


📋 Important Note on .predict_proba() and the Confusion Matrix

To generate a confusion matrix, you must supply class labels (0 or 1), not probabilities.

There are two ways to obtain class labels:

  • Use .predict() directly (applies default threshold of 0.5 internally)
  • Use .predict_proba() and manually apply a custom threshold

Example (Custom Threshold):

```python
from sklearn.metrics import confusion_matrix

# Predicted probability of the positive class
proba = clf.predict_proba(X_test)[:, 1]

# Manually threshold probabilities to generate hard labels
y_pred_custom = (proba > 0.6).astype(int)

# Now you can evaluate
confusion_matrix(y_test, y_pred_custom)
```

Classification Report

Summarizes precision, recall, F1-score, and support for each class.

```python
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
```

  • Precision: what proportion of predicted positives are truly positive?
  • Recall: what proportion of actual positives were correctly identified?
  • F1-Score: harmonic mean of precision and recall; good for imbalanced classes

Accuracy, Precision, Recall, and F1-Score

Manually extract the core evaluation metrics for reporting, threshold tuning, or optimization.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
```

| Metric | Meaning | When to Prioritize |
| --- | --- | --- |
| Accuracy | Overall percentage of correct predictions | Balanced datasets with equal cost of FP/FN |
| Precision | How many predicted positives were correct | High cost of false positives (e.g., spam detection) |
| Recall | How many actual positives were identified | High cost of false negatives (e.g., cancer screening) |
| F1-Score | Harmonic mean of precision and recall | Imbalanced datasets or uneven class importance |

🧠 Key Notes:

  • Accuracy can be misleading if classes are imbalanced.
  • Precision, Recall, and F1 are directly derived from the confusion matrix (TP, FP, FN).
  • Use F1-Score if you need a balanced tradeoff between Precision and Recall.

✅ Use .predict() outputs (not .predict_proba()) to calculate these metrics, as in the sketch below.
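
A minimal sketch, reusing the fitted clf and the test split from section 3:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = clf.predict(X_test)  # hard 0/1 labels at the default 0.5 threshold
print('accuracy :', accuracy_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))
print('recall   :', recall_score(y_test, y_pred))
print('f1       :', f1_score(y_test, y_pred))
```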


ROC Curve + AUC

Used to evaluate probabilistic classifiers: measures the trade-off between the true positive rate (recall) and the false positive rate across threshold levels.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_proba = clf.predict_proba(X_test)  # per-class probability estimates
fpr, tpr, thresholds = roc_curve(y_test, y_proba[:, 1])
plt.plot(fpr, tpr)
print('AUC:', roc_auc_score(y_test, y_proba[:, 1]))
```

  • ROC Curve: visualizes performance over all thresholds
  • AUC (Area Under the Curve): 1.0 = perfect, 0.5 = random guessing
  • Ideal for comparing classifiers or tuning thresholds


Visualizing Model Confidence with a Distribution Plot

After fitting your logistic model, you can plot the distribution of predicted probabilities for each true class.
This helps diagnose model certainty and spot threshold-tuning opportunities.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Predicted probability of the positive class
proba = clf.predict_proba(X_test)[:, 1]

# One density per true class (note the keyword x=; a positional vector
# plus hue can confuse seaborn's data interface)
sns.histplot(x=proba, hue=y_test, bins=30, kde=True, stat='density', common_norm=False)
plt.axvline(x=0.5, color='red', linestyle='--', label='Default Threshold 0.5')
plt.xlabel('Predicted Probability of Class 1')
plt.title('Predicted Probabilities by True Class')
plt.legend()
plt.show()
```

🧠 How to Interpret:

| Region | Meaning |
| --- | --- |
| Left of threshold | Model predicts the negative class (0) |
| Right of threshold | Model predicts the positive class (1) |
| Overlap of classes | ⚠️ Model uncertainty; potential misclassifications |

  • Sharp separation: high model confidence
  • Heavy overlap: threshold adjustment may be needed

🎯 When to Use:

  • Evaluate how confident your model is
  • Decide whether to raise or lower the threshold based on business priorities (e.g., fraud detection, medical diagnosis)


🧪 6. Assumptions & Diagnostics

| Check | Tool | Insight |
| --- | --- | --- |
| Linearity of the logit | Binned log-odds plot | Is a transformation needed? |
| Multicollinearity | Heatmap + VIF | Drop or combine predictors |
| Influential observations | Cook's Distance | Check outliers |
| Goodness of fit | Pseudo R² / AIC | Compare to the null model |
| Classification accuracy | Confusion matrix, ROC | Evaluate predictive performance |
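
A sketch of the binned log-odds check from the first row, assuming a DataFrame df with a numeric predictor x and a binary target (hypothetical names):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

bins = pd.qcut(df['x'], q=10)  # decile bins of the predictor
grouped = df.groupby(bins, observed=True)
event_rate = grouped['target'].mean().clip(1e-3, 1 - 1e-3)  # avoid log(0)
log_odds = np.log(event_rate / (1 - event_rate))            # empirical logit per bin

plt.plot(grouped['x'].mean(), log_odds, marker='o')
plt.xlabel('x (bin mean)')
plt.ylabel('Empirical log-odds')
plt.show()  # a roughly straight line supports linearity of the logit
```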

🧠 7. Probability Thresholding & Predictive Strategy

Adjusting the Classification Threshold

```python
# Raise the default 0.5 threshold to 0.6
y_pred_custom = (clf.predict_proba(X_test)[:, 1] > 0.6).astype(int)
```

  • Customize the threshold based on the domain (e.g., 0.6 instead of 0.5)
  • Useful for imbalanced classes or risk-sensitive decisions

Visualizing the Precision-Recall Tradeoff

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_proba = clf.predict_proba(X_test)  # per-class probability estimates
precision, recall, thresholds = precision_recall_curve(y_test, y_proba[:, 1])
plt.plot(thresholds, precision[:-1], label='Precision')
plt.plot(thresholds, recall[:-1], label='Recall')
plt.legend()
plt.show()
```
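
A common follow-up, sketched here: pick the threshold that maximizes F1 along the curve (reusing precision, recall, and thresholds from the block above):

```python
import numpy as np

# precision and recall have one more entry than thresholds, hence the [:-1]
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]
print('F1-optimal threshold:', best_threshold)
```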

✅ Summary

  • Start with EDA: class balance, feature relationships
  • Confirm assumptions: multicollinearity, logit linearity
  • Fit logistic model with Logit() or LogisticRegression()
  • Interpret coefficients using odds ratios
  • Evaluate with confusion matrix, ROC, classification report
  • Adjust thresholds to meet precision/recall needs

  • [[Links]]