# Random Forest

## Purpose
This QuickRef explains how to use the Random Forest Classifier, a powerful ensemble-based method that reduces overfitting and improves predictive accuracy compared with a single decision tree.
## 1. When to Use
| Condition | Use RF? |
|---|---|
| You want better generalization than a single tree | ✅ Yes |
| Mixed data types or missing values | ✅ Yes (robust) |
| Need feature importance estimates | ✅ Yes |
| Must deploy a simple/explainable model | ❌ Use a shallow tree or logistic regression |
## 2. Core Logic
- Builds many randomized decision trees
- Aggregates predictions via majority vote (classification)
- Reduces variance while preserving flexibility
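In spirit, this is bagging over decision trees with extra per-split feature randomness. A minimal hand-rolled sketch of the idea (the toy dataset and the 25-tree count are illustrative assumptions; `RandomForestClassifier` does all of this for you):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# toy data for illustration; in practice use your own X, y
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):                                  # build many randomized trees
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
    trees.append(tree.fit(X[idx], y[idx]))

# aggregate: majority vote across the individual trees (binary labels 0/1 here)
votes = np.stack([t.predict(X) for t in trees])
y_hat = (votes.mean(axis=0) >= 0.5).astype(int)
```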
## 3. Fitting in sklearn
```python
from sklearn.ensemble import RandomForestClassifier

# X_train / y_train are assumed to come from your own train/test split
model = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)
model.fit(X_train, y_train)
```
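After fitting, prediction and a quick hold-out score follow the usual estimator API. A short sketch, assuming `X_test` / `y_test` come from the same split:

```python
from sklearn.metrics import accuracy_score

# hard class predictions on held-out data
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))

# class probabilities, useful for threshold tuning or ROC analysis
proba = model.predict_proba(X_test)
```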
## 4. Key Hyperparameters
| Param | Description |
|---|---|
| `n_estimators` | Number of trees in the forest |
| `max_depth` | Maximum depth of each tree |
| `max_features` | Number of features considered per split |
| `min_samples_split` | Min samples needed to split an internal node |
| `bootstrap` | Whether trees are built on bootstrap samples |
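A sketch of tuning a few of these with `GridSearchCV`; the grid values here are illustrative assumptions, not recommendations, so adjust them to your data size and compute budget:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# illustrative grid over the hyperparameters listed above
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,        # 5-fold cross-validation per candidate
    n_jobs=-1,   # parallelize across cores
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```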
## 5. Feature Importance
```python
import matplotlib.pyplot as plt

# impurity-based importances learned during fitting (assumes X is a DataFrame)
importances = model.feature_importances_
plt.barh(X.columns, importances)
plt.show()
```
Use permutation importance or SHAP for deeper insight.
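For example, permutation importance is built into scikit-learn. A brief sketch, assuming a held-out `X_test` / `y_test`:

```python
from sklearn.inspection import permutation_importance

# evaluate on held-out data to avoid the bias of impurity-based scores
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
for name, score in zip(X.columns, result.importances_mean):
    print(f"{name}: {score:.3f}")
```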
## 6. Tips & Limitations
- Less interpretable than a single tree
- Slower to train & predict on large datasets
- Can still overfit if trees are very deep or the training data is limited
## Checklist
- [ ] Class imbalance reviewed (consider `class_weight='balanced'`)
- [ ] `n_estimators` tuned for performance
- [ ] Tree depth + node size limited to prevent overfitting
- [ ] Feature importances visualized
- [ ] Cross-validated results confirmed (see the sketch below)
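A minimal sketch covering the first and last checklist items; the fold count and scoring metric are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# class_weight='balanced' reweights classes inversely to their frequency
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)

# 5-fold cross-validation with a metric that respects class imbalance
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(scores.mean(), scores.std())
```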
## Tip
"Random forests don't overfit easily, but that doesn't mean they're immune to noise."