# Random Forest

## Purpose
This QuickRef explains how to use the Random Forest Classifier, a powerful ensemble-based method that reduces overfitting and improves predictive accuracy compared with a single decision tree.
## 1. When to Use
| Condition | Use RF? |
|---|---|
| You want better generalization than a single tree | ✅ Yes |
| Mixed data types or missing values | ✅ Yes (robust) |
| Need feature importance estimates | ✅ Yes |
| Must deploy a simple/explainable model | ❌ Use a shallow tree or logistic regression |
## 2. Core Logic
- Builds many randomized decision trees
- Aggregates predictions via majority vote (classification)
- Reduces variance while preserving flexibility
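In spirit, this is bagging over decision trees with extra per-split feature randomness. A minimal hand-rolled sketch of the idea (the toy dataset and the 25-tree count are illustrative assumptions; `RandomForestClassifier` does all of this for you):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# toy data for illustration; in practice use your own X, y
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):                                  # build many randomized trees
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
    trees.append(tree.fit(X[idx], y[idx]))

# aggregate: majority vote across the individual trees (binary labels 0/1 here)
votes = np.stack([t.predict(X) for t in trees])
y_hat = (votes.mean(axis=0) >= 0.5).astype(int)
```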
## 3. Fitting in sklearn
```python
from sklearn.ensemble import RandomForestClassifier

# X_train / y_train are assumed to come from your own train/test split
model = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)
model.fit(X_train, y_train)
```
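After fitting, prediction and a quick hold-out score follow the usual estimator API. A short sketch, assuming `X_test` / `y_test` come from the same split:

```python
from sklearn.metrics import accuracy_score

# hard class predictions on held-out data
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))

# class probabilities, useful for threshold tuning or ROC analysis
proba = model.predict_proba(X_test)
```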
## 4. Key Hyperparameters
| Param | Description |
|---|---|
| `n_estimators` | Number of trees in the forest |
| `max_depth` | Maximum depth of each tree |
| `max_features` | Number of features considered per split |
| `min_samples_split` | Min samples needed to split an internal node |
| `bootstrap` | Whether trees are built on bootstrap samples |
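A sketch of tuning a few of these with `GridSearchCV`; the grid values here are illustrative assumptions, not recommendations, so adjust them to your data size and compute budget:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# illustrative grid over the hyperparameters listed above
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,        # 5-fold cross-validation per candidate
    n_jobs=-1,   # parallelize across cores
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```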
## 5. Feature Importance
```python
import matplotlib.pyplot as plt

# impurity-based importances learned during fitting (assumes X is a DataFrame)
importances = model.feature_importances_
plt.barh(X.columns, importances)
plt.show()
```
Use permutation importance or SHAP for deeper insight.
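For example, permutation importance is built into scikit-learn. A brief sketch, assuming a held-out `X_test` / `y_test`:

```python
from sklearn.inspection import permutation_importance

# evaluate on held-out data to avoid the bias of impurity-based scores
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
for name, score in zip(X.columns, result.importances_mean):
    print(f"{name}: {score:.3f}")
```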
## 6. Tips & Limitations
- Less interpretable than a single tree
- Slower to train & predict on large datasets
- Can still overfit if trees are very deep or the training data is limited
## Checklist
- [ ] Class imbalance reviewed (consider `class_weight='balanced'`)
- [ ] `n_estimators` tuned for performance
- [ ] Tree depth + node size limited to prevent overfitting
- [ ] Feature importances visualized
- [ ] Cross-validated results confirmed (see the sketch below)
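A minimal sketch covering the first and last checklist items; the fold count and scoring metric are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# class_weight='balanced' reweights classes inversely to their frequency
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)

# 5-fold cross-validation with a metric that respects class imbalance
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(scores.mean(), scores.std())
```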
## Tip
"Random forests don't overfit easily, but that doesn't mean they're immune to noise."