Random Forest Classifier
- Random Forest is an ensemble machine learning algorithm that combines multiple decision trees to produce more accurate and stable predictions.
Voting Mechanism
Random Forest uses majority voting to decide the final class.
How it Works:
Multiple trees predict the class of a sample.
Count votes for each class.
Class with most votes is chosen.
Example
Suppose 5 trees predict a sample:
Votes: Class A = 3, Class B = 2
Final Prediction → Class A
This is called majority voting.
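The voting steps above can be sketched in a few lines of plain Python (a minimal illustration; `majority_vote` is a helper written here for demonstration, not a library function):

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class that received the most votes among the trees."""
    votes = Counter(tree_predictions)
    return votes.most_common(1)[0][0]

# Five trees vote on one sample: 3 votes for "A", 2 for "B"
print(majority_vote(["A", "B", "A", "A", "B"]))  # -> A
```

This mirrors the example: Class A wins with 3 votes out of 5.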
Bagging (Bootstrap Aggregating)
Random Forest uses Bagging to create multiple datasets for each tree.
Steps:
Create multiple random samples with replacement from the training data.
Train a decision tree on each sample.
Combine predictions of all trees (voting for classification).
Why Does Bagging Help?
Reduces variance
Prevents overfitting
Each tree sees slightly different data → more robust model
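The bootstrap sampling step itself is easy to demonstrate with NumPy (a sketch of the sampling step only, using toy sample indices rather than real data):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)  # toy training set: 10 sample indices

# Each bootstrap sample is the same size as the original data but is drawn
# WITH replacement, so some samples repeat and others are left out entirely.
# Each tree is then trained on its own bootstrap sample.
for i in range(3):
    sample = rng.choice(data, size=len(data), replace=True)
    print(f"Tree {i} trains on: {sorted(sample)}")
```

Notice that each printed sample contains duplicates and omissions: this is exactly why every tree sees slightly different data.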
Feature Selection (Random Subspace Method)
Random Forest introduces random feature selection:
Each tree considers a random subset of features when splitting nodes
Not all features are used at every split
Benefits:
Reduces correlation between trees
Increases model diversity
Improves generalization
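In scikit-learn, the size of this random feature subset is exposed as the `max_features` parameter of `RandomForestClassifier` (for classification the default, `"sqrt"`, considers roughly the square root of the total number of features at each split):

```python
from sklearn.ensemble import RandomForestClassifier

# max_features controls how many candidate features each split considers.
# "sqrt" (the classification default) keeps trees decorrelated; max_features=1.0
# would let every split see all features, making the trees more similar.
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
```

Lowering `max_features` increases tree diversity at the cost of weaker individual trees; the ensemble usually benefits from the trade-off.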
Model Stability
Random Forest is more stable than a single Decision Tree:
Less sensitive to noise in data
Less overfitting
Performance is more consistent across datasets
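One way to observe this stability is to compare cross-validation scores of a single decision tree against a forest on a synthetic dataset (a sketch; the exact scores depend on the generated data and seeds):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
forest_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)

# The forest's fold-to-fold scores typically vary less than the single tree's.
print("Tree   mean/std:", tree_scores.mean(), tree_scores.std())
print("Forest mean/std:", forest_scores.mean(), forest_scores.std())
```

A smaller standard deviation across folds is the "more consistent performance" described above.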
Example: Pass/Fail Classification
This Python example uses a Random Forest Classifier to predict whether a student will pass or fail based on study hours and attendance. The code builds a small dataset, splits it into training and testing sets, trains a Random Forest with 100 decision trees, evaluates its accuracy, and prints the feature importances to show which factor contributes more to the prediction.
# Step 1: Import Libraries
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Step 2: Create Dataset
X = np.array([
[2, 50],
[3, 60],
[5, 80],
[6, 90],
[1, 40]
]) # Features: [Study Hours, Attendance]
y = np.array([0, 0, 1, 1, 0]) # Labels: 0=Fail, 1=Pass
# Step 3: Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)  # 2 of the 5 samples are held out for testing
# Step 4: Create Random Forest Model
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Step 5: Train Model
model.fit(X_train, y_train)
# Step 6: Predict
prediction = model.predict(X_test)
print("Prediction:", prediction)
print("Accuracy:", model.score(X_test, y_test))
# Step 7: Feature Importance
print("Feature Importance:", model.feature_importances_)
Sample Output (exact values vary with the random train/test split):
Prediction: [0 1]
Accuracy: 0.5
Key Concepts