Random Forest Classifier

  • Random Forest is an ensemble machine learning algorithm that combines multiple decision trees to produce more accurate and stable predictions.
  • Voting Mechanism

    Random Forest uses majority voting to decide the final class.

    How it Works:

    1. Multiple trees each predict the class of a sample.

    2. Votes are counted for each class.

    3. The class with the most votes is chosen.

    Example

    Suppose 5 trees predict a sample:

    Tree    Prediction
    1       Class A
    2       Class B
    3       Class A
    4       Class A
    5       Class B

    • Votes: Class A = 3, Class B = 2

    • Final Prediction → Class A

    This is called majority voting.
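    The voting step above can be sketched in a few lines. This is an illustrative helper (the name `majority_vote` is ours, not part of any library), using the five tree predictions from the table:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

# The five tree predictions from the table above
tree_predictions = ["Class A", "Class B", "Class A", "Class A", "Class B"]
print(majority_vote(tree_predictions))  # Class A (3 votes to 2)
```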


    Bagging (Bootstrap Aggregating)

    Random Forest uses Bagging to create multiple datasets for each tree.

    Steps:

    1. Create multiple random samples with replacement from the training data.

    2. Train a decision tree on each sample.

    3. Combine predictions of all trees (voting for classification).

    Why Bagging Helps

    • Reduces variance

    • Prevents overfitting

    • Each tree sees slightly different data → more robust model
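    The three bagging steps can be sketched by hand. This is a minimal illustration, not how scikit-learn implements it internally; the seed, the number of trees (10), and the query sample `[4, 70]` are arbitrary choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

X = np.array([[2, 50], [3, 60], [5, 80], [6, 90], [1, 40]])
y = np.array([0, 0, 1, 1, 0])

trees = []
for _ in range(10):
    # Step 1: bootstrap sample — draw n row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: train one decision tree on the resampled data
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Step 3: combine predictions by majority vote
sample = np.array([[4, 70]])
votes = np.array([t.predict(sample)[0] for t in trees])
print("Votes:", votes, "-> final:", np.bincount(votes).argmax())
```

    Because each tree sees a different bootstrap sample, their individual errors differ, and averaging their votes smooths those errors out.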


    Feature Selection (Random Subspace Method)

    Random Forest introduces random feature selection:

    • Each tree considers a random subset of features when splitting nodes

    • Not all features are used at every split

    Benefits:

    • Reduces correlation between trees

    • Increases model diversity

    • Improves generalization
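    In scikit-learn, the `max_features` parameter of `RandomForestClassifier` controls this random subset. A short sketch on a synthetic dataset (the dataset sizes and seeds here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 200 samples, 10 features (arbitrary illustrative sizes)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# max_features="sqrt" (the classification default) means each split
# considers only sqrt(10) ≈ 3 randomly chosen features
model = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                               random_state=0)
model.fit(X, y)
print("Training accuracy:", model.score(X, y))
```

    Setting `max_features` to the full feature count would make the trees more similar to one another, reducing the diversity benefit.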


    Model Stability

    Random Forest is more stable than a single Decision Tree:

    • Less sensitive to noise in data

    • Less overfitting

    • Performance is more consistent across datasets
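    One way to see this stability is to compare cross-validation scores of a single tree against a forest on the same data. A sketch with arbitrary dataset sizes and seeds; on most datasets the forest's scores tend to spread less across folds, though this is not guaranteed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5)

print("Tree  : mean %.3f, std %.3f" % (tree_scores.mean(), tree_scores.std()))
print("Forest: mean %.3f, std %.3f" % (forest_scores.mean(), forest_scores.std()))
```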

    Example: Pass/Fail Classification

This Python example uses a Random Forest Classifier to predict whether a student will pass or fail based on study hours and attendance. The code creates a small dataset, splits it into training and testing sets, trains a Random Forest model with multiple decision trees, and evaluates the model's accuracy. It also displays the feature importances to show which factor contributes more to the prediction.

# Step 1: Import Libraries
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Step 2: Create Dataset
X = np.array([
    [2, 50],
    [3, 60],
    [5, 80],
    [6, 90],
    [1, 40]
])  # Features: [Study Hours, Attendance]

y = np.array([0, 0, 1, 1, 0])  # Labels: 0=Fail, 1=Pass

# Step 3: Split Data (random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: Create Random Forest Model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Step 5: Train Model
model.fit(X_train, y_train)

# Step 6: Predict
prediction = model.predict(X_test)
print("Prediction:", prediction)
print("Accuracy:", model.score(X_test, y_test))

# Step 7: Feature Importance
print("Feature Importance:", model.feature_importances_)
  • Output: the model predicts a single label, because with test_size=0.2 the test set holds just one of the five samples; the reported accuracy is therefore either 0.0 or 1.0 and depends on which sample lands in the test set.

    Key Concepts

    Concept             Meaning
    Voting Mechanism    Majority voting among trees
    Bagging             Random sampling with replacement
    Feature Selection   Random subset of features at each split
    Model Stability     Reduced overfitting, robust predictions