Cross Validation

  • This lesson explains cross validation and how it provides more reliable evaluation of machine learning models.
  Why Cross Validation Is Better than a Simple Split

    • Simple Split Problem:

      • Only one train-test division → performance depends on that split

      • May overestimate or underestimate model performance

    • Cross Validation Solution:

      • Model is trained and tested on multiple subsets of data

      • Provides a more reliable and stable estimate of model performance
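The simple-split problem is easy to demonstrate: training the same model on different random splits of the same data gives different scores. A minimal sketch (the Iris dataset and Logistic Regression here are illustrative stand-ins, not prescribed choices):

```python
# Sketch: a single train/test split's score depends on which split you got.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same model, same data -- only the random split changes
for seed in [0, 1, 2]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression(max_iter=200).fit(X_train, y_train)
    print(f"random_state={seed}: accuracy={model.score(X_test, y_test):.3f}")
```

Each run reports a different test accuracy, which is exactly the variance that cross validation averages away.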


    Concept of Multiple Validation

    • K-Fold Cross Validation:

      • Divide dataset into K equal folds

      • Iterate K times:

        • Use K-1 folds for training

        • Use 1 fold for testing

      • Each fold serves as test set once

    • Outcome: Each data point is used for both training and testing

    Example:

    • Dataset = 100 samples, K=5 folds → each fold = 20 samples

    • Iterations:

      1. Train on 80, Test on 20

      2. Train on a different 80, test on the next 20
        … and so on, until each of the 5 folds has served as the test set once
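The iteration above can be sketched with scikit-learn's KFold splitter on 100 dummy samples:

```python
# Sketch: KFold divides 100 samples into 5 folds of 20, and each fold
# is held out as the test set exactly once.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)  # 100 dummy samples

kf = KFold(n_splits=5)
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {i}: train on {len(train_idx)}, test on {len(test_idx)}")
    # -> Fold 1: train on 80, test on 20  (and likewise for folds 2-5)
```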


    Average Model Performance

    • After K iterations, average the evaluation metric (accuracy, F1, R², etc.) across all folds

    • Provides a robust estimate of the model's performance

    • Reduces the variance caused by a single random train-test split
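The averaging step can be sketched with scikit-learn's cross_validate, which scores several metrics per fold at once (the Iris dataset and Logistic Regression are again illustrative stand-ins):

```python
# Sketch: average several metrics across folds with cross_validate.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# One score per fold, per metric
results = cross_validate(model, X, y, cv=5,
                         scoring=['accuracy', 'f1_macro'])

for metric in ['accuracy', 'f1_macro']:
    scores = results[f'test_{metric}']
    print(f"{metric}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting the standard deviation alongside the mean shows how stable the estimate is across folds.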


    Python Example: K-Fold Cross Validation

Cross Validation Example in Python using Logistic Regression

This Python example demonstrates how to evaluate a machine learning model using K-Fold Cross Validation. The code loads the Iris dataset, creates a Logistic Regression model, and performs 5-fold cross validation using cross_val_score from scikit-learn. It prints the accuracy for each fold and the average accuracy to measure the model’s overall performance.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Step 1: Load Dataset
iris = load_iris()
X, y = iris.data, iris.target

# Step 2: Create Model
model = LogisticRegression(max_iter=200)

# Step 3: Perform 5-Fold Cross Validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

# Step 4: Print Scores
print("Accuracy for each fold:", scores)
print("Average Accuracy:", scores.mean())
  • Output:

    Accuracy for each fold: [0.96666667 1.         0.93333333 0.96666667 1.        ]

    Average Accuracy: 0.9733333333333334


    • Each fold gives individual accuracy

    • Mean accuracy → final performance estimate