
Introduction to Model Validation

  • This lesson introduces the concept of model validation and explains why it is essential for evaluating machine learning models and preventing overfitting.
  • Why Model Validation is Important

    1. Check Generalization: Ensures the model can predict well on new, unseen data.

    2. Detect Overfitting/Underfitting: Helps identify if the model is too complex or too simple.

    3. Compare Models: Evaluate several candidate models under the same validation scheme to choose the best one.

    4. Improve Reliability: Provides confidence that the model's performance is realistic.
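The simplest form of model validation is a hold-out split: part of the data is set aside and never used for training, so the error on it estimates performance on unseen data. A minimal sketch using scikit-learn (the synthetic dataset and the choice of linear regression here are illustrative assumptions, not part of the lesson):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Illustrative synthetic data: y is roughly linear in X with some noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + rng.normal(0, 1, size=100)

# Hold out 25% of the data to estimate generalization error
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Train MSE: {train_mse:.3f}, Test MSE: {test_mse:.3f}")
```

Because the test points were never seen during fitting, a test MSE close to the train MSE suggests the model generalizes rather than memorizes.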


    Training Error vs Testing Error

    Error Type     | Definition                           | Insight
    -------------- | ------------------------------------ | ------------------------------------------
    Training Error | Error calculated on training data    | How well the model fits the training data
    Testing Error  | Error calculated on unseen test data | How well the model generalizes

    • Goal: Low training error and low testing error.

    Typical Scenarios

    1. Underfitting:

      • Training Error → High

      • Testing Error → High

      • Model too simple → fails to capture patterns

    2. Good Fit (Generalization):

      • Training Error → Low

      • Testing Error → Low

      • Model captures patterns without overfitting

    3. Overfitting:

      • Training Error → Very Low

      • Testing Error → High

      • Model memorizes training data → poor generalization


    Overfitting & Generalization

    Overfitting

    • Model fits training data too closely

    • Captures noise as patterns

    • Poor performance on new data

    Solution:

    • Reduce complexity

    • Use regularization (L1/L2)

    • Use more data

    • Cross-validation
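The regularization remedy listed above can be sketched with a short comparison: an L2-regularized (Ridge) model is fit on the same high-degree polynomial features as plain least squares, and the penalty on large coefficients discourages the wiggly, noise-chasing fit. The dataset, degree, and alpha value below are illustrative assumptions:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

np.random.seed(0)
X = np.linspace(0, 10, 30).reshape(-1, 1)
y = np.sin(X).ravel() + 0.2 * np.random.randn(30)

# Interleaved split so train and test cover the same range
X_train, X_test = X[::2], X[1::2]
y_train, y_test = y[::2], y[1::2]

for name, reg in [("OLS", LinearRegression()), ("Ridge", Ridge(alpha=1.0))]:
    # Same degree-12 features for both; only the penalty differs
    model = make_pipeline(PolynomialFeatures(degree=12), StandardScaler(), reg)
    model.fit(X_train, y_train)
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name:>5} test MSE: {test_mse:.3f}")
```

Scaling the polynomial features before Ridge matters: the L2 penalty is applied uniformly to all coefficients, so features on wildly different scales would be penalized unevenly.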

    Generalization

    • Model's ability to perform well on unseen data

    • Balanced complexity → avoids overfitting & underfitting
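A single train/test split gives one noisy estimate of generalization; k-fold cross-validation averages over several splits for a more stable estimate. A minimal sketch (the synthetic data and 5-fold setup are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(60, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=60)

# 5 folds: each point is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=cv, scoring="neg_mean_squared_error")
print("Per-fold MSE:", -scores)
print("Mean CV MSE:", -scores.mean())
```

scikit-learn's scorers follow a "greater is better" convention, hence the negated MSE that is flipped back before printing.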

    Visual Example (Python)

Underfitting vs Overfitting Example in Python using Polynomial Regression

This Python example demonstrates the concept of underfitting and overfitting using Polynomial Regression. The code generates a dataset, splits it into training and testing data, and trains models with different polynomial degrees (1, 3, and 15). It then calculates the Mean Squared Error (MSE) for both training and testing sets and visualizes the errors using Matplotlib to show how model complexity affects performance.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate dataset
np.random.seed(0)
X = np.linspace(0, 10, 20).reshape(-1,1)
y = np.sin(X) + 0.2*np.random.randn(20,1)

# Split data
X_train, X_test = X[:15], X[15:]
y_train, y_test = y[:15], y[15:]

degrees = [1, 3, 15]  # Linear = underfit, 3 = good, 15 = overfit
train_errors, test_errors = [], []

for d in degrees:
    poly = PolynomialFeatures(degree=d)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)
    
    model = LinearRegression()
    model.fit(X_train_poly, y_train)
    
    y_train_pred = model.predict(X_train_poly)
    y_test_pred = model.predict(X_test_poly)
    
    train_errors.append(mean_squared_error(y_train, y_train_pred))
    test_errors.append(mean_squared_error(y_test, y_test_pred))

# Plot errors
plt.plot(degrees, train_errors, marker='o', label='Training Error')
plt.plot(degrees, test_errors, marker='o', label='Testing Error')
plt.xlabel('Polynomial Degree')
plt.ylabel('MSE')
plt.title('Training vs Testing Error')
plt.legend()
plt.show()
  • Output:
(Figure: training and testing MSE plotted against polynomial degree)
  • Observation:

    • Degree 1 → High train & test error → Underfitting

    • Degree 3 → Low train & test error → Good generalization

    • Degree 15 → Low train error, high test error → Overfitting
