Train-Test Split

    This lesson explains how a train-test split divides a dataset into two parts: one to train a machine learning model and one to evaluate its performance.

    Concept of Data Splitting

    • Training Set: Used to train the model (learn patterns)

    • Testing Set: Used to evaluate model performance on unseen data

    Goal: Check that the model generalizes to new data rather than overfitting the training data.
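To make the goal concrete, here is a minimal sketch of how a held-out test set can reveal overfitting. The synthetic dataset from make_classification and the choice of an unpruned DecisionTreeClassifier are assumptions for illustration only; they are not part of the lesson's own example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration, not from the lesson)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 20% of the rows as unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unpruned decision tree can memorize the training set
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Compare accuracy on data the model has seen vs. data it has not
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

A large gap between the two scores suggests the model has fit the training data more closely than it generalizes.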


    Typical Ratios

    Split                   Use Case
    70% Train / 30% Test    Small to medium datasets
    80% Train / 20% Test    Medium to large datasets
    60% Train / 40% Test    When test accuracy is critical or dataset is large

    • More training data → The model has more examples to learn from

    • More test data → A more reliable estimate of performance
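The ratios above map directly onto the test_size argument of train_test_split. The sketch below assumes a toy dataset of 100 rows (chosen only so the percentages are easy to read) and prints how many rows land in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset of 100 rows (an assumption for illustration)
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

sizes = {}
for test_size in (0.2, 0.3, 0.4):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    # Record (train rows, test rows) for this ratio
    sizes[test_size] = (len(X_train), len(X_test))
    print(f"test_size={test_size}: {len(X_train)} train / {len(X_test)} test")
```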


    Random State

    • The random state (random_state in scikit-learn) makes the shuffled split reproducible

    • If you don’t set it → the split can differ on every run

    random_state = fixed integer (e.g., 42) → the same train-test split on every run
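A quick way to see this is to split the same data twice with the same seed and compare the results. This is a small sketch; the second seed (7) is an arbitrary value chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# Two calls with the same random_state produce identical test sets
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(X_test_a, X_test_b)
print("Same split with random_state=42:", same_split)

# A different seed (7 is arbitrary) will usually select different rows
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.2, random_state=7)
print("Test rows with seed 42:", X_test_a.ravel())
print("Test rows with seed 7: ", X_test_c.ravel())
```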


    When to Use Train-Test Split

    • When the dataset is large enough that both the training and test sets are representative

    • When you want a quick evaluation of model performance

    • As a first step before using more advanced validation techniques such as K-Fold
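For comparison, the sketch below shows how K-Fold differs from a single split: every row is used for testing exactly once across the folds. It uses scikit-learn's KFold; the 20-row dataset and the choice of 5 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy dataset of 20 rows (an assumption for illustration)
X = np.arange(20).reshape(-1, 1)

# 5 folds: each fold holds out a different 20% of the rows
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"Fold {fold}: {len(train_idx)} train / {len(test_idx)} test")
```

A single train-test split corresponds to using just one of these folds, which is why it is faster but gives a noisier estimate.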

    Python Example

This Python example shows how to split a dataset into training and testing sets using train_test_split from scikit-learn. The code creates a simple 20-row dataset, assigns 80% of the rows to training and 20% to testing, and prints the shape of each resulting array. Holding out the test rows makes it possible to measure how well a model performs on data it has not seen.

# Step 1: Import Libraries
from sklearn.model_selection import train_test_split
import numpy as np

# Step 2: Example Dataset
X = np.arange(20).reshape(-1, 1)  # Features: 20 rows, 1 column
y = np.arange(20)                 # Target: 20 values

# Step 3: Split Data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: Display Shapes
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

Output:

X_train shape: (16, 1)
X_test shape: (4, 1)
y_train shape: (16,)
y_test shape: (4,)