Train-Test Split
This lesson explains how a train-test split divides a dataset into two parts: one used to train a machine learning model and one used to evaluate its performance.
Concept of Data Splitting
Training Set: Used to train the model (learn patterns)
Testing Set: Used to evaluate model performance on unseen data
Goal: Ensure the model generalizes well and is not overfitting to training data.
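To make the goal above concrete, here is a minimal sketch that compares accuracy on the training set with accuracy on the held-out test set. The synthetic dataset from make_classification and the choice of DecisionTreeClassifier are assumptions made for illustration only; they are not part of the lesson's own example. A large gap between the two scores is a common sign of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only (assumption: any labeled dataset would do)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 20% of the rows as unseen test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A tree with no depth limit is able to memorize its training data
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Compare performance on seen vs. unseen data
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```

Only the test accuracy says anything about generalization; the training accuracy shows how well the model fits data it has already seen.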
Typical Ratios
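More training data → the model has more examples to learn from
More test data → a more reliable estimate of model performance
The two goals compete for the same rows, so the ratio is a trade-off. An 80% training / 20% testing split, used in the example in this lesson, is a common starting point.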
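The effect of the ratio can be checked directly. The sketch below splits a hypothetical 100-row toy dataset with a few test_size values and records how many rows land in each part. The values 0.1, 0.2, and 0.3 are illustrative choices, not recommendations from the lesson.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: 100 samples, one feature
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Split with several test_size values and record the resulting sizes
split_sizes = {}
for test_size in (0.1, 0.2, 0.3):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    split_sizes[test_size] = (len(X_train), len(X_test))
    print(f"test_size={test_size}: train={len(X_train)}, test={len(X_test)}")
```

Every row ends up in exactly one of the two sets, so increasing test_size always shrinks the training set by the same amount.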
Random State
The random_state parameter seeds the shuffle that happens before the data is split, which makes the split reproducible
If you don’t set it → the split can differ on every run
random_state = fixed integer (e.g., 42) → the same train-test split on every run
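The behavior described above can be verified with a short sketch. It reuses the same kind of toy arrays as the lesson's example; the variable names (X_test_a, X_test_b, X_test_c, same_split) are hypothetical and chosen only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 20 samples, one feature
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# Two calls with the same random_state: the test sets should match exactly
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(X_test_a, X_test_b)
print("Same seed gives same test set:", same_split)

# No random_state: each call draws a fresh shuffle, so the rows may differ per run
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.2)
print("Unseeded test rows:", X_test_c.ravel())
```

The specific number passed as random_state has no special meaning; what matters is that it stays the same between runs.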
When to Use Train-Test Split
When the dataset is large enough that both the training and test sets are representative
When you want quick evaluation of model performance
As a first step before using advanced validation techniques like K-Fold
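As a rough sketch of the last point, the code below first gets a quick score from a single train-test split and then follows up with 5-fold cross-validation on the same data. The synthetic dataset, the LogisticRegression model, and the choice of 5 folds are all assumptions made for illustration; the lesson itself does not prescribe them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data, for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Quick evaluation: a single hold-out split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
holdout_score = model.score(X_test, y_test)
print(f"Hold-out accuracy: {holdout_score:.3f}")

# Follow-up: K-Fold cross-validation, where each fold serves once as the test set
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
print("Per-fold accuracy:", cv_scores.round(3))
print(f"Mean accuracy: {cv_scores.mean():.3f}")
```

The single split is faster, while the K-Fold result averages over several splits and so depends less on which rows happened to land in the test set.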
Python Example
Train-Test Split Example in Python for Machine Learning
This Python example demonstrates how to split a dataset into training and testing sets using train_test_split from scikit-learn. The code creates a simple dataset, divides it into 80% training data and 20% testing data, and displays the shapes of each split. This process helps evaluate how well a machine learning model performs on unseen data.
# Step 1: Import Libraries
from sklearn.model_selection import train_test_split
import numpy as np
# Step 2: Example Dataset
X = np.arange(20).reshape(-1, 1)  # Features: 20 samples, 1 column
y = np.arange(20)                 # Target: one value per sample
# Step 3: Split Data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Step 4: Display Shapes
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
Output:
X_train shape: (16, 1)
X_test shape: (4, 1)
y_train shape: (16,)
y_test shape: (4,)