Train-Test Split

This lesson explains how train-test split divides datasets into two parts to train machine learning models and evaluate their performance.

Concept of Data Splitting
- Training Set: Used to train the model (learn patterns)
- Testing Set: Used to evaluate model performance on unseen data
Goal: Ensure the model generalizes well and is not overfitting to training data.
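The gap between training and testing performance can be sketched as follows. This is a minimal illustration, not a recommended modeling setup: the synthetic noisy sine data and the choice of an unpruned DecisionTreeRegressor are assumptions made here only because such a tree tends to memorize its training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: a noisy sine curve (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=200)

# Hold out 20% of the rows as unseen test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unrestricted tree can fit the training noise almost perfectly
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

# score() returns R^2; compare seen vs. unseen data
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print("Train R^2:", train_score)
print("Test R^2:", test_score)
```

A training score well above the test score suggests the model has memorized the training data rather than learned a pattern that generalizes.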

Typical Ratios
Split	Use Case
70% Train / 30% Test	Small to medium datasets
80% Train / 20% Test	Medium to large datasets
60% Train / 40% Test	When test accuracy is critical or dataset is large
- More training data → The model has more examples to learn from
- More test data → The performance estimate is more reliable
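As a rough sketch of how the ratios above translate into row counts, the loop below applies each test_size to a hypothetical 100-row dataset (the dataset and the split_sizes dictionary are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset with 100 rows
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

split_sizes = {}
for test_size in (0.2, 0.3, 0.4):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    # Record how many rows landed in each part
    split_sizes[test_size] = (len(X_train), len(X_test))
    print(f"test_size={test_size}: train={len(X_train)}, test={len(X_test)}")
```

Every row ends up in exactly one of the two parts, so raising test_size always reduces the data left for training.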

Random State
- Random state fixes the shuffling applied before the split, which makes the split reproducible
- If you don’t set it → the split may vary on every run
Random State = fixed number (e.g., 42) → you always get the same train-test split
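A small check of this behavior, assuming the same 20-row toy dataset used elsewhere in this lesson: two calls with the same random_state should return identical test rows, while a call without it may return different rows on each run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# Two splits with the same seed
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)

same_split = bool(np.array_equal(y_test_a, y_test_b))
print("Same test rows with the same seed:", same_split)

# No seed: the selected rows can change between runs
_, _, _, y_test_c = train_test_split(X, y, test_size=0.2)
print("Test targets without a fixed seed:", y_test_c)
```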

When to Use Train-Test Split
- When dataset is large enough to have representative training and test sets
- When you want quick evaluation of model performance
- As a first step before using advanced validation techniques like K-Fold
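For comparison, K-Fold repeats the split several times so that every row is used for testing once. The sketch below is only one possible setup: the synthetic linear data, the LinearRegression model, and the choice of 5 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic data: a noisy straight line
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, size=50)

# 5 folds: each fold serves as the test set once
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=kfold)

print("R^2 per fold:", scores)
print("Mean R^2:", scores.mean())
```

A single train-test split gives one score quickly; K-Fold costs more fits but shows how much the score varies from one split to another.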

Python Example

Train-Test Split Example in Python for Machine Learning

This Python example demonstrates how to split a dataset into training and testing sets using train_test_split from scikit-learn. The code creates a simple dataset, divides it into 80% training data and 20% testing data, and displays the shapes of each split. This process helps evaluate how well a machine learning model performs on unseen data.

# Step 1: Import Libraries
from sklearn.model_selection import train_test_split
import numpy as np

# Step 2: Example Dataset
X = np.arange(20).reshape(-1,1)  # Features
y = np.arange(20)                # Target

# Step 3: Split Data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: Display Shapes
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

Output:
X_train shape: (16, 1)
X_test shape: (4, 1)
y_train shape: (16,)
y_test shape: (4,)
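To go one step further than printing shapes, the split can feed a model. The sketch below reuses the same toy dataset and assumes LinearRegression as the model, which is a choice made here for illustration and not part of the example above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same toy dataset as the example above
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training set only
model = LinearRegression().fit(X_train, y_train)

# Evaluate on rows the model has not seen (R^2 score)
test_r2 = model.score(X_test, y_test)
print("Test R^2:", test_r2)
```

Because this toy target is an exact linear function of the feature, the test score should be essentially 1.0; real datasets will normally score lower.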
