ML Workflow

  • This module explains the basic Machine Learning workflow, focusing on data preparation and model training to build effective predictive models.
  • Data Preparation

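    Data preparation is one of the most important steps in Machine Learning, because a model can only be as good as the data it learns from.
    By a commonly cited rule of thumb, almost 70% of project time is spent on cleaning and preparing data.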

    Step 1: Data Collection

    Data sources can be:

    • Excel file

    • Database

    • API

    • CSV file

    Example (Student Dataset):

    Study Hours | Attendance | Result
    ------------+------------+-------
    2           | 60%        | Fail
    5           | 85%        | Pass
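
    As a rough illustration of data collection, the sketch below reads the two rows from the example table out of CSV text using only the Python standard library. The column names (study_hours, attendance, result) and the load_students helper are assumptions made for this example. In practice the text would come from a file, database, or API rather than an in-memory string.

```python
import csv
import io

# Stand-in for a CSV file on disk; the two rows are the example table above.
# The column names are assumptions made for this sketch.
CSV_TEXT = """study_hours,attendance,result
2,60%,Fail
5,85%,Pass
"""


def load_students(text):
    """Parse CSV text into a list of dicts with numeric fields converted."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({
            "study_hours": float(raw["study_hours"]),
            # Strip the percent sign so attendance becomes a plain number.
            "attendance": float(raw["attendance"].rstrip("%")),
            "result": raw["result"],
        })
    return rows


students = load_students(CSV_TEXT)
print(students)
```

    Converting the values to numbers at load time keeps the later steps (cleaning, encoding, training) simpler.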

    Step 2: Data Cleaning

    • Handle missing values

    • Remove duplicates

    • Fix incorrect data

    Example:

    • Blank attendance → fill with average

    • Duplicate rows → remove
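
    The two cleaning rules above can be sketched in plain Python as follows. The sample rows, the use of None to mark a blank cell, and the helper names are all hypothetical. Duplicates are removed first so that a repeated row does not skew the average used for filling.

```python
from statistics import mean

# Hypothetical rows: None marks a blank attendance cell, and the last row
# is an exact duplicate of the first.
rows = [
    {"study_hours": 2, "attendance": 60.0, "result": "Fail"},
    {"study_hours": 5, "attendance": None, "result": "Pass"},
    {"study_hours": 4, "attendance": 80.0, "result": "Pass"},
    {"study_hours": 2, "attendance": 60.0, "result": "Fail"},
]


def remove_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        # Sort by column name only, so values (which may be None) are
        # never compared with each other.
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_missing_with_mean(rows, column):
    """Replace None in `column` with the mean of the known values."""
    known = [row[column] for row in rows if row[column] is not None]
    average = mean(known)
    return [
        {**row, column: average if row[column] is None else row[column]}
        for row in rows
    ]


cleaned = fill_missing_with_mean(remove_duplicates(rows), "attendance")
print(cleaned)
```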

    Step 3: Feature Selection

    Not all columns are useful for the model.

    Example:

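    • Student ID → Not useful (it only identifies the student and says nothing about the result)

    • Study Hours → Useful (it is likely to influence the result)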
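
    A minimal sketch of feature selection, assuming each record is a Python dict: only the chosen feature columns are kept as model inputs, and the result column is set aside as the target. The student_id values, column names, and the select_columns helper are hypothetical.

```python
# Hypothetical records; the student_id values are made up for illustration.
rows = [
    {"student_id": "S001", "study_hours": 2, "attendance": 60.0, "result": "Fail"},
    {"student_id": "S002", "study_hours": 5, "attendance": 85.0, "result": "Pass"},
]

FEATURES = ["study_hours", "attendance"]  # columns judged useful
TARGET = "result"                         # column the model should predict


def select_columns(rows, features, target):
    """Split records into a feature matrix X and a target list y."""
    X = [[row[name] for name in features] for row in rows]
    y = [row[target] for row in rows]
    return X, y


X, y = select_columns(rows, FEATURES, TARGET)
print(X)  # student_id is no longer present
print(y)
```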

    Step 4: Feature Engineering

    Feature engineering means creating new, meaningful features from existing data.

    Example:

    • Convert Attendance % into categories

    • Calculate Total Score
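
    Both examples can be sketched as small functions. The attendance cut-offs (75 and 90) and the math_score and science_score columns are assumptions introduced only for illustration; the source does not specify them.

```python
def attendance_category(percent):
    """Bucket an attendance percentage; the cut-offs are illustrative."""
    if percent < 75:
        return "Low"
    if percent < 90:
        return "Medium"
    return "High"


def add_features(row):
    """Return a copy of the row with two engineered features added."""
    new_row = dict(row)
    new_row["attendance_level"] = attendance_category(row["attendance"])
    # math_score and science_score are hypothetical columns.
    new_row["total_score"] = row["math_score"] + row["science_score"]
    return new_row


student = {"attendance": 85.0, "math_score": 40, "science_score": 35}
engineered = add_features(student)
print(engineered)
```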

    Step 5: Data Encoding

    Most Machine Learning algorithms work only with numbers, not text.
    Text values must therefore be converted into numbers before training.

    Example:

    Result
    ------
    Pass
    Fail

    Convert to:

    • Pass = 1

    • Fail = 0
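
    The Pass/Fail mapping above can be applied with a simple dictionary lookup. This is only a sketch for a two-value label; the encode_labels helper is a hypothetical name.

```python
# Mapping taken from the text: Pass = 1, Fail = 0.
LABEL_TO_NUMBER = {"Pass": 1, "Fail": 0}


def encode_labels(labels, mapping=LABEL_TO_NUMBER):
    """Convert text labels to integers; raises KeyError on an unknown label."""
    return [mapping[label] for label in labels]


encoded = encode_labels(["Fail", "Pass", "Pass"])
print(encoded)  # [0, 1, 1]
```

    Failing loudly on an unknown label (for example a misspelled "pass") is a deliberate choice here, since silently guessing a number would hide a data-cleaning problem.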

    Step 6: Data Splitting

    Divide the dataset into two parts:

    • Training Data (70–80%)

    • Testing Data (20–30%)

    Example:

    If the dataset has 1000 rows and an 80/20 split is used:

    • 800 → Training

    • 200 → Testing
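
    A minimal sketch of an 80/20 split on 1000 rows, using only the standard library. The rows are shuffled first so that neither part depends on the original row order. The function name, the seed value, and the use of integers as stand-in rows are assumptions for this example.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


dataset = list(range(1000))  # stand-in for 1000 data rows
train, test = train_test_split(dataset)
print(len(train), len(test))  # 800 200
```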


    Model Training

    Once the data is prepared, the model is trained on the training data.

    Step 1: Select Algorithm

    Choose an algorithm based on the problem type:

    • Regression → Linear Regression

    • Classification → Logistic Regression

    • Clustering → K-Means

    Step 2: Train Model

    The model learns patterns from training data.

    Example:
    The machine learns:
    “More study hours → Higher chance of passing”
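
    To make "learning a pattern" concrete, here is a small sketch of logistic regression trained by batch gradient descent on a single feature. The training data, learning rate, and number of epochs are hypothetical values chosen for illustration, not a recommended configuration.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(hours, passed, learning_rate=0.1, epochs=5000):
    """Fit p(pass) = sigmoid(w * hours + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(hours)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(hours, passed):
            error = sigmoid(w * x + b) - y  # prediction minus actual label
            grad_w += error * x
            grad_b += error
        # Move the parameters a small step against the average gradient.
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


# Hypothetical training data: study hours and encoded result (Pass = 1, Fail = 0).
hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(hours, passed)
print(f"w={w:.2f}, b={b:.2f}")
print(f"P(pass | 1 hour)  = {sigmoid(w * 1 + b):.2f}")
print(f"P(pass | 6 hours) = {sigmoid(w * 6 + b):.2f}")
```

    A positive weight w is the numeric form of the pattern quoted above: the predicted chance of passing rises as study hours increase.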

    Step 3: Model Testing

    Use the testing data, which the model did not see during training, to check how closely its predictions match the actual results.
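
    Testing can be sketched as predicting on held-out rows and comparing each prediction with the known result. The parameters W and B below are hypothetical stand-ins for values that training would produce, and the test rows are made up for illustration.

```python
import math

# Hypothetical parameters standing in for a trained model.
W, B = 1.2, -4.2


def predict(hours, w=W, b=B, threshold=0.5):
    """Return 1 (Pass) if the predicted probability reaches the threshold."""
    probability = 1.0 / (1.0 + math.exp(-(w * hours + b)))
    return 1 if probability >= threshold else 0


# Hypothetical held-out test data (Pass = 1, Fail = 0).
test_hours = [2, 5, 3, 6]
test_actual = [0, 1, 1, 1]

predictions = [predict(h) for h in test_hours]
correct = sum(p == a for p, a in zip(predictions, test_actual))
print(predictions)
print(f"{correct}/{len(test_actual)} correct")
```

    In this made-up example the student with 3 study hours passed but is predicted to fail, which is exactly the kind of mistake that testing is meant to reveal.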

    Step 4: Evaluate Model

    Regression Metrics:

    • MAE (Mean Absolute Error)

    • MSE (Mean Squared Error)

    • R² Score
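
    The three regression metrics can be computed directly from their standard definitions. The sketch below uses hypothetical actual and predicted values; the function names are chosen for this example.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences; large errors weigh more."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2_score(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical values for illustration.
actual = [50, 60, 70, 80]
predicted = [52, 58, 73, 79]

print(mean_absolute_error(actual, predicted))  # 2.0
print(mean_squared_error(actual, predicted))   # 4.5
print(round(r2_score(actual, predicted), 3))   # 0.964
```

    Lower MAE and MSE mean smaller errors, while an R² closer to 1 means the predictions explain more of the variation in the actual values.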

    Classification Metrics:

    • Accuracy

    • Precision

    • Recall

    • Confusion Matrix
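
    The classification metrics all follow from the four counts in a confusion matrix. The sketch below assumes binary labels with Pass = 1 as the positive class; the sample labels and function names are hypothetical.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 1 and p == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def accuracy(c):
    """Share of all predictions that were correct."""
    return (c["tp"] + c["tn"]) / sum(c.values())


def precision(c):
    """Of the rows predicted positive, the share that really were positive."""
    denominator = c["tp"] + c["fp"]
    return c["tp"] / denominator if denominator else 0.0


def recall(c):
    """Of the rows that really were positive, the share the model found."""
    denominator = c["tp"] + c["fn"]
    return c["tp"] / denominator if denominator else 0.0


# Hypothetical labels (Pass = 1, Fail = 0).
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 0, 1, 0, 1, 1, 0, 1]

cm = confusion_matrix(actual, predicted)
print(cm)             # {'tp': 4, 'fp': 1, 'fn': 1, 'tn': 2}
print(accuracy(cm))   # 0.75
print(precision(cm))  # 0.8
print(recall(cm))     # 0.8
```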