Data Preparation

  • Prepare and structure datasets for analysis and modeling.
  • What is Data Cleaning?

    Data Cleaning is the process of:

    • Removing errors

    • Handling missing values

    • Fixing data types

    • Removing duplicates

    • Handling outliers

    Simple meaning:
    "Turning messy data into clean, usable data."



  • Handling Missing Values

    Missing data can bias summary statistics and cause errors in models, so check for it before analysis.

    Check Missing Values

df.isnull().sum()
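    A minimal sketch of the check above on a small, made-up DataFrame. The column names and values are assumptions for illustration only; the percentage view is an optional extra that can make it easier to judge how serious the gaps are.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
df = pd.DataFrame({
    "name": ["Asha", "Ravi", None, "Meena"],
    "marks": [82.0, np.nan, 67.0, np.nan],
})

# Number of missing values in each column
missing_counts = df.isnull().sum()

# Share of missing values in each column, as a percentage
missing_pct = df.isnull().mean() * 100

print(missing_counts)
print(missing_pct)
```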
  • Methods to Handle Missing Data

    Method              When to Use
    Remove rows         If few missing values
    Fill with mean      For numerical data
    Fill with median    If outliers exist
    Fill with mode      For categorical data


    Example:

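# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df["marks"] = df["marks"].fillna(df["marks"].mean())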
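    The four methods listed above could be sketched side by side as follows. The data is made up, with one deliberately extreme mark (250) so the difference between mean and median is visible; the "city" column is an assumed categorical column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; 250 is a deliberate outlier.
df = pd.DataFrame({
    "marks": [40.0, 50.0, np.nan, 60.0, 250.0],
    "city": ["Pune", None, "Pune", "Delhi", "Pune"],
})

# Remove rows: drop every row that has at least one missing value
dropped = df.dropna()

# Fill with mean: the outlier pulls the mean up to 100
mean_filled = df["marks"].fillna(df["marks"].mean())

# Fill with median: less affected by the outlier (55 here)
median_filled = df["marks"].fillna(df["marks"].median())

# Fill with mode: mode() returns a Series because ties are possible,
# so take the first entry
city_filled = df["city"].fillna(df["city"].mode()[0])

print(len(dropped), mean_filled[2], median_filled[2], city_filled[1])
```

    Each fill here returns a new Series, so the original df stays unchanged until the result is assigned back to a column.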
  • Removing Duplicates

    Duplicate records inflate counts and skew averages, which leads to incorrect analysis.

df.drop_duplicates(inplace=True)
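    It can help to count duplicates before dropping them, and to decide which columns define a duplicate. A small sketch, assuming a hypothetical "student_id" column that should be unique:

```python
import pandas as pd

# Hypothetical sample data: student 2 appears twice with identical rows,
# student 3 appears twice with different marks.
df = pd.DataFrame({
    "student_id": [1, 2, 2, 3, 3],
    "marks": [70, 85, 85, 90, 95],
})

# Count rows that are exact copies of an earlier row
n_exact = df.duplicated().sum()

# Drop exact copies only (all columns must match)
deduped = df.drop_duplicates()

# Treat rows with the same student_id as duplicates and keep the last one
latest = df.drop_duplicates(subset="student_id", keep="last")

print(n_exact)
print(deduped)
print(latest)
```

    Whether to keep the first or the last record is a judgment call that depends on how the data was collected.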
  • Fixing Data Types

    Sometimes numbers are stored as text.

df["marks"] = df["marks"].astype(int)
  • Check data types:

df.info()
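    astype(int) raises an error if even one value cannot be parsed as a number. One possible alternative is pd.to_numeric with errors="coerce", which turns unparseable entries into missing values. The sample data below, including the "absent" entry, is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: marks stored as text, one entry is not a number.
df = pd.DataFrame({"marks": ["78", "91", "absent", "64"]})

# Convert text to numbers; anything unparseable becomes NaN instead of
# raising an error
df["marks"] = pd.to_numeric(df["marks"], errors="coerce")

# Confirm the new dtype and see how many values failed to convert
print(df["marks"].dtype)
print(df["marks"].isnull().sum())
```

    The converted column is a float type because NaN cannot be stored in a plain integer column; handle the missing value first if integers are needed.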
  • Handling Outliers

    Methods:

    • IQR method

    • Z-score

    • Cap values

    (Covered earlier in Outlier Detection.)
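    As a quick reminder, the three methods can be sketched on a small made-up Series. The 1.5 × IQR fences are the common convention, assumed here; the Z-score cutoff is left to the reader because it is a judgment call.

```python
import pandas as pd

# Hypothetical sample data with one extreme value.
marks = pd.Series([55, 60, 62, 65, 68, 70, 72, 300])

# IQR method: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
q1 = marks.quantile(0.25)
q3 = marks.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (marks < lower) | (marks > upper)

# Cap values: pull anything outside the fences back to the fence
capped = marks.clip(lower=lower, upper=upper)

# Z-score: distance from the mean in standard deviations
z = (marks - marks.mean()) / marks.std()

print(marks[is_outlier])
print(capped.max())
print(z.round(2))
```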



  • Standardizing Text Data

    Example:
    "Male", "male", and "MALE" mean the same thing but are counted as three different values.

df["gender"] = df["gender"].str.lower()
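    Stray spaces cause the same problem as mixed case, so it is often worth stripping whitespace in the same step. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical sample data: mixed case, stray spaces, and one missing value.
df = pd.DataFrame({"gender": ["Male", "male ", " MALE", "Female", None]})

# Remove leading/trailing spaces, then convert to lowercase.
# Missing values stay missing; the .str methods skip them.
df["gender"] = df["gender"].str.strip().str.lower()

print(df["gender"].value_counts())
```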
  • Renaming Columns

df.rename(columns={"Total Marks": "total_marks"}, inplace=True)
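    When many columns need the same clean-up, renaming them one by one gets tedious. One possible approach is to apply the same string operations to all column names at once. The column names below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column names with spaces and mixed case.
df = pd.DataFrame(columns=["Total Marks", " Student Name", "Roll No"])

# Strip stray spaces, lowercase, and replace inner spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(df.columns))
```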