Handling Missing Data

  • Learn techniques to detect and handle missing data in Pandas for clean datasets.
  • Understanding Missing Data in Pandas

    What is Missing Data?

    Missing data refers to values that are:

    • Not available

    • Not recorded

    • Corrupted

    • Incomplete

    In Pandas, missing values are represented as:

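    • NaN (Not a Number): a floating-point marker, used in numeric columns

    • None: the Python null object, which Pandas treats as missing

    • NaT (Not a Time): used in datetime and timedelta columns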

    Example Dataset with Missing Values

import pandas as pd
import numpy as np

data = {
    "Name": ["Aman", "Riya", "Neha", "Karan"],
    "Age": [22, np.nan, 23, 30],
    "City": ["Delhi", "Mumbai", None, "Delhi"],
    "Marks": [85, 90, np.nan, 92]
}

df = pd.DataFrame(data)
print(df)
  • Detecting Missing Values

    isnull()

    Returns a boolean DataFrame of the same shape, with True wherever a value is missing.

df.isnull()
  • Count Missing Values Per Column

df.isnull().sum()
  • notnull()

    Returns a boolean DataFrame of the same shape, with True wherever a value is present.

df.notnull()
  • Checking If Any Missing Values Exist

df.isnull().any()  # one True/False per column; chain .any() again for a single answer
  • Total Missing Values in DataFrame

df.isnull().sum().sum()
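
    The detection calls above can be tried together on the example dataset. The following is a minimal sketch, not a fixed recipe; the variable names are illustrative, and the row filter with any(axis=1) is an addition that is not shown in the snippets above.

```python
import numpy as np
import pandas as pd

# Same example dataset as earlier in this lesson
df = pd.DataFrame({
    "Name": ["Aman", "Riya", "Neha", "Karan"],
    "Age": [22, np.nan, 23, 30],
    "City": ["Delhi", "Mumbai", None, "Delhi"],
    "Marks": [85, 90, np.nan, 92],
})

missing_per_column = df.isnull().sum()          # one count per column
columns_with_missing = df.isnull().any()        # one boolean per column
total_missing = int(df.isnull().sum().sum())    # single number for the whole DataFrame

# Addition for illustration: select the rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print("Total missing:", total_missing)
print(rows_with_missing)
```

    Looking at the affected rows, and not only the counts, often helps when deciding whether to drop or fill.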
  • Removing Missing Data – dropna()

    dropna() removes rows or columns containing missing values.

    Drop Rows with Missing Values

df.dropna()
  • Drop Columns with Missing Values

df.dropna(axis=1)
  • Axis Explanation

    • axis=0 → Drop rows

    • axis=1 → Drop columns

    Drop Rows with All Missing Values

df.dropna(how="all")
  • Drop Rows with Any Missing Value

df.dropna(how="any")
  • Drop Rows Based on Specific Columns

df.dropna(subset=["Marks"])
  • Apply Changes Permanently

df.dropna(inplace=True)
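
    To see how the dropna() options differ, the sketch below applies each one to the example dataset and keeps every result in its own variable. The variable names are illustrative. Without inplace=True, dropna() returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aman", "Riya", "Neha", "Karan"],
    "Age": [22, np.nan, 23, 30],
    "City": ["Delhi", "Mumbai", None, "Delhi"],
    "Marks": [85, 90, np.nan, 92],
})

rows_any = df.dropna()                     # drop rows with at least one missing value
cols_any = df.dropna(axis=1)               # drop columns with at least one missing value
rows_all = df.dropna(how="all")            # drop only rows where every value is missing
rows_marks = df.dropna(subset=["Marks"])   # look at the "Marks" column only

print("Rows kept by dropna():", list(rows_any.index))
print("Columns kept by dropna(axis=1):", list(cols_any.columns))
print("Rows kept by how='all':", len(rows_all))
print("Rows kept by subset=['Marks']:", list(rows_marks.index))
print("Original shape is unchanged:", df.shape)
```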
  • Filling Missing Data – fillna()

    Instead of removing data, we can replace missing values.

    Fill with Constant Value

df.fillna(0)
  • Fill Specific Column

df["Age"] = df["Age"].fillna(25)  # assign back; inplace=True on a selected column is unreliable in recent Pandas
  • Fill with Mean (Numeric Data)

df["Marks"] = df["Marks"].fillna(df["Marks"].mean())
  • Fill with Median

df["Marks"] = df["Marks"].fillna(df["Marks"].median())
  • Fill with Mode (Categorical Data)

df["City"] = df["City"].fillna(df["City"].mode()[0])  # mode() returns a Series; [0] takes the most frequent value
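
    The fill strategies above can be combined on a copy of the example dataset. This is a minimal sketch; pairing the median with "Age" and the mean with "Marks" is an arbitrary choice made for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aman", "Riya", "Neha", "Karan"],
    "Age": [22, np.nan, 23, 30],
    "City": ["Delhi", "Mumbai", None, "Delhi"],
    "Marks": [85, 90, np.nan, 92],
})

filled = df.copy()  # work on a copy so the raw data stays available

# Numeric columns: fill with a statistic computed from the observed values
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Marks"] = filled["Marks"].fillna(filled["Marks"].mean())

# Categorical column: fill with the most frequent value
filled["City"] = filled["City"].fillna(filled["City"].mode()[0])

print(filled)
print("Missing values left:", int(filled.isnull().sum().sum()))
```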
  • Forward Fill (ffill)

    Fills each missing value with the last valid value above it in the same column.

df.ffill(inplace=True)  # fillna(method="ffill") is deprecated in recent Pandas versions
  • Backward Fill (bfill)

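    Fills each missing value with the next valid value below it in the same column.

df.bfill(inplace=True)  # fillna(method="bfill") is deprecated in recent Pandas versions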
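
    One detail worth checking is what happens at the edges: forward fill cannot fill a gap at the start of a column, and backward fill cannot fill one at the end. The sketch below uses a small made-up Series (the values are hypothetical) to show this.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps at the start, in the middle, and at the end
readings = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

forward = readings.ffill()           # the first value stays missing
backward = readings.bfill()          # the last value stays missing
both = readings.ffill().bfill()      # forward fill first, then backward fill the leftover

print(pd.DataFrame({
    "original": readings,
    "ffill": forward,
    "bfill": backward,
    "ffill+bfill": both,
}))
```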
  • Advanced Missing Data Handling

    Interpolation

    Estimates missing numeric values from the neighboring values (linear by default). Useful when the data follows a trend.

df["Marks"] = df["Marks"].interpolate()
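
    As a small worked case, here is the "Marks" column from the example dataset with default (linear) interpolation. The gap sits between 90 and 92, so the estimate is the midpoint.

```python
import numpy as np
import pandas as pd

marks = pd.Series([85, 90, np.nan, 92], name="Marks")

# Default method is linear: the gap is filled halfway between its neighbors
interpolated = marks.interpolate()

print(interpolated)
```

    Linear interpolation treats rows as equally spaced, which may not suit every dataset.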
  • Replacing Specific Values with NaN

df.replace("", np.nan, inplace=True)
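
    Raw files often encode missing values as placeholders that Pandas does not recognize on its own. The sketch below assumes two hypothetical placeholders, the strings "" and "N/A" in a text column and the sentinel -1 in a numeric column; real data will use its own conventions.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: "", "N/A" and -1 stand in for missing values
raw = pd.DataFrame({
    "City": ["Delhi", "", "N/A", "Mumbai"],
    "Marks": [85, -1, 90, 92],
})

cleaned = raw.copy()
cleaned["City"] = cleaned["City"].replace(["", "N/A"], np.nan)
cleaned["Marks"] = cleaned["Marks"].replace(-1, np.nan)

print("Before:", int(raw.isnull().sum().sum()), "missing values detected")
print("After: ", int(cleaned.isnull().sum().sum()), "missing values detected")
```

    Once the placeholders are converted, isnull(), dropna() and fillna() see them like any other missing value.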
  • Checking Percentage of Missing Data

(df.isnull().sum() / len(df)) * 100
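
    Counts and percentages can be shown side by side in a small summary table. This is one possible layout, with illustrative column names; isnull().mean() is used here as a shorter equivalent of dividing the counts by len(df).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aman", "Riya", "Neha", "Karan"],
    "Age": [22, np.nan, 23, 30],
    "City": ["Delhi", "Mumbai", None, "Delhi"],
    "Marks": [85, 90, np.nan, 92],
})

summary = pd.DataFrame({
    "missing": df.isnull().sum(),          # count per column
    "percent": df.isnull().mean() * 100,   # share of rows that are missing, in percent
}).sort_values("percent", ascending=False)

print(summary)
```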
  • Choosing the Right Strategy

    Scenario → Recommended Approach

    • Few missing rows → dropna()

    • Numeric column → fill with mean/median

    • Categorical column → fill with mode

    • Time-series data → forward fill

    • Large missing percentage → consider dropping the column
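
    The guidance above can be turned into a rough helper that suggests an approach per column. This is only a sketch: suggest_strategy is a hypothetical function, the 50% threshold is an assumption chosen for illustration, the "Remarks" column is made up, and the time-series case is not covered because it depends on how the rows are ordered.

```python
import numpy as np
import pandas as pd


def suggest_strategy(frame, drop_threshold=0.5):
    """Suggest a cleaning approach per column (hypothetical helper, illustrative threshold)."""
    suggestions = {}
    for column in frame.columns:
        missing_share = frame[column].isnull().mean()  # fraction of rows that are missing
        if missing_share == 0:
            suggestions[column] = "nothing to do"
        elif missing_share > drop_threshold:
            suggestions[column] = "consider dropping column"
        elif pd.api.types.is_numeric_dtype(frame[column]):
            suggestions[column] = "fill with mean/median"
        else:
            suggestions[column] = "fill with mode"
    return suggestions


df = pd.DataFrame({
    "Name": ["Aman", "Riya", "Neha", "Karan"],
    "Age": [22, np.nan, 23, 30],
    "City": ["Delhi", "Mumbai", None, "Delhi"],
    "Marks": [85, 90, np.nan, 92],
    "Remarks": [None, None, None, "Good"],  # made-up column that is mostly missing
})

suggestions = suggest_strategy(df)
for column, suggestion in suggestions.items():
    print(f"{column}: {suggestion}")
```

    The output is a starting point for a decision, not a substitute for looking at the data.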



    Real-World Workflow

    1. Inspect data (df.info())

    2. Detect missing values (isnull().sum())

    3. Analyze the percentage of missing values per column

    4. Choose strategy (drop or fill)

    5. Validate cleaned data


    Common Mistakes to Avoid

    • Dropping too much data

    • Filling numeric data with arbitrary values (for example, 0) that distort the column

    • Ignoring missing data before modeling

    • Using the mean for skewed data
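
    The last mistake is easy to see with numbers. The sketch below uses a made-up income column (hypothetical values) with one extreme entry and compares what the mean and the median would fill in.

```python
import numpy as np
import pandas as pd

# Hypothetical, skewed data: one very large value pulls the mean upward
income = pd.Series([30, 32, 35, 31, np.nan, 500], name="Income")

mean_fill = income.fillna(income.mean())
median_fill = income.fillna(income.median())

print("Mean used for filling:  ", income.mean())
print("Median used for filling:", income.median())
```

    Here the mean lands far above every typical value, while the median stays close to them.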


    Best Practices
    • Always analyze before cleaning

    • Document cleaning steps

    • Use median for skewed numeric data

    • Validate dataset after cleaning

    • Check impact on model performance


    Mini Practical Example

# Step 1: Check missing values
print(df.isnull().sum())

# Step 2: Fill numeric columns
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Step 3: Fill categorical columns
df["City"] = df["City"].fillna(df["City"].mode()[0])

# Step 4: Verify
print(df.isnull().sum())