Handling Missing Data
Learn techniques to detect and handle missing data in Pandas for clean datasets.
Understanding Missing Data in Pandas
What is Missing Data?
Missing data refers to values that are:
Not available
Not recorded
Corrupted
Incomplete
In Pandas, missing values are represented as:
NaN (Not a Number)
None
NaT (for datetime)
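As a small illustration of these markers, the sketch below builds two made-up series (the values are assumptions, not part of the dataset used later) and checks them with isna(). It suggests that None in a numeric column is stored as NaN, and that a missing timestamp is stored as NaT.

```python
import numpy as np
import pandas as pd

# None in a numeric series is converted to NaN, so the dtype is float64
nums = pd.Series([1.0, np.nan, None])

# A missing entry in a datetime series is stored as NaT
dates = pd.to_datetime(pd.Series(["2024-01-01", None]))

print(nums.isna().tolist())   # [False, True, True]
print(dates.isna().tolist())  # [False, True]
print(nums.dtype)             # float64
```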
Example Dataset with Missing Values
import pandas as pd
import numpy as np
data = {
"Name": ["Aman", "Riya", "Neha", "Karan"],
"Age": [22, np.nan, 23, 30],
"City": ["Delhi", "Mumbai", None, "Delhi"],
"Marks": [85, 90, np.nan, 92]
}
df = pd.DataFrame(data)
print(df)
Detecting Missing Values
isnull()
Returns a DataFrame of booleans, with True wherever a value is missing. isna() is an equivalent alias.
df.isnull()
Count Missing Values Per Column
df.isnull().sum()
notnull()
Returns True for non-missing values.
df.notnull()
Checking If Any Missing Values Exist
df.isnull().any()
Total Missing Values in DataFrame
df.isnull().sum().sum()
Removing Missing Data – dropna()
dropna() removes rows or columns containing missing values.
Drop Rows with Missing Values
df.dropna()
Drop Columns with Missing Values
df.dropna(axis=1)
Axis Explanation
axis=0 → Drop rows
axis=1 → Drop columns
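To make the axis choice concrete, here is a short sketch that rebuilds the example dataset and compares both directions. Only rows with no gaps survive axis=0, and only columns with no gaps survive axis=1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aman", "Riya", "Neha", "Karan"],
    "Age": [22, np.nan, 23, 30],
    "City": ["Delhi", "Mumbai", None, "Delhi"],
    "Marks": [85, 90, np.nan, 92],
})

rows_dropped = df.dropna(axis=0)  # removes Riya and Neha
cols_dropped = df.dropna(axis=1)  # removes Age, City and Marks

print(rows_dropped.shape)             # (2, 4)
print(cols_dropped.columns.tolist())  # ['Name']
```

Dropping columns here discards three of the four columns, so axis=1 is usually worth it only when a column is mostly empty.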
Drop Rows with All Missing Values
df.dropna(how="all")
Drop Rows with Any Missing Value
df.dropna(how="any")
Drop Rows Based on Specific Columns
df.dropna(subset=["Marks"])
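The how and subset options above can be compared side by side. The sketch below rebuilds the example dataset and counts the rows each call keeps; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aman", "Riya", "Neha", "Karan"],
    "Age": [22, np.nan, 23, 30],
    "City": ["Delhi", "Mumbai", None, "Delhi"],
    "Marks": [85, 90, np.nan, 92],
})

all_missing = df.dropna(how="all")          # no row is entirely empty, so nothing is dropped
any_missing = df.dropna(how="any")          # drops every row with at least one gap
marks_present = df.dropna(subset=["Marks"])  # only looks at the Marks column

print(len(all_missing))    # 4
print(len(any_missing))    # 2
print(len(marks_present))  # 3
```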
Apply Changes Permanently
df.dropna(inplace=True)
Filling Missing Data – fillna()
Instead of removing data, we can replace missing values.
Fill with Constant Value
df.fillna(0)
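Fill Specific Column
# Assign the result back: inplace=True on a selected column may not update df in recent pandas versions
df["Age"] = df["Age"].fillna(25)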
Fill with Mean (Numeric Data)
df["Marks"] = df["Marks"].fillna(df["Marks"].mean())
Fill with Median
df["Marks"] = df["Marks"].fillna(df["Marks"].median())
Fill with Mode (Categorical Data)
df["City"] = df["City"].fillna(df["City"].mode()[0])
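The per-column fills above can also be combined into one call, because fillna() accepts a dictionary that maps column names to fill values. This is a sketch on the example dataset; which statistic suits which column is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aman", "Riya", "Neha", "Karan"],
    "Age": [22, np.nan, 23, 30],
    "City": ["Delhi", "Mumbai", None, "Delhi"],
    "Marks": [85, 90, np.nan, 92],
})

# One fill value per column, computed from the non-missing entries
fill_values = {
    "Age": df["Age"].mean(),
    "Marks": df["Marks"].median(),
    "City": df["City"].mode().iloc[0],
}

# fillna returns a new DataFrame; df itself is left unchanged
filled = df.fillna(fill_values)
print(filled)
```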
Forward Fill (ffill)
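Fills each missing value with the last valid value above it. A missing value in the first row stays missing.
# fillna(method="ffill") is deprecated since pandas 2.1; use ffill() directly
df.ffill(inplace=True)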
Backward Fill (bfill)
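Fills each missing value with the next valid value below it. A missing value in the last row stays missing.
# fillna(method="bfill") is deprecated since pandas 2.1; use bfill() directly
df.bfill(inplace=True)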
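A small sketch on a made-up series shows the difference between the two directions, including the edge values that neither method can fill.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 4.0, np.nan])

forward = s.ffill()   # carries the last valid value downward
backward = s.bfill()  # pulls the next valid value upward

print(forward.tolist())   # [nan, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 1.0, 4.0, 4.0, nan]
```

The leading NaN survives ffill() and the trailing NaN survives bfill(), so it is worth re-checking isnull().sum() afterwards.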
Advanced Missing Data Handling
Interpolation
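Estimates missing numeric values from their neighbors (linear by default). Best suited to ordered data such as time series.
df["Marks"] = df["Marks"].interpolate()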
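As a rough illustration of the default linear method, the sketch below interpolates a made-up series of marks. The gaps are filled with evenly spaced values between the known endpoints.

```python
import numpy as np
import pandas as pd

marks = pd.Series([80.0, np.nan, np.nan, 95.0])

# Default method is linear: values are spaced evenly between 80 and 95
filled = marks.interpolate()

print(filled.tolist())  # [80.0, 85.0, 90.0, 95.0]
```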
Replacing Specific Values with NaN
df.replace("", np.nan, inplace=True)
Checking Percentage of Missing Data
(df.isnull().sum() / len(df)) * 100
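An equivalent way to get the same percentages is to take the mean of the boolean mask, since True counts as 1. The sketch below applies it to the example dataset and sorts the result so the sparsest columns come first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aman", "Riya", "Neha", "Karan"],
    "Age": [22, np.nan, 23, 30],
    "City": ["Delhi", "Mumbai", None, "Delhi"],
    "Marks": [85, 90, np.nan, 92],
})

# Mean of True/False per column equals the fraction of missing values
missing_pct = df.isna().mean() * 100

print(missing_pct.sort_values(ascending=False))
```

Here Age, City and Marks each come out at 25.0 and Name at 0.0.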
Choosing the Right Strategy
Real-World Workflow
Inspect data (df.info())
Detect missing values (isnull().sum())
Analyze percentage
Choose strategy (drop or fill)
Validate cleaned data
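The workflow above can be sketched as a single function. The helper name clean_missing, the 50% threshold, and the choice of median for numeric columns and mode for the rest are all assumptions made for illustration, not a fixed recipe.

```python
import numpy as np
import pandas as pd

def clean_missing(df, max_missing_pct=50.0):
    """Hypothetical helper: drop very sparse columns, then fill the rest."""
    cleaned = df.copy()

    # Analyze: percentage of missing values per column
    missing_pct = cleaned.isna().mean() * 100

    # Drop columns above the (assumed) threshold
    too_sparse = missing_pct[missing_pct > max_missing_pct].index
    cleaned = cleaned.drop(columns=too_sparse)

    # Fill: median for numeric columns, most frequent value otherwise
    for col in cleaned.columns:
        if not cleaned[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(cleaned[col]):
            cleaned[col] = cleaned[col].fillna(cleaned[col].median())
        else:
            modes = cleaned[col].mode()
            if not modes.empty:
                cleaned[col] = cleaned[col].fillna(modes.iloc[0])
    return cleaned

df = pd.DataFrame({
    "Name": ["Aman", "Riya", "Neha", "Karan"],
    "Age": [22, np.nan, 23, 30],
    "City": ["Delhi", "Mumbai", None, "Delhi"],
    "Marks": [85, 90, np.nan, 92],
})

cleaned = clean_missing(df)

# Validate: no missing values should remain
print(cleaned.isna().sum())
```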
Common Mistakes to Avoid
Dropping too much data
Filling numeric data with wrong values
Ignoring missing data before modeling
Using mean for skewed data
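To see why the mean can mislead on skewed data, consider this sketch with made-up values and one large outlier. The mean is pulled far from the typical value, while the median is not.

```python
import numpy as np
import pandas as pd

# Made-up values: four typical entries, one outlier, one missing
values = pd.Series([30, 32, 35, 31, 500, np.nan])

mean_value = values.mean()      # NaN is skipped by default
median_value = values.median()

print(mean_value)    # 125.6
print(median_value)  # 32.0

# Filling with the median keeps the imputed value close to typical entries
filled = values.fillna(median_value)
```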
Best Practices
Always analyze before cleaning
Document cleaning steps
Use median for skewed numeric data
Validate dataset after cleaning
Check impact on model performance
Mini Practical Example
# Step 1: Check missing values
print(df.isnull().sum())
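# Step 2: Fill numeric columns (assign back instead of using inplace on a column)
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Marks"] = df["Marks"].fillna(df["Marks"].mean())
# Step 3: Fill categorical columns
df["City"] = df["City"].fillna(df["City"].mode()[0])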
# Step 4: Verify
print(df.isnull().sum())