Data Preparation
-
Prepare and structure datasets for analysis and modeling.
What is Data Cleaning?
Data Cleaning is the process of:
Removing errors
Handling missing values
Fixing data types
Removing duplicates
Handling outliers
Simple meaning:
"Turning messy data into clean, usable data."
Handling Missing Values
Missing data can distort averages, counts, and model results, so check for it first.
Check Missing Values
df.isnull().sum()
Methods to Handle Missing Data
Example (fill missing marks with the column mean):
df["marks"] = df["marks"].fillna(df["marks"].mean())
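The notes name only the mean-fill example, so here is a minimal sketch of two common options side by side: dropping rows and filling with the mean. The sample names and marks are made up for illustration. The assignment form is used instead of inplace=True on a single column, which recent pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not from the original notes.
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena", "John"],
    "marks": [80.0, np.nan, 70.0, 90.0],
})

# Count missing values in each column.
missing_counts = df.isnull().sum()

# Option 1: drop the rows where "marks" is missing.
dropped = df.dropna(subset=["marks"])

# Option 2: fill missing marks with the column mean.
filled = df.copy()
filled["marks"] = filled["marks"].fillna(filled["marks"].mean())

print(missing_counts)
print(dropped)
print(filled)
```

Dropping is safer when only a few rows are affected. Filling keeps every row but pulls the filled values toward the average.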
Removing Duplicates
Duplicate records inflate counts and skew averages, which leads to incorrect analysis.
df.drop_duplicates(inplace=True)
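Before dropping duplicates, it can help to count them first. The sketch below uses a made-up table with a hypothetical student_id column. It shows a full-row check and a check on a single key column.

```python
import pandas as pd

# Hypothetical sample data: the row for student 2 appears twice.
df = pd.DataFrame({
    "student_id": [1, 2, 2, 3],
    "marks": [80, 75, 75, 90],
})

# Count rows that exactly repeat an earlier row.
dup_count = df.duplicated().sum()

# Drop exact duplicate rows and rebuild the index.
deduped = df.drop_duplicates().reset_index(drop=True)

# Alternative: treat rows as duplicates when only student_id repeats.
deduped_by_id = df.drop_duplicates(subset=["student_id"], keep="first")

print(dup_count)
print(deduped)
```

Using subset is a judgment call. It assumes the key column alone identifies a record.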
Fixing Data Types
Sometimes numbers are stored as text.
df["marks"] = df["marks"].astype(int)
Check data types:
df.info()
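astype(int) raises an error if the column contains text that is not a number, or missing values. A more forgiving sketch uses pd.to_numeric with errors="coerce". The "absent" entry below is an assumed example of bad data.

```python
import pandas as pd

# Hypothetical sample data: numbers stored as text, plus one bad entry.
df = pd.DataFrame({"marks": ["85", "90", "absent", "72"]})

# Convert to numbers; values that cannot be parsed become NaN.
df["marks"] = pd.to_numeric(df["marks"], errors="coerce")

# Optional: pandas' nullable integer type keeps whole numbers
# while still allowing missing values.
df["marks_int"] = df["marks"].astype("Int64")

print(df.dtypes)
print(df)
```

After this step, the coerced entries show up in df.isnull().sum() and can be handled like any other missing value.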
Handling Outliers
Methods:
IQR method
Z-score
Cap values
(We discussed these earlier in Outlier Detection.)
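As a quick refresher, the three methods above can be sketched as follows. The data is made up, and the thresholds are common conventions rather than fixed rules: 1.5 x IQR for the fences, and a z-score cutoff of 2, chosen here only because the sample is tiny (3 is more usual on larger data).

```python
import pandas as pd

# Hypothetical sample data with one obviously extreme value.
df = pd.DataFrame({"marks": [55, 60, 62, 65, 68, 70, 72, 75, 78, 300]})

# IQR method: flag values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q1 = df["marks"].quantile(0.25)
q3 = df["marks"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
iqr_outliers = df[(df["marks"] < lower) | (df["marks"] > upper)]

# Z-score method: flag values far from the mean in standard deviations.
z_threshold = 2  # assumed cutoff for this small sample
z = (df["marks"] - df["marks"].mean()) / df["marks"].std()
z_outliers = df[z.abs() > z_threshold]

# Capping: pull extreme values back to the IQR fences instead of dropping them.
capped = df["marks"].clip(lower=lower, upper=upper)

print(iqr_outliers)
print(z_outliers)
print(capped)
```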
Standardizing Text Data
Example: "Male", "male", and "MALE" mean the same thing but are counted as three different values.
df["gender"] = df["gender"].str.lower()
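A small sketch to show the effect, using made-up values. The extra spaces and the .str.strip() step are an assumption beyond the original example, since stray whitespace is another common cause of split categories.

```python
import pandas as pd

# Hypothetical sample data with mixed case and stray spaces.
df = pd.DataFrame({"gender": ["Male", "male", "MALE", " Female", "female "]})

# Number of distinct values before cleaning.
before = df["gender"].nunique()

# Remove surrounding whitespace, then convert to lowercase.
df["gender"] = df["gender"].str.strip().str.lower()

# Number of distinct values after cleaning.
after = df["gender"].nunique()

print(before, after)
print(df["gender"].value_counts())
```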
Renaming Columns
df.rename(columns={"Total Marks": "total_marks"}, inplace=True)
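When many columns need the same treatment, renaming them one by one gets tedious. The sketch below, with hypothetical column names, renames one column explicitly and then converts all remaining names to lowercase with underscores.

```python
import pandas as pd

# Hypothetical sample data with spaces and capitals in the column names.
df = pd.DataFrame({"Total Marks": [250], "Student Name": ["Asha"]})

# Rename a single column explicitly (assignment form instead of inplace).
df = df.rename(columns={"Total Marks": "total_marks"})

# Clean every column name at once: trim, lowercase, spaces to underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(df.columns))
```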