Data Transformation
Learn how to modify, map, and transform datasets using Pandas.
Renaming Columns
Renaming columns improves readability and standardizes dataset structure.
Why Rename Columns?
Make column names meaningful
Remove spaces or special characters
Follow naming conventions
Prepare data for modeling
Example Dataset
import pandas as pd
data = {
    "emp id": [101, 102, 103],
    "emp name": ["Amit", "Sara", "John"],
    "emp salary": [50000, 60000, 55000]
}
df = pd.DataFrame(data)
print(df)
- Rename Specific Columns
df.rename(columns={
    "emp id": "employee_id",
    "emp name": "employee_name"
}, inplace=True)
print(df)
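As a minimal sketch of two rename() behaviors that are easy to miss: without inplace=True the method returns a new DataFrame, and by default it silently ignores keys that match no column. The small frame and the misspelled key "emp_id" below are illustrative assumptions, not part of the original example.

```python
import pandas as pd

# Small illustrative frame; the column names mirror the example dataset
df = pd.DataFrame({"emp id": [101, 102], "emp name": ["Amit", "Sara"]})

# Without inplace=True, rename() returns a new DataFrame and leaves df as-is
renamed = df.rename(columns={"emp id": "employee_id"})
print(list(renamed.columns))
print(list(df.columns))

# By default, keys that match no column are ignored silently.
# errors="raise" turns a misspelled name into a KeyError instead.
try:
    df.rename(columns={"emp_id": "employee_id"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True
print(typo_detected)
```

Assigning the result to a new name keeps the original frame available, which can make step-by-step debugging easier.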
- Rename All Columns
df.columns = ["id", "name", "salary"]
print(df)
- Convert Column Names to Lowercase
df.columns = df.columns.str.lower()
print(df)
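The reasons listed earlier include removing spaces and special characters, which lowercasing alone does not do. One possible approach, sketched here under the assumption that snake_case names are the target convention, chains several string methods on the column index. The messy column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with inconsistent column names
df = pd.DataFrame({
    "Emp ID": [101, 102],
    " Emp Name ": ["Amit", "Sara"],
    "Salary($)": [50000, 60000],
})

df.columns = (
    df.columns.str.strip()                        # drop leading/trailing spaces
    .str.lower()                                  # lowercase everything
    .str.replace(r"[^0-9a-z]+", "_", regex=True)  # runs of other characters become "_"
    .str.strip("_")                               # drop underscores left at the ends
)
print(list(df.columns))
```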
Type Conversion
Type conversion changes the data type of a column.
Why Convert Data Types?
Fix incorrect imports (e.g., numbers stored as strings)
Perform mathematical operations
Improve memory efficiency
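To illustrate the memory point, the sketch below compares the bytes used by the same values stored as int64 and as int32. The repeated salary values are an assumption chosen only to make the difference visible; a smaller integer type is safe only when every value fits in its range.

```python
import pandas as pd

# 3,000 salary values stored as 64-bit integers
salaries = pd.Series([50000, 60000, 55000] * 1000, dtype="int64")

# The same values stored as 32-bit integers
compact = salaries.astype("int32")

# memory_usage(index=False) counts only the values, not the index
bytes_before = salaries.memory_usage(index=False)
bytes_after = compact.memory_usage(index=False)
print(bytes_before, bytes_after)
```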
Check Data Types
print(df.dtypes)
- Convert to Integer
df["salary"] = df["salary"].astype(int)
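- Convert String to Date
# Parses the string once and assigns the same Timestamp to every row
df["joining_date"] = pd.to_datetime("2024-01-01")
# For an existing column of date strings, pass the column itself:
# df["joining_date"] = pd.to_datetime(df["joining_date"])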
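In practice the dates usually arrive as a column of strings rather than a single value. The following sketch, using made-up dates and an explicit format, shows one way to convert such a column and then read parts of the date through the .dt accessor.

```python
import pandas as pd

# Hypothetical joining dates stored as text; the last entry is invalid on purpose
dates = pd.Series(["2024-01-01", "2024-02-15", "not a date"])

# errors="coerce" replaces unparseable entries with NaT instead of raising
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
print(parsed)

# Once the column is datetime64, date parts are available through .dt
months = parsed.dt.month
print(months)
```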
- Convert Column to Float
df["salary"] = df["salary"].astype(float)
- Handle Conversion Errors
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")
errors="coerce" replaces values that cannot be parsed as numbers with NaN instead of raising an error.
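A short sketch of what coercion looks like on messy input. The raw values are invented for illustration; note that a column containing NaN becomes float, so the nullable "Int64" dtype is shown as one option when whole numbers are wanted.

```python
import pandas as pd

# Hypothetical salary column read in as text, with two bad entries
raw = pd.Series(["50000", "60,000", "n/a", "55000"])

# Entries that cannot be parsed become NaN
clean = pd.to_numeric(raw, errors="coerce")
print(clean)

# Count how many values were lost to coercion before deciding how to handle them
n_invalid = clean.isna().sum()
print(n_invalid)

# Nullable integer dtype keeps whole numbers and marks missing values as <NA>
as_int = clean.astype("Int64")
print(as_int)
```

Checking the count of coerced values first helps avoid silently discarding data that only needed cleaning, such as the thousands separator in "60,000".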
apply() Function
apply() calls a function on each value of a Series, or on each row or column of a DataFrame.
Apply on Single Column
df["salary_in_lakhs"] = df["salary"].apply(lambda x: x / 100000)
print(df)
- Apply Custom Function
def bonus(salary):
    return salary + 5000
df["salary_with_bonus"] = df["salary"].apply(bonus)
print(df)
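Series.apply() also forwards extra keyword arguments to the function, which avoids hard-coding values such as the bonus amount. The add_bonus helper and its amount parameter below are hypothetical names used for this sketch.

```python
import pandas as pd

df = pd.DataFrame({"salary": [50000, 60000, 55000]})

def add_bonus(salary, amount=5000):
    """Return the salary plus a configurable bonus (hypothetical helper)."""
    return salary + amount

# Keyword arguments after the function are passed through to add_bonus
df["salary_with_bonus"] = df["salary"].apply(add_bonus, amount=7500)
print(df)

# For simple arithmetic, plain column operations give the same result
vectorized = df["salary"] + 7500
```

For arithmetic this simple, the column operation is usually the faster choice; apply() is more useful when the logic cannot be expressed with built-in operations.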
- Apply on Entire DataFrame
# DataFrame.apply passes each selected column to the function as a Series
df["salary_double"] = df[["salary"]].apply(lambda col: col * 2)["salary"]
- Apply Row-wise (axis=1)
df["tax"] = df.apply(lambda row: row["salary"] * 0.1, axis=1)
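Row-wise apply() is most useful when the function needs several columns at once. The sketch below assumes a hypothetical bonus column and a flat 10% rate, and compares the row-wise version with an equivalent column expression.

```python
import pandas as pd

# Hypothetical frame with two numeric columns
df = pd.DataFrame({
    "salary": [50000, 60000, 55000],
    "bonus": [5000, 0, 2500],
})

# axis=1 passes each row to the function as a Series
row_wise = df.apply(lambda row: (row["salary"] + row["bonus"]) * 0.1, axis=1)

# The same calculation written as operations on whole columns
vectorized = (df["salary"] + df["bonus"]) * 0.1

df["tax"] = vectorized
print(df)
```

Both forms give the same numbers here. Row-wise apply() calls the function once per row, so on large frames the column expression is generally much faster.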
map() Function
map() is used mainly with Series to replace or transform values.
Example – Mapping Categories
df["department"] = ["IT", "HR", "IT"]
dept_map = {
    "IT": "Technology",
    "HR": "Human Resources"
}
df["department_full"] = df["department"].map(dept_map)
print(df)
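One detail worth knowing when mapping with a dictionary: values that have no key in the dictionary become NaN. The sketch below adds a hypothetical "Finance" entry to show this, along with one possible fallback that keeps the original value.

```python
import pandas as pd

# "Finance" is deliberately missing from the mapping
department = pd.Series(["IT", "HR", "Finance"])
dept_map = {"IT": "Technology", "HR": "Human Resources"}

# Unmapped values turn into NaN
mapped = department.map(dept_map)
print(mapped)

# Fall back to the original value where no mapping exists
mapped_with_fallback = mapped.fillna(department)
print(mapped_with_fallback)
```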
- Using map with Lambda
df["name_length"] = df["name"].map(len)
print(df)
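Difference Between apply() and map()
map() works element by element on a Series and accepts a function, a dictionary, or another Series as the mapping.
apply() works on both Series and DataFrames; on a DataFrame it passes whole columns (axis=0) or whole rows (axis=1) to the function.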
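A small side-by-side sketch may make the distinction concrete. The two-column frame is an illustrative assumption: on a single column, map() and apply() with the same function give the same result, while apply() on a DataFrame receives whole columns or whole rows.

```python
import pandas as pd

df = pd.DataFrame({"salary": [50000, 60000], "bonus": [5000, 2500]})

# On a Series, both call the function once per value
mapped = df["salary"].map(lambda x: x * 2)
applied = df["salary"].apply(lambda x: x * 2)
print(mapped.equals(applied))

# On a DataFrame, apply() hands the function an entire column (default axis=0)...
column_totals = df.apply(lambda col: col.sum())
print(column_totals)

# ...or an entire row (axis=1)
row_totals = df.apply(lambda row: row.sum(), axis=1)
print(row_totals)
```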
Real-World Data Transformation Example
import pandas as pd
data = {
    "Name": ["Amit", "Sara", "John"],
    "Age": ["25", "30", "28"],
    "Salary": ["50000", "60000", "55000"]
}
df = pd.DataFrame(data)
# Convert data types
df["Age"] = df["Age"].astype(int)
df["Salary"] = df["Salary"].astype(int)
# Rename columns
df.rename(columns={"Name": "Employee_Name"}, inplace=True)
# Add new column
df["Salary_After_Tax"] = df["Salary"].apply(lambda x: x * 0.9)
print(df)
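The same steps can also be written as one chained expression that leaves the input untouched. This is a sketch of an alternative style, not a replacement for the version above; the transform function name is hypothetical.

```python
import pandas as pd

def transform(raw):
    """Return a cleaned copy of raw (hypothetical helper for this sketch)."""
    return (
        raw.astype({"Age": int, "Salary": int})                 # convert data types
        .rename(columns={"Name": "Employee_Name"})              # rename columns
        .assign(Salary_After_Tax=lambda d: d["Salary"] * 0.9)   # add new column
    )

raw = pd.DataFrame({
    "Name": ["Amit", "Sara", "John"],
    "Age": ["25", "30", "28"],
    "Salary": ["50000", "60000", "55000"],
})

result = transform(raw)
print(result)
```

Because none of the chained methods modify raw, the function can be rerun on the original data while experimenting with individual steps.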
Why Data Transformation is Important
Improves data quality
Makes data consistent
Prepares data for analysis
Required before visualization or machine learning
Can reduce memory use and speed up later computations when columns have appropriate types