Data Transformation

  • Learn how to modify, map, and transform datasets using Pandas.
  • Renaming Columns

    Renaming columns improves readability and standardizes dataset structure.

    Why Rename Columns?

    • Make column names meaningful

    • Remove spaces or special characters

    • Follow naming conventions

    • Prepare data for modeling

    Example Dataset

import pandas as pd

data = {
    "emp id": [101, 102, 103],
    "emp name": ["Amit", "Sara", "John"],
    "emp salary": [50000, 60000, 55000]
}

df = pd.DataFrame(data)
print(df)
  • Rename Specific Columns

df.rename(columns={
    "emp id": "employee_id",
    "emp name": "employee_name"
}, inplace=True)

print(df)
  • Rename All Columns

df.columns = ["id", "name", "salary"]
print(df)
  • Convert Column Names to Lowercase

df.columns = df.columns.str.lower()
print(df)
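
    The goals listed under "Why Rename Columns?" (removing spaces or special characters, following a naming convention) can be combined into one cleanup step. The sketch below is one possible approach, not a standard pandas function: the helper name clean_columns and the messy column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical messy column names, used only for illustration
raw = pd.DataFrame({" Emp ID": [101], "Emp-Name": ["Amit"], "Emp Salary ($)": [50000]})

def clean_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame with snake_case column names."""
    cleaned = (
        frame.columns
        .str.strip()                                   # drop leading/trailing spaces
        .str.lower()                                   # lowercase everything
        .str.replace(r"[^0-9a-z]+", "_", regex=True)   # other character runs -> "_"
        .str.strip("_")                                # trim underscores left at the ends
    )
    out = frame.copy()
    out.columns = cleaned
    return out

clean = clean_columns(raw)
print(list(clean.columns))
```

    Returning a copy leaves the original DataFrame untouched, which makes the step easier to rerun while exploring data.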
  • Type Conversion

    Type conversion changes the data type of a column.

    Why Convert Data Types?

    • Fix incorrect imports (e.g., numbers stored as strings)

    • Perform mathematical operations

    • Improve memory efficiency

    Check Data Types

print(df.dtypes)
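
    To illustrate the memory point above, the sketch below compares the same values stored as text, as 64-bit integers, and as the smallest integer type that fits. The Series and variable names are hypothetical, and the exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

# Hypothetical column of numbers stored as text (illustration only)
as_text = pd.Series(["50000", "60000", "55000"] * 1000, name="salary")

as_int = as_text.astype("int64")                        # text -> 64-bit integers
as_small = pd.to_numeric(as_int, downcast="integer")    # smallest signed int type that fits

print(as_text.dtype, as_int.dtype, as_small.dtype)

# deep=True also counts the memory held by the string values themselves
text_bytes = as_text.memory_usage(deep=True)
int_bytes = as_int.memory_usage(deep=True)
small_bytes = as_small.memory_usage(deep=True)
print(text_bytes, int_bytes, small_bytes)
```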
  • Convert to Integer

df["salary"] = df["salary"].astype(int)
  • Convert String to Date

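df["joining_date"] = "2024-01-01"  # example value, stored as a string
df["joining_date"] = pd.to_datetime(df["joining_date"])  # string column -> datetime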
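
    Once a column holds real datetime values, its parts can be read through the .dt accessor. The following is a minimal sketch with made-up dates; passing an explicit format is optional and is shown here as an assumption about how the text is laid out.

```python
import pandas as pd

# Hypothetical joining dates stored as text (illustration only)
dates = pd.Series(["2024-01-01", "2024-02-15", "2024-03-10"])

# An explicit format documents the expected layout of the strings
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

years = parsed.dt.year      # calendar year of each date
months = parsed.dt.month    # month number, 1 to 12
print(years.tolist(), months.tolist())
```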
  • Convert Column to Float

df["salary"] = df["salary"].astype(float)
  • Handle Conversion Errors

df["salary"] = pd.to_numeric(df["salary"], errors="coerce")
  • errors="coerce" converts values that cannot be parsed to NaN instead of raising an error.
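
    The effect of errors="coerce" is easiest to see on a column that contains bad values. The sketch below uses a hypothetical Series with two entries that pd.to_numeric cannot parse, and keeps the original text of the failed rows so they can be inspected.

```python
import pandas as pd

# Hypothetical salary column with two unparseable entries (illustration only)
raw = pd.Series(["50000", "60,000", "unknown", "55000"])

numeric = pd.to_numeric(raw, errors="coerce")   # unparseable values become NaN
bad_rows = raw[numeric.isna()]                  # original text of the values that failed

print(numeric.tolist())
print(bad_rows.tolist())
```

    Checking the coerced rows before dropping or filling them helps avoid silently losing data.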



    apply() Function

    apply() runs a custom function on each value of a Series, or on each row or column of a DataFrame.

    Apply on Single Column

df["salary_in_lakhs"] = df["salary"].apply(lambda x: x / 100000)
print(df)
  • Apply Custom Function

def bonus(salary):
    return salary + 5000

df["salary_with_bonus"] = df["salary"].apply(bonus)
print(df)
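
    A custom function passed to apply() can also take extra arguments, which Series.apply forwards as keywords. The sketch below is a variation on the bonus example; the function name add_bonus and its amount parameter are hypothetical.

```python
import pandas as pd

salaries = pd.Series([50000, 60000, 55000])

def add_bonus(salary, amount=5000):
    """Hypothetical helper: return salary plus a flat bonus."""
    return salary + amount

default_bonus = salaries.apply(add_bonus)                 # uses amount=5000
larger_bonus = salaries.apply(add_bonus, amount=10000)    # keyword forwarded to add_bonus

print(default_bonus.tolist())
print(larger_bonus.tolist())
```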
  • Apply on Entire DataFrame

# DataFrame.apply passes each selected column (a Series) to the function
df["salary_double"] = df[["salary"]].apply(lambda col: col * 2)["salary"]
  • Apply Row-wise (axis=1)

df["tax"] = df.apply(lambda row: row["salary"] * 0.1, axis=1)
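
    When a row-wise calculation only involves simple arithmetic, the same result can usually be obtained with a vectorized column operation, which is typically faster on large datasets. The sketch below compares both forms on a small hypothetical DataFrame.

```python
import pandas as pd

df_demo = pd.DataFrame({"salary": [50000, 60000, 55000]})

# Row-wise apply: the lambda runs once per row
tax_apply = df_demo.apply(lambda row: row["salary"] * 0.1, axis=1)

# Vectorized equivalent: one arithmetic operation on the whole column
tax_vectorized = df_demo["salary"] * 0.1

# Largest difference between the two results
max_diff = (tax_apply - tax_vectorized).abs().max()
print(max_diff)
```

    Row-wise apply remains useful when the logic needs several columns at once or cannot be expressed with column arithmetic.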
  • map() Function

    map() transforms each value of a Series using a dictionary, another Series, or a function. Values with no match in a dictionary become NaN.

    Example – Mapping Categories

df["department"] = ["IT", "HR", "IT"]

dept_map = {
    "IT": "Technology",
    "HR": "Human Resources"
}

df["department_full"] = df["department"].map(dept_map)
print(df)
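
    A dictionary passed to map() does not need to cover every value, but unmatched values come back as NaN. The sketch below shows this with a department that is deliberately missing from the mapping, and one possible way to fall back to the original label.

```python
import pandas as pd

# "Finance" is deliberately missing from the mapping (illustration only)
dept = pd.Series(["IT", "HR", "Finance"])
dept_map = {"IT": "Technology", "HR": "Human Resources"}

mapped = dept.map(dept_map)                  # unmatched value becomes NaN
mapped_with_default = mapped.fillna(dept)    # fall back to the original label

print(mapped.tolist())
print(mapped_with_default.tolist())
```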
  • Using map with Lambda

df["name_length"] = df["name"].map(lambda x: len(x))
print(df)
  • Difference Between apply() and map()

    Feature               apply()               map()
    --------------------  --------------------  --------------------
    Works On              Series & DataFrame    Series (DataFrame.map exists since pandas 2.1)
    Custom Function       Yes                   Yes
    Dictionary Mapping    No                    Yes
    Row-wise Operation    Yes                   No


    Real-World Data Transformation Example

import pandas as pd

data = {
    "Name": ["Amit", "Sara", "John"],
    "Age": ["25", "30", "28"],
    "Salary": ["50000", "60000", "55000"]
}

df = pd.DataFrame(data)

# Convert data types
df["Age"] = df["Age"].astype(int)
df["Salary"] = df["Salary"].astype(int)

# Rename columns
df.rename(columns={"Name": "Employee_Name"}, inplace=True)

# Add new column
df["Salary_After_Tax"] = df["Salary"].apply(lambda x: x * 0.9)

print(df)
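
    The same three steps can also be written as a single method chain that returns a new DataFrame instead of modifying one in place. This is a stylistic alternative, sketched on the same data; the variable names raw and result are illustrative.

```python
import pandas as pd

raw = pd.DataFrame({
    "Name": ["Amit", "Sara", "John"],
    "Age": ["25", "30", "28"],
    "Salary": ["50000", "60000", "55000"],
})

result = (
    raw
    .astype({"Age": "int64", "Salary": "int64"})             # convert data types
    .rename(columns={"Name": "Employee_Name"})               # rename columns
    .assign(Salary_After_Tax=lambda d: d["Salary"] * 0.9)    # add new column
)

print(result)
```

    Because every step returns a new object, raw keeps its original columns and values and the chain can be rerun safely.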
  • Why Data Transformation is Important

    • Improves data quality

    • Makes data consistent

    • Prepares data for analysis

    • Required before visualization or machine learning

    • Can improve processing performance and the accuracy of downstream analysis