Next

Data Aggregation

  • Learn how to summarize and aggregate data using groupby and statistical functions.
  • groupby()

    What is groupby()?

    groupby() is used to:

    1. Split the data into groups

    2. Apply a function to each group

    3. Combine the results

    This concept follows the Split → Apply → Combine strategy.

    Sample Dataset

Creating a Pandas DataFrame from Dictionary

This code creates a Pandas DataFrame using a dictionary of employee data and displays it, demonstrating how structured tabular data is organized in Pandas.

import pandas as pd

data = {
    "Department": ["IT", "HR", "IT", "HR", "Finance", "IT"],
    "Employee": ["Amit", "Sara", "John", "Riya", "Karan", "Neha"],
    "Salary": [50000, 60000, 55000, 65000, 70000, 52000],
    "Experience": [2, 4, 3, 5, 6, 2]
}

df = pd.DataFrame(data)
print(df)
  • Group by One Column

Grouping Data by Department using Pandas

This code groups the DataFrame by the Department column using Pandas groupby() method, creating grouped objects that allow you to perform aggregate operations like sum, mean, count, etc., on each department.

grouped = df.groupby("Department")
print(grouped)
  • This creates grouped data but does not calculate anything yet.

    Calculate Mean Salary per Department

df.groupby("Department")["Salary"].mean()
  • Calculate Total Salary per Department

df.groupby("Department")["Salary"].sum()
  • Count Employees per Department

df.groupby("Department")["Employee"].count()
  • Aggregation Functions

    Aggregation functions calculate summary values for groups.

    Common Aggregation Functions

    Function

    Purpose

    sum()

    Total value

    mean()

    Average

    count()

    Number of entries

    min()

    Minimum value

    max()

    Maximum value

    median()

    Middle value

    std()

    Standard deviation

    var()

    Variance

    Multiple Aggregations

df.groupby("Department")["Salary"].agg(["sum", "mean", "max", "min"])
  • Aggregation on Multiple Columns

df.groupby("Department")[["Salary", "Experience"]].mean()
  • Custom Named Aggregations

df.groupby("Department").agg(
    Total_Salary=("Salary", "sum"),
    Avg_Salary=("Salary", "mean"),
    Max_Experience=("Experience", "max")
)
  • Grouping by Multiple Columns

df.groupby(["Department", "Experience"])["Salary"].mean()
  • This creates hierarchical grouping.



    Resetting Index

    After groupby, grouped column becomes index.

result = df.groupby("Department")["Salary"].mean().reset_index()
print(result)
  • Sorting Aggregated Results

df.groupby("Department")["Salary"].mean().sort_values(ascending=False)
  • Real-World Example – Sales Analysis

Regional Sales Analysis using Pandas GroupBy

This code groups sales data by region to calculate total sales and average orders, helping analyze regional performance.

sales_data = {
    "Region": ["North", "South", "North", "West", "South"],
    "Sales": [20000, 15000, 25000, 30000, 18000],
    "Orders": [200, 150, 220, 300, 160]
}

sales_df = pd.DataFrame(sales_data)

# Total sales per region
print(sales_df.groupby("Region")["Sales"].sum())

# Average orders per region
print(sales_df.groupby("Region")["Orders"].mean())
Next