Data Aggregation
-
Learn how to summarize and aggregate data using groupby and statistical functions.
groupby()
What is groupby()?
groupby() is used to:
Split the data into groups
Apply a function to each group
Combine the results
This concept follows the Split → Apply → Combine strategy.
Sample Dataset
Creating a Pandas DataFrame from Dictionary
This code creates a Pandas DataFrame using a dictionary of employee data and displays it, demonstrating how structured tabular data is organized in Pandas.
import pandas as pd
data = {
"Department": ["IT", "HR", "IT", "HR", "Finance", "IT"],
"Employee": ["Amit", "Sara", "John", "Riya", "Karan", "Neha"],
"Salary": [50000, 60000, 55000, 65000, 70000, 52000],
"Experience": [2, 4, 3, 5, 6, 2]
}
df = pd.DataFrame(data)
print(df)
Group by One Column
Grouping Data by Department using Pandas
This code groups the DataFrame by the Department column using Pandas groupby() method, creating grouped objects that allow you to perform aggregate operations like sum, mean, count, etc., on each department.
grouped = df.groupby("Department")
print(grouped)
This creates grouped data but does not calculate anything yet.
Calculate Mean Salary per Department
df.groupby("Department")["Salary"].mean()
Calculate Total Salary per Department
df.groupby("Department")["Salary"].sum()
Count Employees per Department
df.groupby("Department")["Employee"].count()
Aggregation Functions
Aggregation functions calculate summary values for groups.
Common Aggregation Functions
Multiple Aggregations
df.groupby("Department")["Salary"].agg(["sum", "mean", "max", "min"])
- Aggregation on Multiple Columns
df.groupby("Department")[["Salary", "Experience"]].mean()
- Custom Named Aggregations
df.groupby("Department").agg(
Total_Salary=("Salary", "sum"),
Avg_Salary=("Salary", "mean"),
Max_Experience=("Experience", "max")
)
- Grouping by Multiple Columns
df.groupby(["Department", "Experience"])["Salary"].mean()
This creates hierarchical grouping.
Resetting Index
After groupby, grouped column becomes index.
result = df.groupby("Department")["Salary"].mean().reset_index()
print(result)
- Sorting Aggregated Results
df.groupby("Department")["Salary"].mean().sort_values(ascending=False)
- Real-World Example – Sales Analysis
Regional Sales Analysis using Pandas GroupBy
This code groups sales data by region to calculate total sales and average orders, helping analyze regional performance.
sales_data = {
"Region": ["North", "South", "North", "West", "South"],
"Sales": [20000, 15000, 25000, 30000, 18000],
"Orders": [200, 150, 220, 300, 160]
}
sales_df = pd.DataFrame(sales_data)
# Total sales per region
print(sales_df.groupby("Region")["Sales"].sum())
# Average orders per region
print(sales_df.groupby("Region")["Orders"].mean())