Merging & Joining
-
Learn how to combine multiple datasets using merge, join, and concatenate in Pandas.
merge()
What is merge()?
merge() is used to combine two DataFrames based on a common column, similar to SQL JOIN.
It is the most powerful and flexible method for combining datasets.
Example Dataset
Creating Employees and Departments DataFrames
This code creates two Pandas DataFrames — one for employees and one for departments — which can later be used to perform merge (join) operations based on the common dept_id column.
import pandas as pd
employees = pd.DataFrame({
"emp_id": [1, 2, 3, 4],
"name": ["Amit", "Sara", "John", "Riya"],
"dept_id": [101, 102, 101, 103]
})
departments = pd.DataFrame({
"dept_id": [101, 102, 103],
"department": ["IT", "HR", "Finance"]
})
print(employees)
print(departments)
Inner Join (Default)
Returns only matching rows.
Performing Inner Join using Pandas
This code merges the employees and departments DataFrames using an inner join on dept_id, returning only the rows where matching department IDs exist in both tables.
result = pd.merge(employees, departments, on="dept_id")
print(result)
Left Join
Returns all rows from left DataFrame.
Performing Left Join using Pandas
This code merges the employees and departments DataFrames using a left join on dept_id, returning all rows from the employees DataFrame and matching rows from departments. If no match is found, missing values will appear as NaN.
result = pd.merge(employees, departments, on="dept_id", how="left")
print(result)
Right Join
Returns all rows from right DataFrame.
Performing Right Join using Pandas
This code merges the employees and departments DataFrames using a right join on dept_id, returning all rows from the departments DataFrame and matching rows from employees. If no match is found, missing values will appear as NaN.
result = pd.merge(employees, departments, on="dept_id", how="right")
print(result)
Outer Join
Returns all rows from both DataFrames.
Performing Outer Join using Pandas
This code merges the employees and departments DataFrames using an outer join on dept_id, returning all rows from both DataFrames. If there is no matching value in either table, the missing fields will be filled with NaN.
result = pd.merge(employees, departments, on="dept_id", how="outer")
print(result)
Merge on Different Column Names
pd.merge(df1, df2, left_on="id1", right_on="id2")
join()
What is join()?
join() is used to combine DataFrames based on their index.
It is simpler than merge but less flexible.
Example
Combining DataFrames Using join()
This code demonstrates how to combine two DataFrames based on their index using join(), creating a single DataFrame with both employee names and salaries.
emp = pd.DataFrame({
"emp_id": [1, 2, 3],
"name": ["Amit", "Sara", "John"]
}).set_index("emp_id")
salary = pd.DataFrame({
"emp_id": [1, 2, 3],
"salary": [50000, 60000, 55000]
}).set_index("emp_id")
result = emp.join(salary)
print(result)
- Join with how Parameter
emp.join(salary, how="inner")
emp.join(salary, how="outer")
concat()
What is concat()?
concat() is used to combine DataFrames along rows or columns.
Unlike merge, it does NOT require a common column.
Concatenate Rows (Default)
Concatenating DataFrames by Rows
This code uses pd.concat() to combine two DataFrames vertically (row-wise), stacking all rows from both DataFrames into a single DataFrame.
df1 = pd.DataFrame({
"Name": ["Amit", "Sara"],
"Age": [25, 30]
})
df2 = pd.DataFrame({
"Name": ["John", "Riya"],
"Age": [28, 32]
})
result = pd.concat([df1, df2])
print(result)
- Reset Index After Concat
result = pd.concat([df1, df2], ignore_index=True)
print(result)
- Concatenate Columns
result = pd.concat([df1, df2], axis=1)
print(result)
Difference Between merge(), join(), concat()