Merging & Joining

  • Learn how to combine multiple datasets using merge, join, and concatenate in Pandas.
  • merge()

    What is merge()?

    merge() is used to combine two DataFrames based on a common column, similar to SQL JOIN.

    It is the most powerful and flexible method for combining datasets.

    Example Dataset

Creating Employees and Departments DataFrames

This code creates two Pandas DataFrames — one for employees and one for departments — which can later be used to perform merge (join) operations based on the common dept_id column.

import pandas as pd

employees = pd.DataFrame({
    "emp_id": [1, 2, 3, 4],
    "name": ["Amit", "Sara", "John", "Riya"],
    "dept_id": [101, 102, 101, 103]
})

departments = pd.DataFrame({
    "dept_id": [101, 102, 103],
    "department": ["IT", "HR", "Finance"]
})

print(employees)
print(departments)
  • Inner Join (Default)

    Returns only matching rows.

Performing Inner Join using Pandas

This code merges the employees and departments DataFrames using an inner join on dept_id, returning only the rows where matching department IDs exist in both tables.

result = pd.merge(employees, departments, on="dept_id")
print(result)
  • Left Join

    Returns all rows from left DataFrame.

Performing Left Join using Pandas

This code merges the employees and departments DataFrames using a left join on dept_id, returning all rows from the employees DataFrame and matching rows from departments. If no match is found, missing values will appear as NaN.

result = pd.merge(employees, departments, on="dept_id", how="left")
print(result)
  • Right Join

    Returns all rows from right DataFrame.

Performing Right Join using Pandas

This code merges the employees and departments DataFrames using a right join on dept_id, returning all rows from the departments DataFrame and matching rows from employees. If no match is found, missing values will appear as NaN.

result = pd.merge(employees, departments, on="dept_id", how="right")
print(result)
  • Outer Join

    Returns all rows from both DataFrames.

Performing Outer Join using Pandas

This code merges the employees and departments DataFrames using an outer join on dept_id, returning all rows from both DataFrames. If there is no matching value in either table, the missing fields will be filled with NaN.

result = pd.merge(employees, departments, on="dept_id", how="outer")
print(result)
  • Merge on Different Column Names

pd.merge(df1, df2, left_on="id1", right_on="id2")
  • join()

    What is join()?

    join() is used to combine DataFrames based on their index.

    It is simpler than merge but less flexible.

    Example

Combining DataFrames Using join()

This code demonstrates how to combine two DataFrames based on their index using join(), creating a single DataFrame with both employee names and salaries.

emp = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "name": ["Amit", "Sara", "John"]
}).set_index("emp_id")

salary = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "salary": [50000, 60000, 55000]
}).set_index("emp_id")

result = emp.join(salary)
print(result)
  • Join with how Parameter

emp.join(salary, how="inner")
emp.join(salary, how="outer")
  • concat()

    What is concat()?

    concat() is used to combine DataFrames along rows or columns.

    Unlike merge, it does NOT require a common column.

    Concatenate Rows (Default)

Concatenating DataFrames by Rows

This code uses pd.concat() to combine two DataFrames vertically (row-wise), stacking all rows from both DataFrames into a single DataFrame.

df1 = pd.DataFrame({
    "Name": ["Amit", "Sara"],
    "Age": [25, 30]
})

df2 = pd.DataFrame({
    "Name": ["John", "Riya"],
    "Age": [28, 32]
})

result = pd.concat([df1, df2])
print(result)
  •  Reset Index After Concat

result = pd.concat([df1, df2], ignore_index=True)
print(result)
  • Concatenate Columns

result = pd.concat([df1, df2], axis=1)
print(result)
  • Difference Between merge(), join(), concat()

    Feature

    merge()

    join()

    concat()

    Based On

    Column

    Index

    Axis

    SQL-style join

    Yes

    Limited

    No

    Combine Rows

    Yes

    No

    Yes

    Combine Columns

    Yes

    Yes

    Yes

    Most Flexible

    Yes

    Medium

    Simple stacking