Data Issues

  • Identify and resolve common data quality issues in datasets.
  • What is an Outlier?

    An outlier is a data point that differs markedly from the other values in the dataset.

    Example:

    Marks of students:
    50, 55, 60, 58, 57, 200

    200 is an outlier: every other mark lies between 50 and 60.


    Why Are Outliers Important?

    ✔ Can distort the mean and standard deviation
    ✔ Can reduce model accuracy
    ✔ May indicate a data entry error
    ✔ Sometimes represent important rare events
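
The effect on the mean and standard deviation can be checked on the student marks from the example above. This is a minimal sketch using Python's standard `statistics` module. The variable names are illustrative, and the population standard deviation (`pstdev`) is an assumption, since the text does not say which variant is meant.

```python
import statistics

marks = [50, 55, 60, 58, 57, 200]             # marks from the example above
marks_clean = [m for m in marks if m != 200]  # same marks without the outlier

mean_with = statistics.mean(marks)
mean_without = statistics.mean(marks_clean)

# Population standard deviation (assumption: the text does not specify which)
sd_with = statistics.pstdev(marks)
sd_without = statistics.pstdev(marks_clean)

print("Mean with / without 200:", mean_with, mean_without)
print("SD with / without 200:", round(sd_with, 2), round(sd_without, 2))
```

A single extreme mark pulls the mean from 56 up to 80, which is higher than every other mark in the list.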



    Methods of Outlier Detection


    1. Using IQR (Interquartile Range) Method

    Formula:

    IQR = Q3 − Q1

    Lower Bound:

    Q1 − 1.5 × IQR

    Upper Bound:

    Q3 + 1.5 × IQR

    Any value below the Lower Bound or above the Upper Bound is an outlier.


    Example:

Outlier Detection using IQR (Interquartile Range) Method

The IQR method identifies outliers by calculating the range between the first quartile (Q1) and third quartile (Q3). Any value below the lower bound or above the upper bound is considered an outlier.

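import pandas as pd

# Sample data with one obviously extreme value
data = [10, 12, 14, 15, 18, 100]
df = pd.DataFrame(data, columns=["values"])

# First and third quartiles, and the range between them
Q1 = df["values"].quantile(0.25)
Q3 = df["values"].quantile(0.75)
IQR = Q3 - Q1

# Bounds beyond which a value counts as an outlier
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Select the rows that fall outside [lower, upper]
outliers = df[(df["values"] < lower) | (df["values"] > upper)]
print(outliers)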
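
The same bounds can be wrapped in a small reusable helper. The sketch below is one possible way to do it, applied to the student marks from the first example. The function name `iqr_bounds` is hypothetical, and `np.percentile` with its default linear interpolation is assumed to be an acceptable way to obtain Q1 and Q3 (other quartile conventions give slightly different bounds).

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds of the IQR rule; k=1.5 follows the formula above."""
    q1, q3 = np.percentile(values, [25, 75])  # default: linear interpolation
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

marks = np.array([50, 55, 60, 58, 57, 200])
lower, upper = iqr_bounds(marks)

# Keep only the values that fall outside [lower, upper]
outliers = marks[(marks < lower) | (marks > upper)]
print(lower, upper, outliers)
```

For these marks, Q1 = 55.5 and Q3 = 59.5, so the bounds are 49.5 and 65.5 and only 200 falls outside them.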

    2. Using Z-Score Method

    The Z-score measures how many standard deviations a value lies from the mean.

    Z = (X − Mean) / SD

    If:

    • |Z| > 3 → Possible outlier


    Example:

Outlier Detection using Z-Score Method

The Z-score method measures how many standard deviations a value is away from the mean.

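import numpy as np
from scipy import stats

data = np.array([10, 12, 14, 15, 18, 100])

# Absolute Z-score of every value (population standard deviation, ddof=0)
z_scores = np.abs(stats.zscore(data))
print(z_scores)

# With only 6 points, |Z| can never exceed sqrt(5) ≈ 2.24, so the usual
# cutoff of 3 flags nothing here; a cutoff of 2 is used for this small sample.
threshold = 2
print("Outliers:", data[z_scores > threshold])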
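
The formula Z = (X − Mean) / SD can also be applied directly with NumPy. The |Z| > 3 rule needs a reasonably large sample: with n values and the population SD, no |Z| can exceed √(n − 1), so a sample of 10 or fewer values can never reach the cutoff. The sketch below therefore assumes a hypothetical sample of 20 marks, invented purely for illustration.

```python
import numpy as np

# Hypothetical sample of 20 marks, invented for illustration only
marks = np.array([50, 52, 55, 57, 58, 60, 61, 62, 63, 65,
                  66, 68, 70, 71, 72, 74, 75, 77, 78, 200])

mean = marks.mean()
sd = marks.std()            # population SD (ddof=0), the default of scipy.stats.zscore

z = (marks - mean) / sd     # Z = (X - Mean) / SD for every value
outliers = marks[np.abs(z) > 3]
print("Outliers:", outliers)
```

Only the value 200 has |Z| above 3 in this sample; every other mark stays within one standard deviation of the mean.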

    3. Using Boxplot (Visual Method)

Outlier Detection using Boxplot (Visual Method)

A boxplot is a visual technique used to detect outliers. It is based on the IQR method.

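import seaborn as sns
import matplotlib.pyplot as plt

data = [10, 12, 14, 15, 18, 100]

# Draw a horizontal boxplot; points beyond the whiskers are drawn individually
sns.boxplot(x=data)
plt.show()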
  • Points plotted beyond the whiskers are outliers. By default the whiskers reach at most 1.5 × IQR beyond the box.
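
The points a boxplot draws beyond its whiskers can also be listed without opening a plot window. This is a rough sketch using `boxplot_stats` from `matplotlib.cbook`, which computes the statistics behind a Matplotlib boxplot. Using it here is an assumption made for illustration, not something the plotting code above calls explicitly.

```python
from matplotlib.cbook import boxplot_stats

data = [10, 12, 14, 15, 18, 100]

# boxplot_stats returns one dict per dataset; whis=1.5 is the IQR multiplier
stats = boxplot_stats(data, whis=1.5)[0]

print("Whiskers:", stats["whislo"], stats["whishi"])
print("Points beyond the whiskers:", stats["fliers"])
```

The whiskers stop at the most extreme data values that are still inside the 1.5 × IQR bounds, which is why the boxplot and the IQR method flag the same points.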