Data Issues
-
Identify and resolve common data quality issues in datasets.
What is an Outlier?
An outlier is a data point that differs markedly from the other values in the dataset.
Example:
Marks of students:
50, 55, 60, 58, 57, 200
200 is an outlier.
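To see how much a single extreme value can distort summary statistics, the marks above can be summarised with and without the value 200. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the original example.

```python
import statistics

marks = [50, 55, 60, 58, 57, 200]
typical = [m for m in marks if m != 200]  # the same marks without the extreme value

# The mean is pulled strongly towards the extreme value.
mean_with = statistics.mean(marks)        # 80
mean_without = statistics.mean(typical)   # 56

# The median barely moves, which is why it is often called robust.
median_with = statistics.median(marks)        # 57.5
median_without = statistics.median(typical)   # 57

# Population standard deviation with and without the extreme value.
sd_with = statistics.pstdev(marks)
sd_without = statistics.pstdev(typical)

print("mean:  ", mean_with, "vs", mean_without)
print("median:", median_with, "vs", median_without)
print("SD:    ", round(sd_with, 2), "vs", round(sd_without, 2))
```

One value out of six moves the mean from 56 to 80, while the median changes only from 57 to 57.5.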
Why Are Outliers Important?
✔ Can affect mean & standard deviation
✔ Can reduce model accuracy
✔ May indicate data entry error
✔ Sometimes represent important rare events
Methods of Outlier Detection
1. Using IQR (Interquartile Range) Method
Formula:
$IQR = Q3 - Q1$
Lower Bound:
$Q1 - 1.5 \times IQR$
Upper Bound:
$Q3 + 1.5 \times IQR$
Any value outside this range = Outlier
Example:
Outlier Detection using IQR (Interquartile Range) Method
The IQR method identifies outliers by calculating the range between the first quartile (Q1) and third quartile (Q3). Any value below the lower bound or above the upper bound is considered an outlier.
import pandas as pd

data = [10, 12, 14, 15, 18, 100]
df = pd.DataFrame(data, columns=["values"])

# First and third quartiles
Q1 = df["values"].quantile(0.25)
Q3 = df["values"].quantile(0.75)
IQR = Q3 - Q1

# Bounds at 1.5 * IQR below Q1 and above Q3
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Keep only the rows that fall outside the bounds
outliers = df[(df["values"] < lower) | (df["values"] > upper)]
print(outliers)
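The same steps can be wrapped in a small reusable function. The sketch below is one possible arrangement, not a standard API: the name `iqr_bounds` and the parameter `k` are hypothetical, and the quartiles use the default linear interpolation of pandas, so other quartile conventions may give slightly different bounds. It is applied to the student marks from the first example.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds; k is the IQR multiplier (1.5 in the formula)."""
    s = pd.Series(values, dtype="float64")
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

marks = [50, 55, 60, 58, 57, 200]
lower, upper = iqr_bounds(marks)

# Any mark outside [lower, upper] is flagged.
flagged = [m for m in marks if m < lower or m > upper]

print("bounds:", lower, upper)   # 49.5 and 65.5 with pandas' default quantiles
print("flagged:", flagged)       # [200]
```

Keeping the multiplier as a parameter makes it easy to try a stricter or looser rule without rewriting the calculation.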
2. Using Z-Score Method
A Z-score measures how many standard deviations a value is from the mean.
$Z = \frac{X - \text{Mean}}{\text{SD}}$
If:
|Z| > 3 → Possible outlier
Example:
Outlier Detection using Z-Score Method
The Z-score method measures how many standard deviations a value is away from the mean.
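import numpy as np
from scipy import stats

data = np.array([10, 12, 14, 15, 18, 100])

# Absolute Z-score of every value (population SD, ddof=0)
z_scores = np.abs(stats.zscore(data))
print(z_scores)

# |Z| > 3 is the usual cutoff, but six values cannot reach it
# (the largest |Z| here is about 2.23), so 2 is used for this small sample.
threshold = 2
print("Outliers:", data[z_scores > threshold])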
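The formula can also be applied by hand with NumPy, which makes the role of the mean and SD explicit. The sketch below also illustrates a limitation of the method on small samples: with the population SD, no |Z| among n values can exceed sqrt(n - 1), so a sample of six values can never cross the |Z| > 3 cutoff. The variable names are illustrative.

```python
import numpy as np

data = np.array([10, 12, 14, 15, 18, 100], dtype=float)

mean = data.mean()
sd = data.std()              # population SD (ddof=0), the default of scipy.stats.zscore too
z = (data - mean) / sd       # Z = (X - Mean) / SD for every value

n = len(data)
max_possible = np.sqrt(n - 1)    # upper limit on |Z| for n values with population SD

print("Z-scores:", np.round(z, 2))
print("largest |Z|:", round(float(np.abs(z).max()), 2))
print("limit for n = 6:", round(float(max_possible), 2))
print("any |Z| > 3?", bool((np.abs(z) > 3).any()))
```

Here the value 100 has a Z-score of about 2.23, just under the limit of about 2.24, so the |Z| > 3 rule flags nothing. On small datasets the IQR method or a lower cutoff may be more informative.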
3. Using Boxplot (Visual Method)
Outlier Detection using Boxplot (Visual Method)
A boxplot is a visual technique used to detect outliers. It is based on the IQR method.
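import seaborn as sns
import matplotlib.pyplot as plt

data = [10, 12, 14, 15, 18, 100]  # same sample as in the earlier examples

# Draw a single horizontal box for the sample
sns.boxplot(x=data)
plt.show()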
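Points drawn beyond the whiskers are treated as outliers. By default, the whiskers extend to the most extreme values that lie within 1.5 × IQR of the box.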
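The values a boxplot would draw as individual points can also be read off numerically, without rendering a figure. The sketch below uses Matplotlib's `matplotlib.cbook.boxplot_stats` helper with `whis=1.5`; it is assumed here that this matches the default whisker rule of the plot in use.

```python
from matplotlib.cbook import boxplot_stats

data = [10, 12, 14, 15, 18, 100]

# boxplot_stats returns one dictionary of statistics per dataset.
box = boxplot_stats(data, whis=1.5)[0]

whisker_low = box["whislo"]     # smallest value still inside the lower bound
whisker_high = box["whishi"]    # largest value still inside the upper bound
fliers = list(box["fliers"])    # values beyond the whiskers

print("quartiles:", box["q1"], box["q3"])
print("whiskers:", whisker_low, whisker_high)
print("fliers:", fliers)
```

For this sample the whiskers end at 10 and 18, and 100 is the only value reported as a flier, which agrees with the IQR calculation.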