Summary & Patterns
Learn to summarize datasets and detect meaningful patterns.
Summary Statistics
What are Summary Statistics?
Summary statistics are numerical values that describe key features of a dataset.
They help answer:
What is the average?
How spread out is the data?
What is the minimum and maximum value?
Important Summary Measures
Central Tendency
Mean → Average
Median → Middle value
Mode → Most frequent value
Variability
Range → Max − Min
Variance
Standard Deviation
Distribution Shape
Skewness
Kurtosis
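Each of the measures listed above can be computed directly with pandas. The following is a minimal sketch on a small made-up Series; the name scores and its values are assumptions chosen for illustration, not data from this lesson.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration
scores = pd.Series([2, 4, 4, 5, 7, 9, 11])

# Central tendency
mean_value = scores.mean()
median_value = scores.median()
mode_values = scores.mode().tolist()   # mode() can return several values

# Variability
value_range = scores.max() - scores.min()
variance = scores.var()    # sample variance (ddof=1) by default
std_dev = scores.std()     # sample standard deviation (ddof=1) by default

# Distribution shape
skewness = scores.skew()   # positive when the right tail is longer
kurtosis = scores.kurt()   # excess kurtosis (a normal distribution gives about 0)

print(mean_value, median_value, mode_values, value_range)
print(variance, std_dev, skewness, kurtosis)
```

Note that pandas divides by n − 1 when computing variance and standard deviation; pass ddof=0 to var() or std() if the population versions are wanted instead.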
Example in Python
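import pandas as pd
# Load the dataset (assumes data.csv is in the working directory)
df = pd.read_csv("data.csv")
# Summarize every numeric column
print(df.describe())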
For each numeric column, describe() gives:
count
mean
std
min
25%, 50%, 75% (the quartiles; 50% is the median)
max
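The quartiles reported by describe() also give a quick way to spot unusual values. The sketch below applies the common 1.5 × IQR rule of thumb, which is one convention among several and is not prescribed by this lesson. The column name amount and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one value (95) sits far from the rest
df = pd.DataFrame({"amount": [10, 12, 11, 13, 12, 14, 11, 95]})

# Pull the quartiles straight out of the describe() result
stats = df["amount"].describe()
q1, q3 = stats["25%"], stats["75%"]
iqr = q3 - q1   # interquartile range

# Rule of thumb: flag values more than 1.5 * IQR outside the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["amount"] < lower) | (df["amount"] > upper)]

print(outliers)
```

A flagged value is only a candidate for inspection; whether it is an error or a genuine extreme depends on the dataset.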
🎯 Why Do Summary Statistics Matter?
✔ Quickly understand a dataset
✔ Detect unusual values
✔ Compare different groups
✔ Prepare for modeling
Correlation Analysis
What is Correlation?
Correlation measures the strength and direction of the linear relationship between two variables.
It tells us:
Do they increase together?
Does one increase while the other decreases?
Or is there no relationship?
Correlation Value Range
$-1 \leq r \leq 1$
Example
Study Hours vs Marks
If study hours increase and marks increase → Positive correlation
If one increases and the other decreases → Negative correlation
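To make the study-hours example concrete, the sketch below computes the Pearson correlation coefficient r from its definition and checks it against pandas. The numbers are invented for illustration, and the variable names hours and marks are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration
hours = pd.Series([1, 2, 3, 4, 5, 6])
marks = pd.Series([35, 45, 50, 62, 70, 81])

# Pearson r written out with deviations from the mean:
# sum of products of deviations, divided by the square root of the
# product of the two sums of squared deviations
dx = hours - hours.mean()
dy = marks - marks.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity (Pearson is the default method)
r_pandas = hours.corr(marks)

# Flipping the sign of one variable reverses the trend and the sign of r
r_negative = hours.corr(-marks)

print(round(r_manual, 4), round(r_pandas, 4), round(r_negative, 4))
```

Because marks rise almost in a straight line with hours here, r comes out close to +1; the flipped series gives the same magnitude with a negative sign.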
Calculate Correlation in Python
# numeric_only=True skips non-numeric columns (available in pandas 1.5+)
df.corr(numeric_only=True)
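The result is a square matrix with one row and one column per numeric variable. Here is a minimal sketch of reading it, using a small hypothetical DataFrame; the column names and values are assumptions for illustration. It passes numeric_only=True so that the text column is skipped, since recent pandas versions raise an error on non-numeric columns otherwise.

```python
import pandas as pd

# Hypothetical data; column names are assumptions for illustration
df = pd.DataFrame({
    "student": ["A", "B", "C", "D", "E"],
    "study_hours": [1, 2, 3, 4, 5],
    "marks": [40, 50, 55, 70, 80],
    "absences": [9, 7, 6, 3, 1],
})

# numeric_only=True leaves the text column "student" out of the matrix
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix.round(2))

# Look up the coefficient for a single pair of variables
r_hours_marks = corr_matrix.loc["study_hours", "marks"]
r_hours_absences = corr_matrix.loc["study_hours", "absences"]
```

The diagonal is always 1 (each variable against itself), and the matrix is symmetric, so each pair appears twice.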
Visualizing Correlation with Heatmap
This code uses Seaborn to create a heatmap of the correlation matrix, visually showing relationships between numerical variables in the dataset.
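import seaborn as sns
import matplotlib.pyplot as plt
# Draw the correlation matrix as a heatmap; annot=True prints each coefficient in its cell
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()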
Real-World Examples
Banking
Income vs Loan Amount → Positive correlation
E-commerce
Discount vs Profit → Possibly negative correlation
Education
Attendance vs Marks → Positive correlation