K-Nearest Neighbors (KNN)
- K-Nearest Neighbors (KNN) is a simple machine learning algorithm that classifies data points based on the majority class of their nearest neighbors.
Distance Metrics
KNN classifies a point by computing its distance to every other data point.
The most common distance metrics are:
1. Euclidean Distance (Most Common)
d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}
Used when data is continuous and properly scaled.
2. Manhattan Distance
d = |x_1 - y_1| + |x_2 - y_2|
Used when features represent grid-like distance.
3. Minkowski Distance
Generalized version:
d = \left( \sum_i |x_i - y_i|^p \right)^{1/p}
p = 1 → Manhattan
p = 2 → Euclidean
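The three metrics can be checked numerically; a minimal sketch (the two sample points below are made up for illustration):

```python
import numpy as np

# Two hypothetical points: [study hours, attendance %]
a = np.array([2.0, 50.0])
b = np.array([5.0, 80.0])

def minkowski(u, v, p):
    return np.sum(np.abs(u - v) ** p) ** (1 / p)

euclidean = np.sqrt(np.sum((a - b) ** 2))  # p = 2
manhattan = np.sum(np.abs(a - b))          # p = 1

print("Euclidean:", euclidean)               # ≈ 30.15
print("Manhattan:", manhattan)               # 33.0
print("Minkowski p=2:", minkowski(a, b, 2))  # matches Euclidean
```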
Important:
Always scale data before using KNN because distance is sensitive to feature magnitude.
Use:
StandardScaler
MinMaxScaler
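A quick sketch of what each scaler does (the toy feature matrix is made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Made-up feature matrix: [study hours, attendance %]
X = np.array([[2.0, 50.0],
              [3.0, 60.0],
              [5.0, 80.0]])

X_std = StandardScaler().fit_transform(X)  # each column: mean 0, std 1
X_mm = MinMaxScaler().fit_transform(X)     # each column: rescaled into [0, 1]

print(X_std.mean(axis=0))                  # ~[0, 0]
print(X_mm.min(axis=0), X_mm.max(axis=0))  # [0, 0] [1, 1]
```

Without this step, the attendance column (tens of units) would dominate the study-hours column (single units) in every distance calculation.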
Choosing K Value
K = Number of nearest neighbors.
Small K (e.g., K=1)
Low bias
High variance
Risk of overfitting
Large K (e.g., K=20)
High bias
Low variance
Risk of underfitting
Best Practice
Choose an odd K (avoids ties in binary classification)
Use cross-validation to find best K
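A sketch of the cross-validation search, using scikit-learn's built-in Iris dataset as stand-in data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Try odd K values only; scaling sits inside the pipeline so each
# CV fold is scaled with its own training statistics (no leakage).
scores = {}
for k in range(1, 16, 2):
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(pipe, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best K:", best_k, "CV accuracy:", scores[best_k])
```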
Example
If K = 3:
Among nearest 3 neighbors:
2 are Class A
1 is Class B
Final prediction → Class A
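The majority vote itself is one line with collections.Counter (the labels below are the hypothetical neighbors from the example):

```python
from collections import Counter

neighbor_labels = ["A", "A", "B"]  # classes of the 3 nearest neighbors
prediction = Counter(neighbor_labels).most_common(1)[0][0]
print(prediction)  # A
```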
Lazy Learning
KNN is called a Lazy Learning Algorithm because:
It does NOT build a model during training.
It stores the entire dataset.
Computation happens only at prediction time.
Why “Lazy”?
Training phase:
Just store data.
Prediction phase:
Calculate distance to all points.
Sort them.
Select K nearest.
So prediction is slower.
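The prediction-time steps above can be sketched from scratch (Euclidean distance, scaling skipped for brevity, toy data made up for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # "Training" was just storing X_train / y_train; all the work happens here.
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))      # distance to every stored point
    nearest = np.argsort(dists)[:k]                        # indices of the K closest
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote

X_train = np.array([[2, 50], [3, 60], [5, 80], [6, 90], [1, 40]], dtype=float)
y_train = np.array([0, 0, 1, 1, 0])  # 0 = Fail, 1 = Pass

print(knn_predict(X_train, y_train, np.array([5.5, 85.0])))  # 1
```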
Advantages & Limitations
Advantages
Simple and easy to understand
No training phase required
Works well with small datasets
Can handle multi-class problems
Limitations
Slow prediction (computes distance to all points)
Sensitive to irrelevant features
Sensitive to feature scaling
Poor performance on large datasets
Curse of dimensionality (many features reduce accuracy)
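The curse of dimensionality can be demonstrated directly: in high dimensions, distances from a query to random points concentrate, so "nearest" becomes nearly meaningless. A small sketch (dimensions and sample counts chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for d in (2, 1000):
    X = rng.random((500, d))  # 500 random points in the unit hypercube
    q = rng.random(d)         # a random query point
    dists = np.sqrt(((X - q) ** 2).sum(axis=1))
    ratios[d] = dists.max() / dists.min()  # farthest / nearest

# In 2-D the nearest point is much closer than the farthest;
# in 1000-D the ratio collapses toward 1.
print(ratios)
```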
Example: Pass/Fail Classification
This Python example demonstrates how to use the K-Nearest Neighbors (KNN) algorithm to predict whether a student will pass or fail based on study hours and attendance. The code creates a dataset, splits it into training and testing sets, applies feature scaling using StandardScaler, trains a KNN model with K=3, and evaluates the model by making predictions and calculating accuracy.
# Step 1: Import Libraries
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Step 2: Create Dataset
X = np.array([
[2, 50],
[3, 60],
[5, 80],
[6, 90],
[1, 40]
]) # [Study Hours, Attendance]
y = np.array([0, 0, 1, 1, 0]) # 0 = Fail, 1 = Pass
# Step 3: Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Step 4: Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Step 5: Create Model (K=3)
model = KNeighborsClassifier(n_neighbors=3)
# Step 6: Train Model
model.fit(X_train, y_train)
# Step 7: Predict
prediction = model.predict(X_test)
print("Prediction:", prediction)
print("Accuracy:", model.score(X_test, y_test))
Output:
The script prints the predicted label for the held-out sample and the test accuracy, for example Prediction: [0] and Accuracy: 1.0. Because train_test_split shuffles randomly, the exact output varies between runs; pass a fixed random_state to make it reproducible.
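To get a pass probability instead of a hard class label, KNeighborsClassifier also exposes predict_proba, which reports the fraction of the K neighbors in each class. A sketch on the same toy data (the new student's values are made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X = np.array([[2, 50], [3, 60], [5, 80], [6, 90], [1, 40]], dtype=float)
y = np.array([0, 0, 1, 1, 0])  # 0 = Fail, 1 = Pass

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = KNeighborsClassifier(n_neighbors=3).fit(X_scaled, y)

new_student = scaler.transform([[4, 70]])  # hypothetical: 4 study hours, 70% attendance
proba = model.predict_proba(new_student)[0]
print("Probability of passing:", proba[1])  # fraction of the 3 neighbors labeled 1
```

Note that with K=3 the reported probability is always a multiple of 1/3, since it is just a neighbor count.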
Bias-Variance in KNN