K-Nearest Neighbors (KNN)

  • K-Nearest Neighbors (KNN) is a simple machine learning algorithm that classifies data points based on the majority class of their nearest neighbors.
  • Distance Metrics

    KNN works based on distance calculation between data points.

    The most common distance metrics are:

    1. Euclidean Distance (Most Common)

    d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}

    Used when data is continuous and properly scaled.

    2. Manhattan Distance

    d = |x_1 - y_1| + |x_2 - y_2|

    Used when features represent grid-like distance.

    3. Minkowski Distance

    Generalized version:

    d = \left( \sum |x_i - y_i|^p \right)^{1/p}

    • p = 1 → Manhattan

    • p = 2 → Euclidean
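The three metrics above can be computed directly; a minimal NumPy sketch with two hypothetical 2-D points:

```python
import numpy as np

# Hypothetical points, chosen only for illustration
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # p = 2
manhattan = np.sum(np.abs(x - y))          # p = 1

def minkowski(a, b, p):
    """Generalized Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

print(euclidean)           # 5.0
print(manhattan)           # 7.0
print(minkowski(x, y, 2))  # 5.0, same as Euclidean
```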

    Important:

    Always scale data before using KNN because distance is sensitive to feature magnitude.

    Use:

    • StandardScaler

    • MinMaxScaler
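A quick sketch of both scalers on a hypothetical [study hours, attendance] matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical feature matrix: [study hours, attendance %]
X = np.array([[2.0, 50.0],
              [3.0, 60.0],
              [5.0, 80.0]])

std = StandardScaler().fit_transform(X)  # each column: mean 0, std 1
mm = MinMaxScaler().fit_transform(X)     # each column mapped into [0, 1]

print(std.mean(axis=0))                # ~[0, 0]
print(mm.min(axis=0), mm.max(axis=0))  # [0, 0] and [1, 1]
```

Without this step, the attendance column (tens) would dominate the study-hours column (single digits) in every distance calculation.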


    Choosing K Value

    K = Number of nearest neighbors.

    Small K (e.g., K=1)

    • Low bias

    • High variance

    • Risk of overfitting

    Large K (e.g., K=20)

    • High bias

    • Low variance

    • Risk of underfitting

    Best Practice

    • Choose an odd K (avoids ties in binary classification)

    • Use cross-validation to find best K
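A cross-validation sweep over odd K values might look like this sketch, using scikit-learn's built-in iris dataset purely as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Score each odd K with 5-fold cross-validation; scaling lives inside
# the pipeline so each fold is scaled using only its own training data.
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(pipe, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best K:", best_k)
```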

    Example

    If K = 3:

    Among nearest 3 neighbors:

    • 2 are Class A

    • 1 is Class B

    Final prediction → Class A
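The majority vote above is a one-liner with `collections.Counter`:

```python
from collections import Counter

# Labels of the 3 nearest neighbors, as in the example above
neighbor_labels = ["A", "A", "B"]

# most_common(1) returns [(label, count)] for the winning class
prediction = Counter(neighbor_labels).most_common(1)[0][0]
print(prediction)  # A
```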


    Lazy Learning

    KNN is called a Lazy Learning Algorithm because:

    • It does NOT build a model during training.

    • It stores the entire dataset.

    • Computation happens only at prediction time.

    Why “Lazy”?

    Training phase:

    • Just store data.

    Prediction phase:

    • Calculate distance to all points.

    • Sort them.

    • Select K nearest.

    So prediction is slower.
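The prediction-time steps above can be sketched from scratch (plain Euclidean distance on unscaled features, for brevity; the query point is hypothetical):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # "Training" already happened: X_train and y_train were simply stored.
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # distance to ALL points
    nearest = np.argsort(distances)[:k]                          # sort, keep K nearest
    return Counter(y_train[nearest]).most_common(1)[0][0]        # majority vote

X_train = np.array([[2, 50], [3, 60], [5, 80], [6, 90], [1, 40]])
y_train = np.array([0, 0, 1, 1, 0])  # 0 = Fail, 1 = Pass

print(knn_predict(X_train, y_train, np.array([4, 65])))  # → 0
```

All the work (distance to every stored point, sorting, voting) happens inside `knn_predict`, which is exactly why prediction, not training, is the expensive phase.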


    Advantages & Limitations

    Advantages

    1. Simple and easy to understand

    2. No training phase required

    3. Works well with small datasets

    4. Can handle multi-class problems

    Limitations

    1. Slow prediction (computes distance to all points)

    2. Sensitive to irrelevant features

    3. Sensitive to feature scaling

    4. Poor performance on large datasets

    5. Curse of dimensionality (many features reduce accuracy)

    Example: Pass/Fail Classification

The following Python example uses KNN to predict whether a student will pass or fail based on study hours and attendance. It creates a small dataset, splits it into training and testing sets, scales the features with StandardScaler, trains a KNN model with K=3, and evaluates it by predicting on the test set and computing accuracy.

# Step 1: Import Libraries
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Step 2: Create Dataset
X = np.array([
    [2, 50],
    [3, 60],
    [5, 80],
    [6, 90],
    [1, 40]
])  # [Study Hours, Attendance]

y = np.array([0, 0, 1, 1, 0])  # 0 = Fail, 1 = Pass

# Step 3: Split Data (fixed random_state for reproducibility; 1 of 5 samples is held out)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 5: Create Model (K=3)
model = KNeighborsClassifier(n_neighbors=3)

# Step 6: Train Model
model.fit(X_train, y_train)

# Step 7: Predict
prediction = model.predict(X_test)

print("Prediction:", prediction)
print("Accuracy:", model.score(X_test, y_test))
  • Output:

    The script prints the predicted class for the held-out sample and the model's accuracy on the test set; the exact values depend on which sample the split assigns to the test set. Note that a class probability (e.g. "Probability of Passing") would come from `model.predict_proba(X_test)`, not `predict`.

    Bias-Variance in KNN

    K Value  | Bias | Variance
    ---------|------|---------
    Small K  | Low  | High
    Large K  | High | Low
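The trade-off in the table can be seen empirically; a sketch on the iris dataset (chosen here only for convenience) comparing train vs. test accuracy for a small and a large K:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for k in [1, 25]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    results[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"K={k}: train={results[k][0]:.3f}, test={results[k][1]:.3f}")

# K=1 memorizes the training set (train accuracy 1.0) — low bias, high variance.
# K=25 smooths the decision boundary, trading some training fit for stability.
```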