DBSCAN Clustering

  • DBSCAN is a density-based clustering algorithm that groups closely packed data points and identifies noise or outliers in datasets.

    DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised learning algorithm used for clustering based on density.
    It is especially useful for arbitrary-shaped clusters and handling outliers.


    Density-Based Clustering

    • Clusters are formed where points are densely packed together

    • Sparse regions are considered noise or outliers

    • Unlike K-Means, DBSCAN does not require the number of clusters to be specified in advance


    Epsilon (ε)

    • ε defines the radius of a neighborhood around a point

    • Points within this distance are considered neighbors
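The neighborhood test described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library routine: the helper name region_query, the Euclidean distance metric, and the sample points are all assumptions made for the example.

```python
import math

def region_query(points, index, eps):
    """Return the indices of all points within distance eps of points[index].

    The point itself is included, because its distance to itself is 0.
    """
    center = points[index]
    return [j for j, p in enumerate(points) if math.dist(center, p) <= eps]

# Hypothetical 2-D sample data
points = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)]

print(region_query(points, 0, eps=3))  # [0, 1, 2]
print(region_query(points, 5, eps=3))  # [5] (only the point itself)
```

A brute-force scan like this costs one distance computation per point for every query; it is kept simple here for readability.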


    Minimum Points (MinPts)

    • MinPts = minimum number of points required to form a dense region (cluster)

    • Helps distinguish core points from border points


    Core, Border, and Noise Points

    Point Type       Definition
    Core Point       Has ≥ MinPts points in its ε-neighborhood
    Border Point     Has < MinPts neighbors but lies in the ε-neighborhood of a core point
    Noise/Outlier    Neither a core point nor a border point
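The three point types can be computed directly from their definitions. The sketch below is only illustrative: classify_points is a hypothetical helper, the data is made up, and it assumes that a point counts as a member of its own ε-neighborhood (the convention scikit-learn uses for min_samples).

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    n = len(points)
    # eps-neighborhood of every point (each point is its own neighbor)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    kinds = []
    for i in range(n):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            kinds.append("border")  # not dense itself, but next to a core point
        else:
            kinds.append("noise")
    return kinds

# Hypothetical sample data
points = [(0, 0), (0, 1), (1, 0), (2, 0), (10, 10)]
print(classify_points(points, eps=1.0, min_pts=3))
# ['core', 'border', 'core', 'border', 'noise']
```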


    How DBSCAN Works

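    1. Select an arbitrary unvisited point

    2. Find all points in its ε-neighborhood

    3. If it has ≥ MinPts neighbors → core point, start a new cluster; otherwise mark it as noise for now

    4. Expand the cluster by recursively adding the neighbors of every core point reached

    5. Repeat until every point has been visited

    6. Points that never join a cluster → noise/outliers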
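The steps above can be sketched from scratch in plain Python. This is a teaching sketch under stated assumptions, not the scikit-learn implementation: it visits points in index order rather than at random, uses a brute-force Euclidean neighbor search, counts a point as its own neighbor, and expands clusters with a queue instead of recursion. The function name dbscan and the sample data are illustrative.

```python
import math
from collections import deque

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def region_query(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster_id = -1
    for i in range(n):                      # steps 1 and 5: visit every point
        if labels[i] is not None:
            continue
        neighbors = region_query(i)         # step 2: check the eps-neighborhood
        if len(neighbors) < min_pts:
            labels[i] = NOISE               # may be relabeled as a border point later
            continue
        cluster_id += 1                     # step 3: core point starts a cluster
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:                        # step 4: expand the cluster
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id      # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is a core point, keep growing
    return labels                           # step 6: leftover NOISE labels are outliers

points = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)]
print(dbscan(points, eps=3, min_pts=2))  # [0, 0, 0, 1, 1, -1]
```

The queue avoids deep recursion on large clusters; the result is the same as a recursive expansion.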

    Example: DBSCAN

DBSCAN Clustering Example in Python for Detecting Clusters and Outliers

This Python example demonstrates how to use the DBSCAN clustering algorithm to group data points based on density. The code creates a dataset, applies DBSCAN with specified eps (neighborhood radius) and min_samples, and predicts cluster labels. It also identifies noise or outlier points (labeled as -1) and visualizes the clusters using Matplotlib.

# Step 1: Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Step 2: Create Dataset
X = np.array([
    [1, 2],
    [2, 2],
    [2, 3],
    [8, 7],
    [8, 8],
    [25, 80]
])

# Step 3: Create DBSCAN Model
# eps = neighborhood radius, min_samples = MinPts
dbscan = DBSCAN(eps=3, min_samples=2)
labels = dbscan.fit_predict(X)

# Step 4: Print Cluster Labels
print("Cluster Labels:", labels)  
# -1 means noise/outlier

# Step 5: Plot Clusters
plt.scatter(X[:,0], X[:,1], c=labels, cmap='plasma', s=100)
plt.title("DBSCAN Clustering")
plt.show()
Output:

Cluster Labels: [ 0  0  0  1  1 -1]

    Noise & Outliers

    • DBSCAN can detect outliers automatically

    • Points labeled -1 → noise

    • This is an advantage over K-Means, which assigns every point to a cluster, outliers included
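Once the labels are available, separating outliers from clustered points is a simple filter on the value -1. The sketch below reuses the labels and points from the example above as plain Python lists; the variable names are illustrative.

```python
# Labels and points taken from the example above
labels = [0, 0, 0, 1, 1, -1]
points = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)]

# Number of clusters, ignoring the noise label -1
n_clusters = len(set(labels) - {-1})

# Split the data into outliers and clustered points
outliers = [p for p, lab in zip(points, labels) if lab == -1]
clustered = [p for p, lab in zip(points, labels) if lab != -1]

print("Clusters:", n_clusters)   # Clusters: 2
print("Outliers:", outliers)     # Outliers: [(25, 80)]
```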

    Advantages & Limitations

    Advantages

    1. Detects clusters of arbitrary shape

    2. Handles outliers/noise naturally

    3. No need to specify number of clusters

    Limitations

    1. Sensitive to ε and MinPts parameters

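    2. Less effective when clusters have varying densities, since one ε and MinPts apply to the whole dataset

    3. Performance decreases on high-dimensional data, where distances become less informative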
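The sensitivity to ε can be seen by counting how many points end up as noise for different radii. This is a rough sketch, assuming Euclidean distance and that a point counts as its own neighbor; count_noise is a hypothetical helper, and the data is the six-point set from the example.

```python
import math

def count_noise(points, eps, min_pts):
    """Count the points that are neither core points nor within eps of one."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    # A point is noise when no core point (itself included) lies within eps of it
    return sum(1 for nb in neighbors if not any(is_core[j] for j in nb))

points = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)]

for eps in (0.5, 3, 80):
    print(eps, count_noise(points, eps, min_pts=2))
# 0.5 6   (radius too small: every point is noise)
# 3 1     (only the far-away point is noise)
# 80 0    (radius too large: even the outlier is absorbed)
```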