DBSCAN, Explained in 5 Minutes. Fastest implementation in python🐍 | by Aleksei Rozanov | Aug, 2024

Fastest implementation in python🐍

What’s DBSCAN [1]? How to build it in python? There are many articles covering this topic, but I think the algorithm itself is so simple and intuitive that it’s possible to explain its idea in just 5 minutes, so let’s try to do that.

DBSCAN = Density-Based Spatial Clustering of Applications with Noise

What does it mean?

The algorithm searches for clusters inside the data based on the spatial distance between objects.
The algorithm can identify outliers (noise).

Why do you need DBSCAN at all???

Extract a new feature. If the dataset you’re dealing with is large, it might be helpful to find obvious clusters inside the data and work with each cluster separately (train different models for different clusters).
Compress the data. Often we have to deal with millions of rows, which is expensive computationally and time consuming. Clustering the data and then keeping only X% from each cluster might save your wicked data science soul. Therefore, you’ll keep the balance inside the dataset, but reduce its size.
Novelty detection. It’s been mentioned before that DBSCAN detects noise, but the noise might be a previously unknown feature of the dataset, which you can preserve and use in modeling.

Then you may say: but there is the super-reliable and effective k-means algorithm.

Yes, but the sweetest part about DBSCAN is that it overcomes the drawbacks of k-means, and you don’t need to specify the number of clusters. DBSCAN detects clusters for you!

DBSCAN has two components defined by a user: vicinity, or radius (𝜀), and the number of neighbors (N).

For a dataset consisting of some objects, the algorithm is based on the following ideas:

Core objets. An object is called a core object if within distance 𝜀 it has at least N other objects.
An non-core object lying within 𝜀-vicinity of a core-point is called a border object.
A core object forms a cluster with all the core and border objects within 𝜀-vicinity.
If an object is neither core or border, it’s called noise (outlier). It doesn’t belong to any cluster.

To implement DBSCAN it’s necessary to create a distance function. In this article we will be using the Euclidean distance:

The pseudo-code for our algorithm looks like this:

As always the code of this article you can find on my GitHub.

Let’s begin with the distance function:

def distances(object, data):
euclidean = []
for row in data: #iterating through all the objects in the dataset
d = 0
for i in range(data.shape[1]): #calculating sum of squared residuals for all the coords
d+=(row[i]-object[i])**2
euclidean.append(d**0.5) #taking a sqaure root
return np.array(euclidean)

Now let’s build the body of the algorithm:

def DBSCAN(data, epsilon=0.5, N=3):
visited, noise = [], [] #lists to collect visited points and outliers
clusters = [] #list to collect clusters
for i in range(data.shape[0]): #iterating through all the points
if i not in visited: #getting in if the point's not visited
visited.append(i)
d = distances(data[i], data) #getting distances to all the other points
neighbors = list(np.where((d<=epsilon)&(d!=0))[0]) #getting the list of neighbors in the epsilon vicinity and removing distance = 0 (it's the point itself)
if len(neighbors)<N: #if the number of object is less than N, it's an outlier
noise.append(i)
else:
cluster = [i] #otherwise it forms a new cluster
for neighbor in neighbors: #iterating trough all the neighbors of the point i
if neighbor not in visited: #if neighbor isn't visited
visited.append(neighbor)
d = distances(data[neighbor], data) #get the distances to other objects from the neighbor
neighbors_idx = list(np.where((d<=epsilon)&(d!=0))[0]) #getting neighbors of the neighbor
if len(neighbors_idx)>=N: #if the neighbor has N or more neighbors, than it's a core point
neighbors += neighbors_idx #add neighbors of the neighbor to the neighbors of the ith object
if not any(neighbor in cluster for cluster in clusters):
cluster.append(neighbor) #if neighbor is not in clusters, add it there
clusters.append(cluster) #put the cluster into clusters listreturn clusters, noise

Done!

Let’s check the correctness of our implementation and compare it with sklearn.

Let’s generate some synthetic data:

X1 = [[x,y] for x, y in zip(np.random.normal(6,1, 2000), np.random.normal(0,0.5, 2000))]
X2 = [[x,y] for x, y in zip(np.random.normal(10,2, 2000), np.random.normal(6,1, 2000))]
X3 = [[x,y] for x, y in zip(np.random.normal(-2,1, 2000), np.random.normal(4,2.5, 2000))]fig, ax = plt.subplots()
ax.scatter([x[0] for x in X1], [y[1] for y in X1], s=40, c='#00b8ff', edgecolors='#133e7c', linewidth=0.5, alpha=0.8)
ax.scatter([x[0] for x in X2], [y[1] for y in X2], s=40, c='#00ff9f', edgecolors='#0abdc6', linewidth=0.5, alpha=0.8)
ax.scatter([x[0] for x in X3], [y[1] for y in X3], s=40, c='#d600ff', edgecolors='#ea00d9', linewidth=0.5, alpha=0.8)
ax.spines[['right', 'top', 'bottom', 'left']].set_visible(False)
ax.set_xticks([])
ax.set_yticks([])
ax.set_facecolor('black')
ax.patch.set_alpha(0.7)

Let’s apply our implementation and visualize the results:

For sklearn implementation we got the same clusters:

That’s it, they are identical. 5 minutes and we’re done! When you try DBSCANning yourself, don’t forget to tune epsilon and the number of neighbors since they highlt influence the final results.

===========================================

Reference:

[1] Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996, August). Density-based spatial clustering of applications with noise. In Int. Conf. knowledge discovery and data mining (Vol. 240, №6).

[2] Yang, Yang, et al. “An efficient DBSCAN optimized by arithmetic optimization algorithm with opposition-based learning.” The journal of supercomputing 78.18 (2022): 19566–19604.

===========================================

All my publications on Medium are free and open-access, that’s why I’d really appreciate if you followed me here!

P.s. I’m extremely passionate about (Geo)Data Science, ML/AI and Climate Change. So if you want to work together on some project pls contact me in LinkedIn.

🛰️Follow for more🛰️

DBSCAN, Explained in 5 Minutes. Fastest implementation in python🐍 | by Aleksei Rozanov | Aug, 2024

Fastest implementation in python🐍

What does it mean?

Why do you need DBSCAN at all???

Recent Articles

Integrate dynamic web content in your generative AI application using a web search API and Amazon Bedrock Agents

The Future of (Privileged) Access Management

Apartment 7A Brings Atmospheric Backstory to a Celebrated Horror Tale

Quick Hit #21 | CSS-Tricks

Apache Airflow Day 7: Advanced DAG Concepts in Apache Airflow | by Anubhav | Sep, 2024

Related Stories

Leave A Reply Cancel reply