The Density-based Spatial Clustering of Applications with Noise (DBSCAN) and Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithms were among the first clustering approaches developed to handle noisy data effectively. Noise here means data points that seem completely out of place when compared with the rest of the dataset. DBSCAN puts such observations into an unclassified bucket, while BIRCH treats them as outliers and removes them from the dataset.
To execute this recipe, you will need pandas and Scikit. No other prerequisites are required.
Both algorithms can be found in Scikit. To use DBSCAN, use the code found in the clustering_dbscan.py file:
import sklearn.cluster as cl

def findClusters_DBSCAN(data):
    '''
        Cluster data using DBSCAN algorithm
    '''
    # create the classifier object
    dbscan = cl.DBSCAN(eps=1.2, min_samples=200)

    # fit the data
    ...
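To see the "unclassified bucket" behavior in isolation, here is a minimal sketch on a toy dataset. It is not the code from clustering_dbscan.py: the helper name find_noise_dbscan, the toy points, and the eps and min_samples values are assumptions chosen so that the result is easy to check by hand. It relies on the fact that Scikit's DBSCAN assigns the label -1 to observations it considers noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def find_noise_dbscan(data, eps=0.5, min_samples=3):
    '''
        Fit DBSCAN and return the cluster labels together
        with a Boolean mask marking the noise observations.
        (Hypothetical helper; parameter values are illustrative.)
    '''
    model = DBSCAN(eps=eps, min_samples=min_samples).fit(data)
    labels = model.labels_

    # DBSCAN in Scikit marks noise with the label -1
    return labels, labels == -1


# toy data: two tight groups of four points and one far-away point
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [20.0, 20.0],
])

labels, is_noise = find_noise_dbscan(points)

# count the clusters, ignoring the noise label
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print('Clusters found:', n_clusters)
print('Noise observations:', int(is_noise.sum()))
```

On real data, eps and min_samples have to be tuned to the scale and density of the dataset. The values used in the recipe (eps=1.2, min_samples=200) would label every point in this tiny example as noise, because no neighborhood contains 200 observations.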