Apache Spark Graph Processing

Clustering is a learning problem in which a given set of entities, such as objects or people, is partitioned into subsets according to a defined similarity measure. Entities within the same cluster are similar to one another and dissimilar from entities in other clusters. Clustering is an unsupervised method: it operates on unlabeled data, namely the attributes or features of the entities. Clustering methods can be broadly classified into parametric and nonparametric approaches. Parametric approaches impose a probability model on the data; examples include the Gaussian Mixture Model (GMM) and Latent Dirichlet Allocation (LDA). Nonparametric approaches, on the other hand, infer the structure of the clusters from the data itself; examples include k-means and spectral clustering. All of these methods are available in Spark's MLlib library, where spectral clustering is provided in the form of power iteration clustering (PIC).
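To make the idea of partitioning unlabeled data by a similarity measure concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python, using Euclidean distance as the similarity measure. It is not MLlib's implementation and does not use Spark. The function name `kmeans`, the choice of the first k points as initial centroids, and the sample points are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Illustrative Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Assumption: deterministic initialization from the first k points.
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                mean = tuple(sum(x) / len(members) for x in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return assignments, centroids

# Hypothetical unlabeled data: two visually separate groups of 2-D features.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
labels, centers = kmeans(points, k=2)
print(labels)  # cluster index assigned to each point
```

No labels are supplied at any point; the cluster structure comes only from the distances between feature vectors. A production run on large data would use a distributed implementation such as the one in MLlib rather than this single-machine sketch.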

Before we continue, it is important to understand why...

Community clustering in graphs