Apache Spark Graph Processing

Community clustering in graphs


Clustering is a learning problem in which a given set of entities, such as objects or people, is partitioned into subsets according to a defined similarity measure. Entities within the same cluster should be similar to one another and dissimilar from entities in other clusters. Clustering is an unsupervised method. In other words, it operates on unlabeled data, using only the attributes or features of the entities. Clustering methods can be broadly classified into parametric and nonparametric approaches. Parametric approaches impose a probability model on the data. Examples include the Gaussian Mixture Model (GMM) and Latent Dirichlet Allocation (LDA). Nonparametric approaches, on the other hand, infer the structure of the clusters from the data itself. Examples include k-means and spectral clustering. Spark's MLlib library provides GMM, LDA, and k-means, along with Power Iteration Clustering (PIC), a scalable method closely related to spectral clustering.
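To make the idea of partitioning unlabeled data by a similarity measure concrete, here is a minimal sketch of k-means in plain Python. It is for illustration only and is not MLlib's implementation. The function names (`squared_distance`, `mean_point`, `kmeans`), the sample points, and the choice to use the first `k` points as the initial centroids are all assumptions made for this example. Similarity is measured with squared Euclidean distance.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; returns (assignments, centroids)."""
    # Assumption for this sketch: the first k points are the initial centroids.
    # Real implementations typically use a randomized initialization.
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once the partition no longer changes.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return assignments, centroids


# Unlabeled data: two visibly separated groups of 2D feature vectors.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
assignments, centroids = kmeans(points, k=2)
print(assignments)  # [0, 1, 0, 1, 0, 1]
```

Note that no labels are supplied: the grouping comes solely from the distances between feature vectors, which is what makes the method unsupervised.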

Before we continue, it is important to understand why...