Clojure for Data Science

By: Henry Garner
Distributed graph computation with GraphX


GraphX (https://spark.apache.org/graphx/) is a distributed graph processing library that is designed to work with Spark. Like the MLlib library we used in the previous chapter, GraphX provides a set of abstractions that are built on top of Spark's RDDs. By representing the vertices and edges of a graph as RDDs, GraphX is able to process very large graphs in a scalable way.
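To make the idea of storing a graph as two collections concrete, here is a minimal sketch in plain Python. It does not use Spark or the GraphX API: the lists stand in for distributed datasets, and the toy data and the function names `out_degrees` and `label_degrees` are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical toy graph: each list stands in for one distributed collection.
vertices = [(1, "alice"), (2, "bob"), (3, "carol")]  # (vertex id, attribute)
edges = [(1, 2, "follows"), (1, 3, "follows"), (2, 3, "follows")]  # (src, dst, attribute)


def out_degrees(edge_records):
    """Count outgoing edges per source vertex in a single pass over the edges."""
    return Counter(src for src, _dst, _attr in edge_records)


def label_degrees(vertex_records, degrees):
    """Join the per-vertex counts back onto the vertex records by vertex id."""
    return {name: degrees.get(vertex_id, 0) for vertex_id, name in vertex_records}


degree_by_name = label_degrees(vertices, out_degrees(edges))
```

Because the vertices and edges are separate record collections, a computation such as the out-degree becomes a pass over the edge records followed by a join against the vertex records, which is the kind of operation that can be spread across a cluster.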

We've seen in previous chapters how to process a large dataset using MapReduce and Hadoop. Hadoop is an example of a data-parallel system: the dataset is divided into groups that are processed in parallel. Spark is also a data-parallel system: RDDs are distributed across the cluster and processed in parallel.
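The data-parallel pattern can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than how Hadoop or Spark are implemented: a thread pool stands in for the workers of a cluster, and the helper names `partition` and `parallel_sum_of_squares` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split records into at most n_parts contiguous groups of similar size."""
    size = max(1, -(-len(records) // n_parts))  # ceiling division, never zero
    return [records[i:i + size] for i in range(0, len(records), size)]


def sum_of_squares(part):
    """Work done independently on one group of records."""
    return sum(x * x for x in part)


def parallel_sum_of_squares(records, n_parts=4):
    """Process each group on its own worker, then combine the partial results."""
    parts = partition(records, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(sum_of_squares, parts))
    return sum(partials)


total = parallel_sum_of_squares(list(range(10)))
```

Each group is processed without reference to any other group, which is what allows the work to be distributed. The approach relies on the records being independent of one another, as the rows of a table usually are.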

Data-parallel systems are an appropriate way to scale data processing when your data closely resembles a table. Graphs, which may have complex internal structure, are not represented most efficiently as tables. Although graphs can be represented as edge lists, as we've seen...