GraphX does most of its computation by isolating each vertex and its neighbors, which makes it easier to process massive graph data on distributed systems. Neighborhood operations are therefore very important. GraphX provides a mechanism for neighborhood-level computation in the form of the aggregateMessages method. It works in two steps:
In the first step (first function of the method), messages are sent to the destination vertex or the source vertex (similar to the Map function in MapReduce).
In the second step (second function of the method), aggregation is done on these messages (similar to the Reduce function in MapReduce).
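The two steps can be sketched on a toy graph as follows. This is a minimal illustration, not the recipe's own code: the object name `AggregateMessagesSketch`, the helper `incomingCounts`, the vertex labels, and the three edges are all hypothetical, and the sketch assumes Spark with GraphX is on the classpath and runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object AggregateMessagesSketch {

  // Counts incoming edges per vertex using the two functions of aggregateMessages.
  def incomingCounts(sc: SparkContext): Map[VertexId, Int] = {
    // Hypothetical toy data: vertices 2 and 3 point at vertex 1, vertex 5 points at vertex 4.
    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d"), (5L, "e")))
    val edges = sc.parallelize(Seq(
      Edge(2L, 1L, "follows"),
      Edge(3L, 1L, "follows"),
      Edge(5L, 4L, "follows")))
    val graph = Graph(vertices, edges)

    val counts = graph.aggregateMessages[Int](
      ctx => ctx.sendToDst(1), // step 1: every edge sends the message 1 to its destination vertex
      _ + _                    // step 2: messages that arrive at the same vertex are summed
    )
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("aggregate-messages-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try incomingCounts(sc).toSeq.sortBy(_._1).foreach(println)
    finally sc.stop()
  }
}
```

Vertices that receive no message do not appear in the result, so a vertex with no incoming edges is absent from the map rather than listed with a count of 0.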
Let's build a small dataset of the followers:
| Follower | Followee |
|---|---|
| John | Barack |
| Pat | Barack |
| Gary | Barack |
| Chris | Mitt |
| Rob | Mitt |
Our goal is to find out how many followers each node has. Let's load this data in the form of two files: nodes.csv and edges.csv.
The following is the content of nodes.csv:
1,Barack
2,John
3,Pat...