GraphX does most of its computation by isolating each vertex and its neighbors, which makes it easier to process massive graph data on distributed systems. This makes neighborhood operations very important. GraphX provides a mechanism for performing them at the neighborhood level in the form of the aggregateMessages method. It works in two steps:
- In the first step (the first function of the method), messages are sent to the destination vertex or the source vertex (similar to the Map function in MapReduce).
- In the second step (the second function of the method), these messages are aggregated (similar to the Reduce function in MapReduce).
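The two steps can be sketched on a tiny in-memory graph. This is a minimal illustration rather than the recipe's own code: the toy vertices, the edge attribute "follows", and the application name are hypothetical, and it assumes Spark with GraphX is on the classpath (for example, inside spark-shell).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexRDD}

// Reuse a running context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("aggregate-messages-sketch").setMaster("local[*]"))

// Hypothetical toy graph: vertices 2 and 3 each have an edge pointing at vertex 1.
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges = sc.parallelize(Seq(Edge(2L, 1L, "follows"), Edge(3L, 1L, "follows")))
val graph = Graph(vertices, edges)

// Step 1 (first function): each edge sends the message 1 to its destination vertex.
// Step 2 (second function): messages that arrive at the same vertex are summed.
val inCounts: VertexRDD[Int] = graph.aggregateMessages[Int](
  ctx => ctx.sendToDst(1),
  (a, b) => a + b
)

// Only vertices that received at least one message appear in the result.
val result = inCounts.collect().toMap
```

Sending to the source vertex instead would use ctx.sendToSrc in the first function; the second function is unchanged.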
Let's build a small dataset of the followers:
| Follower | Followee |
| --- | --- |
| John | Barack |
| Pat | Barack |
| Gary | Barack |
| Chris | Mitt |
| Rob | Mitt |
Our goal is to find out how many followers each node has. Let's load this data in the form of two files: nodes.csv and edges.csv.
The following is the content of nodes.csv:
1,Barack
2,John
3,Pat
4,Gary
5,Mitt...