As everything in the Spark world has moved to DataFrames, it is natural to wonder why GraphX is still RDD-based. This is where GraphFrames comes into the picture. GraphFrames is not yet part of the core Spark library; it is developed separately as a Spark package, and it is expected to be included in the main API once it is considered stable enough.
In this recipe, we will explore GraphFrames. A GraphFrame is built from two primary DataFrames:
- The vertices DataFrame, which needs to have a mandatory column called id
- The edges DataFrame, which needs to have two mandatory columns, src and dst, holding the ids of the source and destination vertices
Besides these requirements, both the vertices and edges DataFrames can have any number of additional columns, of any type, to represent attributes.
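The column requirements above can be illustrated without Spark at all. The following is a minimal sketch in plain Python, not the GraphFrames API: rows are modeled as dictionaries, and the names validate_graph and missing_columns, along with the sample data, are hypothetical and chosen only for illustration.

```python
# Plain-Python illustration of the GraphFrames column contract.
# This is NOT the GraphFrames API: rows are modeled as dicts, and every
# name below (validate_graph, missing_columns, the sample data) is hypothetical.

REQUIRED_VERTEX_COLUMNS = {"id"}
REQUIRED_EDGE_COLUMNS = {"src", "dst"}


def missing_columns(rows, required):
    """Return the required column names that are absent from at least one row."""
    missing = set()
    for row in rows:
        missing |= required - set(row)
    return missing


def validate_graph(vertices, edges):
    """Raise ValueError if a mandatory column is missing; otherwise return None."""
    missing_v = missing_columns(vertices, REQUIRED_VERTEX_COLUMNS)
    if missing_v:
        raise ValueError(f"vertices missing columns: {sorted(missing_v)}")
    missing_e = missing_columns(edges, REQUIRED_EDGE_COLUMNS)
    if missing_e:
        raise ValueError(f"edges missing columns: {sorted(missing_e)}")


# Sample data: the mandatory columns plus arbitrary attribute columns.
vertices = [
    {"id": "a", "name": "Alice", "age": 34},
    {"id": "b", "name": "Bob", "age": 36},
]
edges = [
    {"src": "a", "dst": "b", "relationship": "friend"},
]

validate_graph(vertices, edges)  # passes silently: all mandatory columns exist
```

In a real Spark session, the same two tables would be DataFrames rather than lists of dictionaries; that step is left out here because it requires the GraphFrames package and a running Spark session.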