Now that most of the Spark world has moved to DataFrames, it is natural to wonder why GraphX is still RDD-based. This is where GraphFrames comes into the picture. GraphFrames is not yet included in the Spark library itself and is being developed separately as a Spark package. It is just a matter of time before it is considered stable enough to be included in the main API.
In this recipe, we will get to know GraphFrames. A GraphFrame has two primary DataFrames:
- The vertices DataFrame, which needs to have a mandatory column called id
- The edges DataFrame, which needs to have two mandatory columns, src and dst
Besides these requirements, both the vertices and edges DataFrames can have any number of additional columns, of any type, to represent attributes.
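The column requirements above can be illustrated without a Spark installation. The following is a minimal plain-Python sketch, not the GraphFrames API: each dict stands in for one DataFrame row, and the helper `missing_columns`, the constants, and the sample data are all hypothetical names introduced here for illustration.

```python
# Required column names, as stated in the text above.
REQUIRED_VERTEX_COLUMNS = {"id"}
REQUIRED_EDGE_COLUMNS = {"src", "dst"}


def missing_columns(rows, required):
    """Return the required column names that are absent from at least one row."""
    missing = set()
    for row in rows:
        missing |= required - set(row)
    return missing


# Hypothetical sample data: each dict stands in for one DataFrame row.
# Columns other than id, src, and dst are free-form attributes.
vertices = [
    {"id": "a", "name": "Alice", "age": 34},
    {"id": "b", "name": "Bob", "age": 36},
    {"id": "c", "name": "Charlie", "age": 30},
]
edges = [
    {"src": "a", "dst": "b", "relationship": "friend"},
    {"src": "b", "dst": "c", "relationship": "follow"},
]

vertex_problems = missing_columns(vertices, REQUIRED_VERTEX_COLUMNS)
edge_problems = missing_columns(edges, REQUIRED_EDGE_COLUMNS)

print("missing vertex columns:", sorted(vertex_problems))
print("missing edge columns:", sorted(edge_problems))
```

With the GraphFrames package available, DataFrames laid out this way would be passed as the vertices and edges arguments when constructing a GraphFrame.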