So far, we have covered three data sources that are built into DataFrames: parquet (the default), json, and jdbc. DataFrames are not limited to these three; they can load from and save to any arbitrary data source if you specify the format manually.
In this recipe, we will cover loading and saving data from arbitrary sources.
Start the Spark shell and give it some extra memory:
$ spark-shell --driver-memory 1G
Load the data from Parquet; since parquet is the default data source, you do not have to specify it:
scala> val people = sqlContext.read.load("hdfs://localhost:9000/user/hduser/people.parquet")
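Saving mirrors loading through the DataFrame's write interface; with no format specified, the default (parquet) is used again. A minimal sketch, where the output path is an assumption for illustration:
scala> people.write.save("hdfs://localhost:9000/user/hduser/people_out.parquet")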
Load the data from Parquet by manually specifying the format:
scala> val people = sqlContext.read.format("org.apache.spark.sql.parquet").load("hdfs://localhost:9000/user/hduser/people.parquet")
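The same fully qualified format name can be passed on the write side as well; a sketch, assuming a hypothetical output path:
scala> people.write.format("org.apache.spark.sql.parquet").save("hdfs://localhost:9000/user/hduser/people_out.parquet")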
For the built-in data sources (parquet, json, and jdbc), you do not have to specify the full format name; specifying "parquet", "json", or "jdbc" is enough.
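For example, the short name "json" resolves to the built-in JSON source, so a round trip between formats needs no fully qualified class names. A sketch, with hypothetical HDFS paths:
scala> val peopleJson = sqlContext.read.format("json").load("hdfs://localhost:9000/user/hduser/people.json")
scala> peopleJson.write.format("parquet").save("hdfs://localhost:9000/user/hduser/people_from_json.parquet")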