So far, we have covered five data sources that are built into DataFrames: Parquet (the default), text, json, csv, and jdbc. DataFrames are not limited to these five; they can load from and save to any arbitrary data source by specifying the format manually.
In this recipe, we will cover loading and saving data from arbitrary sources.
- Start the Spark shell:
$ spark-shell
- Load the data from Parquet; since parquet is the default data source, you do not have to specify it:
scala> val people = spark.read.load("hdfs://localhost:9000/user/hduser/people.parquet")
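As an aside, Spark's `DataFrameReader` also provides dedicated shorthand methods for the built-in sources, so the same load can be written without `load` at all. A minimal sketch (the HDFS path is the same hypothetical one used above):

```scala
// Equivalent shorthand: read.parquet(...) is sugar for
// read.format("parquet").load(...). Similar methods exist for
// json, csv, and text sources.
val people = spark.read.parquet("hdfs://localhost:9000/user/hduser/people.parquet")
```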
- Load the data from parquet by manually specifying the format:
scala> val people = spark.read.format("parquet").load("hdfs://localhost:9000/user/hduser/people.parquet")
- For built-in data sources, you do not have to specify the fully qualified format name; short names such as "parquet", "json", or "jdbc" work:
scala> val people = spark.read.format("parquet").load
...
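The recipe also covers saving to arbitrary sources, which mirrors the reader API: `DataFrame.write` returns a `DataFrameWriter` whose `format` and `save` methods work analogously. A minimal sketch, assuming the `people` DataFrame loaded above and a hypothetical output path:

```scala
// Write the DataFrame out, manually specifying the format.
// mode("overwrite") replaces any existing output at the path;
// the HDFS path below is hypothetical — adjust to your setup.
people.write
  .format("json")
  .mode("overwrite")
  .save("hdfs://localhost:9000/user/hduser/people.json")
```

For a source that is not built in, you would pass its fully qualified data source name to `format` instead of a short name.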