
Spark Cookbook

By: Rishi Yadav


Loading and saving data from an arbitrary source


So far, we have covered the three data sources that come built in with DataFrames: Parquet (the default), JSON, and JDBC. DataFrames are not limited to these three; they can load from and save to any arbitrary data source if you specify the format manually.

In this recipe, we will cover loading and saving data from arbitrary sources.

How to do it...

  1. Start the Spark shell and give it some extra memory:

    $ spark-shell --driver-memory 1G
    
  2. Load the data from Parquet; since parquet is the default data source, you do not have to specify it:

    scala> val people = sqlContext.read.load("hdfs://localhost:9000/user/hduser/people.parquet") 
    
  3. Load the data from Parquet by manually specifying the format:

    scala> val people = sqlContext.read.format("org.apache.spark.sql.parquet").load("hdfs://localhost:9000/user/hduser/people.parquet") 
    
  4. For the inbuilt data sources (parquet, json, and jdbc), you do not have to specify the full format name; the short name "parquet", "json", or "jdbc" is enough:

    scala> val people = sqlContext.read.format("parquet").load("hdfs://localhost:9000/user/hduser/people.parquet") 
    
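Saving works symmetrically through the write interface. The following sketch (the output paths are illustrative, not from the recipe) shows how the same short or fully qualified format names apply when writing a DataFrame out:

```scala
// Save in the default format (Parquet); no format needs to be specified
scala> people.write.save("hdfs://localhost:9000/user/hduser/people_out.parquet")

// Save as JSON by specifying the short format name
scala> people.write.format("json").save("hdfs://localhost:9000/user/hduser/people_out.json")

// The fully qualified name works as well, just as it does for reads
scala> people.write.format("org.apache.spark.sql.json").save("hdfs://localhost:9000/user/hduser/people_out2.json")
```

For third-party data sources, the format string is the source's fully qualified class name or its registered short name, exactly as on the read side.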