Spark Cookbook

By: Rishi Yadav

Loading data from HDFS using a custom InputFormat


Sometimes you need to load data in a specific format for which TextInputFormat is not a good fit. Spark provides two methods for this purpose:

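  • sparkContext.hadoopFile: This supports the old MapReduce API (InputFormats in the org.apache.hadoop.mapred package)

  • sparkContext.newAPIHadoopFile: This supports the new MapReduce API (InputFormats in the org.apache.hadoop.mapreduce package)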

These two methods support all of Hadoop's built-in InputFormat implementations as well as any custom InputFormat.
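
As a rough illustration of the two methods, the following Scala sketch reads tab-separated key-value text through each API. It assumes Spark and the Hadoop client libraries are on the classpath. The object name, the helper names, and the input path (taken from the first command-line argument) are hypothetical and are not part of the recipe.

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{KeyValueTextInputFormat => OldKeyValueTextInputFormat}
import org.apache.hadoop.mapreduce.lib.input.{KeyValueTextInputFormat => NewKeyValueTextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Illustrative sketch only; the object and method names are hypothetical.
object KeyValueLoadSketch {

  // Old MapReduce API: the InputFormat comes from org.apache.hadoop.mapred.
  def loadWithOldApi(sc: SparkContext, path: String): RDD[(String, String)] =
    sc.hadoopFile[Text, Text](
        path, classOf[OldKeyValueTextInputFormat], classOf[Text], classOf[Text])
      // Hadoop record readers reuse Text instances, so copy them to Strings
      // before collecting or caching.
      .map { case (key, value) => (key.toString, value.toString) }

  // New MapReduce API: the InputFormat comes from org.apache.hadoop.mapreduce.
  def loadWithNewApi(sc: SparkContext, path: String): RDD[(String, String)] =
    sc.newAPIHadoopFile[Text, Text, NewKeyValueTextInputFormat](
        path, classOf[NewKeyValueTextInputFormat], classOf[Text], classOf[Text])
      .map { case (key, value) => (key.toString, value.toString) }

  def main(args: Array[String]): Unit = {
    // Assumption: the caller supplies a local or HDFS path as the first argument.
    val path = args(0)
    val conf = new SparkConf()
      .setAppName("KeyValueLoadSketch")
      .setIfMissing("spark.master", "local[*]") // run locally unless a master is configured
    val sc = new SparkContext(conf)
    try {
      loadWithOldApi(sc, path).collect().foreach(println)
      loadWithNewApi(sc, path).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Both helpers return the same (key, value) pairs. The only difference is which generation of the Hadoop InputFormat class is handed to Spark, so the choice usually depends on which API a given InputFormat was written against.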

How to do it...

We are going to create text data in key-value format and load it into Spark using KeyValueTextInputFormat:

  1. Create the currency directory by using the following command:

    $ mkdir currency
  2. Change the current directory to currency:

    $ cd currency
  3. Create the na.txt text file and enter currency values as key-value pairs, separating each key from its value with a tab character (key: country, value: currency):

    $ vi na.txt
    United States of America        US Dollar
    Canada  Canadian Dollar
    Mexico  Peso
    

    You can create a similar file for each of the other continents.

  4. Upload the currency folder to HDFS:

    $ hdfs dfs -put...