Hadoop MapReduce v2 Cookbook - Second Edition: RAW

Joining two datasets using Pig


This recipe shows how to join two datasets using Pig. Using the BookCrossing dataset, we join the Books dataset with the Book-Ratings dataset and find the distribution of high ratings (rating > 3) across authors.
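To make the intended result concrete before running anything on Hadoop, the same logic can be sketched in plain Python on a few made-up rows. This is only an illustration of the join, filter, and group steps described above, not the recipe's Pig script. The sample rows, the three-column layouts, and the `high_rating_counts` helper are all hypothetical, and the sketch assumes each ISBN appears only once in the Books data.

```python
from collections import Counter

# Hypothetical sample rows; the real BookCrossing files have more
# columns and their exact layout is not shown in this recipe.
books = [
    # (isbn, title, author)
    ("111", "Book A", "Author X"),
    ("222", "Book B", "Author Y"),
    ("333", "Book C", "Author X"),
]
ratings = [
    # (user_id, isbn, rating)
    (1, "111", 5),
    (2, "111", 2),   # not a high rating, filtered out
    (3, "222", 4),
    (4, "333", 7),
    (5, "999", 9),   # no matching book, dropped by the inner join
]


def high_rating_counts(books, ratings, threshold=3):
    """Count ratings above `threshold` per author (inner join on ISBN)."""
    # Build a lookup table for the join key; assumes ISBNs are unique.
    author_by_isbn = {isbn: author for isbn, _title, author in books}
    counts = Counter()
    for _user, isbn, rating in ratings:
        # Keep only high ratings that have a matching book.
        if rating > threshold and isbn in author_by_isbn:
            counts[author_by_isbn[isbn]] += 1
    return dict(counts)


result = high_rating_counts(books, ratings)
print(result)  # {'Author X': 2, 'Author Y': 1}
```

Pig performs the equivalent steps as distributed jobs over the files in HDFS, so the in-memory lookup table here is only a stand-in for the join.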

How to do it...

This section describes how to use a Pig Latin script to find the distribution of high review ratings by author by joining the Books dataset with the Book-Ratings dataset:

  1. Extract the BookCrossing sample dataset (chapter6-bookcrossing-data.tar.gz) from the chapter6 folder of the code repository.

  2. Create a directory in HDFS and copy the BookCrossing Books dataset and the Book-Ratings dataset to that directory, as follows:

    $ hdfs dfs -mkdir book-crossing
    $ hdfs dfs -copyFromLocal \
    chapter6/data/BX-Books-Prepro.txt book-crossing
    $ hdfs dfs -copyFromLocal \
    BX-Book-Ratings-Prepro.txt book-crossing
    
  3. Review the chapter7/pig-scripts/book-ratings-join.pig script.

  4. Execute the preceding Pig Latin script...