Book Image

Learning Hadoop 2

Book Image

Learning Hadoop 2

Overview of this book

Table of Contents (18 chapters)
Learning Hadoop 2
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

Prerequisites


Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this chapter. We'll create a modified version of a former Pig script as the main functionality for this. The script in this chapter assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS. The full source code is at https://github.com/learninghadoop2/book-examples/blob/master/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows:

-- load JSON data
tweets = load '$inputDir' using  com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
-- Tweets
tweets_tsv = foreach tweets {
generate 
    (chararray)CustomFormatToISO($0#'created_at', 
'EEE MMMM d HH:mm:ss Z y') as dt, 
    (chararray)$0#'id_str', 
(chararray)$0#'text' as text, 
    (chararray)$0#'in_reply_to', 
(boolean)$0#'retweeted' as is_retweeted, 
(chararray)$0#'user'#'id_str' as user_id, (chararray)$0#'place'#'id' as place_id;
}
store tweets_tsv...