HDInsight Essentials - Second Edition

By: Rajesh Nadipalli

Transformation for the OTP project


Let's take a look at a practical example of a transformation using our Airline Ontime Performance (OTP) project. Our transformation task is to get from the source stage table that we created in Chapter 5, Ingest and Organize Data Lake (shown on the left-hand side), to summary data aggregated by airline carrier, year, and month (shown on the right-hand side).

To achieve the preceding transformation, we need to perform the following key steps:

  1. Clean the header line in each file that has the field names.

  2. Update the flight month from the current "MM" to "YYYYMM" format.

  3. Create an intermediate table with the refined data from the previous two steps.

  4. Aggregate the data from the refined data to a summary table at flight year, flight month, and carrier levels.
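The four steps above can be sketched in plain Python to make the logic concrete before moving to Pig and Hive. This is only an illustration under stated assumptions, not the book's script: the column layout (YEAR, MONTH, CARRIER, ARR_DELAY), the sample rows, and the choice of aggregated measures (flight count and average arrival delay) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical raw stage rows. Each source file begins with a header line.
# The column names and values are assumptions made for this illustration.
RAW_LINES = [
    "YEAR,MONTH,CARRIER,ARR_DELAY",
    "2013,01,AA,12",
    "2013,01,AA,-3",
    "2013,02,DL,7",
    "YEAR,MONTH,CARRIER,ARR_DELAY",  # header line from a second file
    "2013,02,DL,20",
]


def is_header(line):
    """Step 1: recognize the header line that carries the field names."""
    return line.startswith("YEAR,")


def refine(lines):
    """Steps 1-3: drop headers, rewrite the month as YYYYMM, keep refined rows."""
    refined = []
    for line in lines:
        if is_header(line):
            continue
        year, month, carrier, delay = line.split(",")
        flight_month = f"{int(year):04d}{int(month):02d}"  # "MM" -> "YYYYMM"
        refined.append((int(year), flight_month, carrier, int(delay)))
    return refined


def summarize(refined):
    """Step 4: aggregate at flight year, flight month, and carrier levels."""
    totals = defaultdict(lambda: [0, 0])  # key -> [flight count, delay sum]
    for year, flight_month, carrier, delay in refined:
        entry = totals[(year, flight_month, carrier)]
        entry[0] += 1
        entry[1] += delay
    return {
        key: {"flights": count, "avg_arr_delay": delay_sum / count}
        for key, (count, delay_sum) in totals.items()
    }


refined_rows = refine(RAW_LINES)
summary = summarize(refined_rows)
```

Here `refined_rows` plays the role of the intermediate table and `summary` the role of the summary table, keyed by year, month, and carrier.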

We will use Pig to clean up the data and then use Hive to preserve the results, as shown in the following figure:

Cleaning data using Pig

The following is a Pig script (saved as Cleanotpraw.pig.txt)...