Learning Hadoop 2

By: GABRIELE MODENA

Overview of this book

If you are a system or application developer interested in learning how to solve practical problems using the Hadoop framework, then this book is ideal for you. You are expected to be familiar with the Unix/Linux command-line interface and have some experience with the Java programming language. Familiarity with Hadoop would be a plus.

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

1. Introduction

A note on versioning

The background of Hadoop

Components of Hadoop

Hadoop 2 – what's the big deal?

Distributions of Apache Hadoop

A dual approach

AWS – infrastructure on demand from Amazon

Getting started

Running the examples

Data processing with Hadoop

Summary

2. Storage

The inner workings of HDFS

Command-line access to the HDFS filesystem

Protecting the filesystem metadata

Apache ZooKeeper – a different type of filesystem

Automatic NameNode failover

HDFS snapshots

Hadoop filesystems

Managing and serializing data

Storing data

Summary

3. Processing – MapReduce and Beyond

MapReduce

Java API to MapReduce

Writing MapReduce programs

Walking through a run of a MapReduce job

YARN

YARN in the real world – Computation beyond MapReduce

Summary

4. Real-time Computation with Samza

Stream processing with Samza

Summary

5. Iterative Computation with Spark

Apache Spark

The Spark ecosystem

Processing data with Apache Spark

Comparing Samza and Spark Streaming

Summary

6. Data Analysis with Apache Pig

An overview of Pig

Getting started

Running Pig

Fundamentals of Apache Pig

Programming Pig

Extending Pig (UDFs)

Analyzing the Twitter stream

Summary

7. Hadoop and SQL

Why SQL on Hadoop

Prerequisites

Hive architecture

Hive and Amazon Web Services

Extending HiveQL

Programmatic interfaces

Stinger initiative

Impala

Summary

8. Data Lifecycle Management

What data lifecycle management is

Building a tweet analysis capability

Challenges of external data

Collecting additional data

Pulling it all together

Summary

9. Making Development Easier

Choosing a framework

Hadoop streaming

Kite Data

Apache Crunch

Summary

10. Running a Hadoop Cluster

I'm a developer – I don't care about operations!

Cloudera Manager

Ambari – the open source alternative

Operations in the Hadoop 2 world

Sharing resources

Building a physical cluster

Building a cluster on EMR

Cluster tuning

Security

Monitoring

Troubleshooting

Summary

11. Where to Go Next

Alternative distributions

Other computational frameworks

Other interesting projects

Other programming abstractions

AWS resources

Sources of information

Summary

Index

Chapter 7. Hadoop and SQL

MapReduce is a powerful paradigm for complex data processing that can reveal valuable insights. As discussed in earlier chapters, however, it requires a different mindset, along with training and experience in breaking analytics down into a series of map and reduce steps. Several products built atop Hadoop provide higher-level or more familiar views of the data held in HDFS, and Pig is a very popular one. This chapter explores the other most common abstraction implemented atop Hadoop: SQL.
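
To make the contrast concrete, the following is a minimal sketch rather than anything taken from the book. It counts hashtags twice: once as explicit map and reduce steps, and once as a single declarative SQL query. The sample data, the function names, and the table layout are all hypothetical, and Python's built-in SQLite module stands in for a SQL engine purely for illustration; it is not Hive and does not run on Hadoop.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (username, hashtag) pairs standing in for tweets.
TWEETS = [
    ("alice", "hadoop"),
    ("bob", "hive"),
    ("alice", "hive"),
    ("carol", "hadoop"),
    ("bob", "hadoop"),
]


def map_step(record):
    """Map: emit a (key, 1) pair for one input record."""
    _username, hashtag = record
    return (hashtag, 1)


def reduce_step(pairs):
    """Reduce: group the emitted pairs by key and sum the values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_with_map_reduce(records):
    """Run the two steps in sequence, in-process (no cluster involved)."""
    return reduce_step(map_step(record) for record in records)


def count_with_sql(records):
    """Express the same aggregation as one query (SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE tweets (username TEXT, hashtag TEXT)")
        conn.executemany("INSERT INTO tweets VALUES (?, ?)", records)
        rows = conn.execute(
            "SELECT hashtag, COUNT(*) FROM tweets GROUP BY hashtag"
        ).fetchall()
        return dict(rows)
    finally:
        conn.close()


mr_counts = count_with_map_reduce(TWEETS)
sql_counts = count_with_sql(TWEETS)
print(mr_counts == sql_counts)
```

Both functions produce the same per-hashtag counts. The difference is in what the author has to specify: the first version spells out how records are keyed and combined, while the query states only the desired result and leaves the execution plan to the engine.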

In this chapter, we will cover the following topics:

What the use cases for SQL on Hadoop are and why it is so popular
HiveQL, the SQL dialect introduced by Apache Hive
Using HiveQL to perform SQL-like analysis of the Twitter dataset
How HiveQL can approximate common features of relational databases such as joins and views
How HiveQL allows the incorporation of user-defined functions into its queries
How SQL on Hadoop...
