Learning Hadoop 2

Running the examples

The source code of all examples is available at https://github.com/learninghadoop2/book-examples.

Gradle (http://www.gradle.org/) scripts and configurations are provided to compile most of the Java code. The gradlew script included with the examples bootstraps Gradle and uses it to fetch dependencies and compile the code.

JAR files can be created by invoking the jar task via the gradlew script, as follows:

$ ./gradlew jar

Jobs are usually executed by submitting a JAR file using the hadoop jar command, as follows:

$ hadoop jar example.jar <MainClass> [-libjars $LIBJARS] arg1 arg2 … argN

The optional -libjars parameter takes a comma-separated list of third-party JAR files that the job needs at runtime; Hadoop ships them to the remote nodes.

Note

Some of the frameworks we will work with, such as Apache Spark, come with their own build and package management tools. Additional information and resources will be provided for these particular cases.

The copyJar Gradle task can be used to download third-party dependencies into build/libjars/<example>/lib, as follows:

$ ./gradlew copyJar
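The JAR files that copyJar downloads can then be joined into the comma-separated list that -libjars expects. The following is a minimal sketch, assuming a bash shell; the join_jars helper and the example name myexample are hypothetical and are not part of the book's repository.

```shell
#!/usr/bin/env bash
# join_jars DIR: print the .jar files found in DIR as a single
# comma-separated list, the format that -libjars expects.
# (Hypothetical helper, not provided by the book's examples.)
join_jars() {
  local dir="$1" list="" jar
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue        # skip the literal glob if DIR has no JARs
    list="${list:+$list,}$jar"       # add a comma only between entries
  done
  printf '%s\n' "$list"
}

# Usage sketch; "myexample" stands in for a real example name:
#   LIBJARS=$(join_jars build/libjars/myexample/lib)
#   hadoop jar example.jar <MainClass> -libjars "$LIBJARS" arg1 arg2
```

If the directory contains no JAR files, the helper prints an empty line, so it is worth checking that LIBJARS is non-empty before passing it to hadoop jar.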

For convenience, we provide a fatJar Gradle task that bundles the example classes and their dependencies into a single JAR file. Although this approach is discouraged in favor of using -libjars, it can come in handy when dealing with dependency issues.

The following command will generate build/libs/<example>-all.jar:

$ ./gradlew fatJar