Hadoop Essentials

By Shiva Achari

Overview of this book

This book jumps into the world of Hadoop and its tools, to help you learn how to use them effectively to optimize and improve the way you handle Big Data. Starting with the fundamentals of Hadoop - YARN, MapReduce, HDFS, and other vital elements of the ecosystem - you will soon learn many exciting topics such as MapReduce patterns, data management, and real-time data analysis using Hadoop. You will also explore a number of the leading data processing tools, including Hive and Pig, and learn how to use Sqoop and Flume, two of the most powerful technologies for data ingestion. With further guidance on data streaming and real-time analytics with Storm and Spark, Hadoop Essentials is a reliable and relevant resource for anyone who understands the difficulties - and opportunities - presented by Big Data today. With this guide, you'll develop your confidence with Hadoop and be able to use your new knowledge and skills to harness its unparalleled capabilities.

Big data use case patterns


There are many technological scenarios, and some of them follow similar patterns. It is a good idea to map scenarios to architectural patterns; once these patterns are understood, they become fundamental building blocks of solutions. We will discuss five types of patterns in the following sections.

Note

These solutions are not always optimal; the best fit may depend on the domain, the type of data, and other factors. The examples are meant to help you visualize a problem and work toward a solution.

Big data as a storage pattern

Big data systems can be used as a storage pattern or as a data warehouse, where data from multiple sources, even of different types, can be stored and utilized later. The usage scenario and use case are as follows:

  • Usage scenario:

    • Data is generated continuously and in large volumes

    • The data needs preprocessing before being loaded into the target system

  • Use case:

    • Machine data captured for subsequent cleansing can be merged into a single big file or multiple big files and loaded into Hadoop for computation, as sketched in the example after this list

    • Unstructured data across multiple sources should be captured for subsequent analysis on emerging patterns

    • Data loaded into Hadoop should be processed and filtered; depending on the data, the final storage can be a data warehouse, Hadoop itself, or a NoSQL system

The storage pattern is shown in the following figure:
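As a concrete starting point, the following is a minimal sketch of the machine-data use case: merging local capture files into one big file in HDFS using the Hadoop FileSystem Java API. The NameNode address, directory paths, and file names are hypothetical placeholders, not values from the book:

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Merges local machine-data capture files into one big HDFS file,
    // so downstream jobs avoid the small-files problem.
    public class MachineDataLoader {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode

            FileSystem local = FileSystem.getLocal(conf);
            FileSystem hdfs = FileSystem.get(conf);

            Path src = new Path("/data/machine-captures");     // hypothetical local dir
            Path dst = new Path("/warehouse/raw/machine.dat"); // hypothetical HDFS file

            try (FSDataOutputStream out = hdfs.create(dst)) {
                for (FileStatus file : local.listStatus(src)) {
                    try (InputStream in = local.open(file.getPath())) {
                        IOUtils.copyBytes(in, out, 4096, false);
                    }
                }
            }
        }
    }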

Big data as a data transformation pattern

Big data systems can be designed to perform transformation as part of the data loading and cleansing activity, and many transformations can be done faster than in traditional systems due to parallelism. Transformation is one phase of the Extract–Transform–Load (ETL) process for data ingestion and cleansing. The usage scenario and use case are as follows:

  • Usage scenario

    • A large volume of raw data to be preprocessed

    • The data includes structured as well as unstructured data

  • Use case

    • ETL (Extract–Transform–Load) tools, for example Pentaho, Talend, and so on, have evolved to leverage big data. In Hadoop, ELT (Extract–Load–Transform) is also trending, as loading into Hadoop is fast and the cleansing and transformation of the input can then run as parallel processes, as sketched in the example after this list

The data transformation pattern is shown in the following figure:
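As an illustration of the transform-after-load idea, here is a minimal sketch of a MapReduce mapper that cleanses raw comma-separated records already loaded into HDFS. The three-field record layout (id, timestamp, value) is a hypothetical example:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Cleanses raw CSV records: drops malformed rows, trims and normalizes fields.
    public class CleansingMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length != 3 || fields[0].trim().isEmpty()) {
                return; // filtered out: the row never reaches the reducer
            }
            String id = fields[0].trim();
            String record = fields[1].trim() + "," + fields[2].trim().toLowerCase();
            context.write(new Text(id), new Text(record));
        }
    }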

Big data for a data analysis pattern

Data analytics is of wide interest in big data systems, where a huge amount of data can be analyzed to generate statistical reports and insights about the data, which are useful for business and for understanding patterns. The usage scenario and use case are as follows:

  • Usage scenario

    • Improved response time for detection of patterns

    • Data analysis for unstructured data

  • Use case

    • Fast turnaround for machine data analysis (for example, analysis of seismic data)

    • Pattern detection across structured and unstructured data (for example, fraud analysis), as sketched in the example after this list
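For the fraud-analysis use case, the following is a minimal sketch of a MapReduce reducer that totals transactions per account and emits only the accounts above a threshold for closer review. The key layout and the cutoff value are hypothetical:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sums transaction counts per account and flags unusually active accounts.
    public class TransactionCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private static final int SUSPICIOUS_THRESHOLD = 1000; // hypothetical cutoff

        @Override
        protected void reduce(Text account, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            if (total > SUSPICIOUS_THRESHOLD) {
                context.write(account, new IntWritable(total)); // flagged for review
            }
        }
    }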

Big data for data in a real-time pattern

Big data systems integrated with streaming libraries and systems are capable of handling real-time data processing at high scale. Real-time processing of large and complex requirements poses many challenges, such as performance, scalability, availability, resource management, and low latency. Streaming technologies such as Storm and Spark Streaming can be integrated with YARN. The usage scenario and use case are as follows:

  • Usage scenario

    • Managing the action to be taken based on continuously changing data in real time

  • Use case

    • Automated process control based on real-time data from manufacturing equipment

    • Real-time changes to plant operations based on events from business systems such as Enterprise Resource Planning (ERP) systems, as sketched in the example after this list

The data in a real-time pattern is shown in the following figure:
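One possible realization is the following minimal sketch using Spark Streaming (which runs on YARN) to flag equipment readings above a limit as they arrive. The host name, port, record format, and threshold are hypothetical:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Watches a stream of sensor readings and prints alerts for high values.
    public class EquipmentMonitor {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("EquipmentMonitor");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

            // Each line is a reading like "machine-7,98.6" (hypothetical format).
            JavaDStream<String> readings = ssc.socketTextStream("sensor-gateway", 9999);
            JavaDStream<String> alerts = readings.filter(
                    line -> Double.parseDouble(line.split(",")[1]) > 90.0);
            alerts.print(); // in practice, trigger a process-control action instead

            ssc.start();
            ssc.awaitTermination();
        }
    }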

Big data for a low latency caching pattern

Big data systems can be tuned as a special case for low latency, where reads far outnumber updates. Frequently read data can be kept in memory so that it is fetched faster, which further improves performance and avoids overheads. The usage scenario and use case are as follows:

  • Usage scenario

    • The read-to-write ratio is very high

    • Reads require very low latency and a guaranteed response

    • Distributed location-based data caching

  • Use case

    • Order promising solutions

    • Cloud-based identity and SSO

    • Low-latency, real-time personalized offers on mobile, as sketched in the example after this list

The low latency caching pattern is shown in the following figure:
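To illustrate the read path, here is a minimal sketch of a low-latency point read from HBase, a NoSQL store whose block cache keeps hot rows in memory. The table name, column family, qualifier, and row key are hypothetical:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Looks up the current offer for a user by row key; point reads like this
    // typically return in milliseconds when the row is cached in memory.
    public class OfferLookup {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table offers = connection.getTable(TableName.valueOf("offers"))) {
                Get get = new Get(Bytes.toBytes("user-42")); // hypothetical row key
                Result result = offers.get(get);
                byte[] offer = result.getValue(Bytes.toBytes("o"), Bytes.toBytes("current"));
                System.out.println(offer == null ? "no offer" : Bytes.toString(offer));
            }
        }
    }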

Some of the technology stacks that are widely used, organized by layer and framework, are shown in the following image: