Hadoop Essentials

By: Shiva Achari
Overview of this book

This book jumps into the world of Hadoop and its tools, to help you learn how to use them effectively to optimize and improve the way you handle Big Data. Starting with the fundamentals of Hadoop: YARN, MapReduce, HDFS, and other vital elements of the Hadoop ecosystem, you will soon learn many exciting topics such as MapReduce patterns, data management, and real-time data analysis using Hadoop. You will also explore a number of the leading data processing tools, including Hive and Pig, and learn how to use Sqoop and Flume, two of the most powerful technologies for data ingestion. With further guidance on data streaming and real-time analytics with Storm and Spark, Hadoop Essentials is a reliable and relevant resource for anyone who understands the difficulties and opportunities presented by Big Data today. With this guide, you'll develop your confidence with Hadoop, and be able to use the knowledge and skills you learn to successfully harness its unparalleled capabilities.
Pillars of Hadoop – HDFS, MapReduce, and YARN

Big data use case patterns

There are many technological scenarios, and some of them are similar in pattern. It is a good idea to map scenarios to architectural patterns; once these patterns are understood, they become the fundamental building blocks of solutions. We will discuss five types of patterns in the following sections.


These solutions are not always optimal; the best choice may depend on the domain, the type of data, or other factors. The examples here are meant to visualize a problem and help in finding a solution.

Big data as a storage pattern

Big data systems can be used as a storage pattern or as a data warehouse, where data from multiple sources, even of different types, can be stored and utilized later. The usage scenario and use case are as follows:

  • Usage scenario:

    • Data is generated continuously and in large volumes

    • Data needs preprocessing before being loaded into the target system

  • Use case:

    • Machine data captured for subsequent cleansing can be merged into one or more big files and loaded into Hadoop for computation

    • Unstructured data across multiple sources should be captured for subsequent analysis on emerging patterns

    • Data loaded into Hadoop should be processed and filtered; depending on the data, the storage can be a data warehouse, Hadoop, or any NoSQL system

The storage pattern is shown in the following figure:
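The merge-before-load step above can be sketched in plain Python. The file names and the simple sensor CSV format below are hypothetical, and in practice the merged file would then be copied into HDFS (for example, with hdfs dfs -put):

```python
import os
import tempfile

def merge_small_files(input_dir, output_path):
    """Concatenate many small machine-data files into one big file,
    ready to be copied into HDFS (for example with `hdfs dfs -put`)."""
    with open(output_path, "w") as out:
        for name in sorted(os.listdir(input_dir)):
            with open(os.path.join(input_dir, name)) as src:
                for line in src:
                    if line.strip():        # drop blank lines while merging
                        out.write(line)

# Hypothetical demo: three small sensor files merged into one big file.
src_dir, out_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
for i, text in enumerate(["s1,20\n", "s2,21\n\n", "s3,19\n"]):
    with open(os.path.join(src_dir, f"part-{i}.csv"), "w") as f:
        f.write(text)
merged = os.path.join(out_dir, "merged.csv")
merge_small_files(src_dir, merged)
print(open(merged).read())   # three cleaned lines in a single file
```

Merging many small files into a few big ones matters because HDFS is optimized for large files; a cluster burdened with millions of tiny files wastes NameNode memory.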

Big data as a data transformation pattern

Big data systems can be designed to perform transformation as part of the data loading and cleansing activity, and many transformations can be done faster than in traditional systems due to parallelism. Transformation is one phase of the Extract–Transform–Load (ETL) process of data ingestion and cleansing. The usage scenario and use case are as follows:

  • Usage scenario

    • A large volume of raw data to be preprocessed

    • Data types include structured as well as unstructured data

  • Use case

    • ETL (Extract–Transform–Load) tools such as Pentaho and Talend have evolved to leverage big data. In Hadoop, ELT (Extract–Load–Transform) is also trending, as loading is faster in Hadoop, and cleansing and transformation of the input can then run as parallel processes, which is faster

The data transformation pattern is shown in the following figure:
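A minimal sketch of the ELT idea, with the record format and cleansing rules invented for illustration: the raw records are loaded first, and the transform step then cleanses them. On a real cluster each mapper would cleanse its own input split in parallel; a plain map stands in for that here:

```python
# ELT sketch: load raw records first, then transform after loading.
# The three-field "id,name,city" format is purely illustrative.
raw = [" 101 , Alice , NY ", "102,Bob,", "bad-record", "103 , Carol , SF "]

def cleanse(record):
    """Trim whitespace and validate the field count; return None for bad rows."""
    fields = [f.strip() for f in record.split(",")]
    if len(fields) != 3 or not fields[0].isdigit():
        return None
    return fields

# On a cluster, each mapper cleanses its own split in parallel;
# here a sequential map stands in for that step.
cleaned = [r for r in map(cleanse, raw) if r and all(r)]
print(cleaned)   # [['101', 'Alice', 'NY'], ['103', 'Carol', 'SF']]
```

Note the ELT ordering: the malformed and incomplete rows are filtered only after the raw feed has landed, so the load itself stays fast and simple.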

Big data for a data analysis pattern

Data analytics is of wide interest in big data systems, where a huge amount of data can be analyzed to generate statistical reports and insights about the data, which can be useful for business and for understanding patterns. The usage scenario and use case are as follows:

  • Usage scenario

    • Improved response time for detection of patterns

    • Data analysis of unstructured data

  • Use case

    • Fast turnaround for machine data analysis (for example, analysis of seismic data)

    • Pattern detection across structured and unstructured data (for example, fraud analysis)
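The fraud analysis use case maps naturally onto MapReduce's two phases. The sketch below simulates both phases in Python with a made-up transaction list and an illustrative amount threshold; a real job would distribute the map phase across the cluster:

```python
from collections import defaultdict

# Hypothetical (user, amount) transactions for the fraud-analysis example.
transactions = [("u1", 50), ("u2", 9000), ("u1", 30), ("u2", 8700), ("u2", 9500)]

def map_phase(records):
    """Map: emit (user, 1) for every transaction that trips a simple rule."""
    for user, amount in records:
        if amount > 5000:                 # illustrative suspicious-amount threshold
            yield (user, 1)

def reduce_phase(pairs):
    """Reduce: sum the emitted counts per user key."""
    counts = defaultdict(int)
    for user, one in pairs:
        counts[user] += one
    return dict(counts)

flagged = reduce_phase(map_phase(transactions))
print(flagged)   # {'u2': 3} -> u2 had three suspicious transactions
```

The same shape (filter-and-emit in the map, aggregate in the reduce) covers many of the statistical reports this pattern is used for.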

Big data for data in a real-time pattern

Big data systems integrating with streaming libraries and systems are capable of handling high-scale, real-time data processing. Real-time processing for large and complex requirements poses many challenges, such as performance, scalability, availability, resource management, low latency, and so on. Some streaming technologies, such as Storm and Spark Streaming, can be integrated with YARN. The usage scenario and use case are as follows:

  • Usage scenario

    • Managing the action to be taken based on continuously changing data in real time

  • Use case

    • Automated process control based on real-time data from manufacturing equipment

    • Real-time changes to plant operations based on events from business systems such as Enterprise Resource Planning (ERP) systems

The data in a real-time pattern is shown in the following figure:
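Spark Streaming handles a live feed by discretizing it into micro-batches. The following sketch imitates that model on a hypothetical sensor feed from plant equipment; the temperature readings and the control threshold are invented for illustration:

```python
def micro_batches(stream, batch_size):
    """Group an (in principle unbounded) event stream into small batches,
    the way Spark Streaming discretizes a live feed into micro-batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                    # flush the final partial batch
        yield batch

# Hypothetical sensor feed: temperature readings from manufacturing equipment.
readings = [68, 70, 99, 71, 69, 102, 70]
alerts = []
for batch in micro_batches(readings, 3):
    hot = [t for t in batch if t > 95]   # illustrative process-control threshold
    if hot:
        alerts.append(hot)               # an automated control action would fire here

print(alerts)   # [[99], [102]]
```

Each micro-batch is processed as a small, complete job, which is how such systems keep latency low while reusing batch-style processing logic.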

Big data for a low latency caching pattern

Big data systems can be tuned as a special case for a low latency system, where reads are much more frequent than updates; frequently read data can be fetched faster and held in memory, which further improves performance and avoids overheads. The usage scenario and use case are as follows:

  • Usage scenario

    • Reads are far higher in ratio to writes

    • Reads require very low latency and a guaranteed response

    • Distributed location-based data caching

  • Use case

    • Order-promising solutions

    • Cloud-based identity and single sign-on (SSO)

    • Low latency real-time personalized offers on mobile

The low latency caching pattern is shown in the following figure:
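A read-through cache is one common way to realize this pattern: the first read of a key goes to the slow backing store, and repeat reads are served from memory. The sketch below uses Python's functools.lru_cache with a stand-in for the backend; the keys and the upper-casing "store" are purely illustrative:

```python
import functools

backend_reads = {"count": 0}   # track how often the slow store is actually hit

def fetch_from_store(key):
    """Stand-in for a slow backend read (for example HDFS, HBase, or a warehouse)."""
    backend_reads["count"] += 1
    return key.upper()

@functools.lru_cache(maxsize=1024)
def cached_get(key):
    # First read of a key goes to the store; repeats are served from memory.
    return fetch_from_store(key)

# Repeated reads of "offer1" hit the in-memory cache, not the backend.
values = [cached_get(k) for k in ["offer1", "offer2", "offer1", "offer1"]]
print(values, backend_reads["count"])   # 4 reads, only 2 backend hits
```

This works precisely because the pattern assumes reads far outnumber writes; with frequent updates, the cached copies would go stale and invalidation would dominate the design.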

Some of the technology stacks that are widely used, according to layer and framework, are shown in the following figure: