Mastering Hadoop 3

By: Chanchal Singh, Manish Kumar

Overview of this book

Apache Hadoop is one of the most popular big data solutions for distributed storage and for processing large chunks of data. With Hadoop 3, Apache promises to provide a high-performance, more fault-tolerant, and highly efficient big data processing platform, with a focus on improved scalability and increased efficiency. With this guide, you’ll understand advanced concepts of the Hadoop ecosystem tools. You’ll learn how Hadoop works internally, discover solutions to real-world use cases, and understand how to secure your cluster. The book then walks you through HDFS, YARN, MapReduce, and Hadoop 3 concepts. You’ll be able to address common challenges such as using Kafka efficiently, designing low-latency, reliable message-delivery Kafka systems, and handling high data volumes. As you advance, you’ll discover how to address major challenges when building an enterprise-grade messaging system, and how to use different stream processing systems along with Kafka to fulfil your enterprise goals. By the end of this book, you’ll have a complete understanding of how components in the Hadoop ecosystem are effectively integrated to implement a fast and reliable data pipeline, and you’ll be equipped to tackle a range of real-world problems in data pipelines.

Presto – introduction


The growing popularity of big data use cases has brought many new technologies and frameworks, each designed with scalability, high throughput, and low latency in mind. Some companies have very large data warehouses storing hundreds of petabytes of data, and that data feeds applications such as machine learning and batch analytics. Technical engineering teams use it to gain insights into the business, which helps improve products and services and opens up new opportunities to generate revenue.

The performance of a data warehouse plays an important role, because fast results support quicker decision-making. A data warehouse should be able to run queries in parallel and return results quickly, helping businesses increase their productivity and profitability. It is also important to monitor the cost of the warehouse, since that cost affects the profitability of the organization too. Hadoop came to...