Book Image

Apache Hadoop 3 Quick Start Guide

By : Hrishikesh Vijay Karambelkar
Book Image

Apache Hadoop 3 Quick Start Guide

By: Hrishikesh Vijay Karambelkar

Overview of this book

Apache Hadoop is a widely used distributed data platform. It enables large datasets to be efficiently processed instead of using one large computer to store and process the data. This book will get you started with the Hadoop ecosystem, and introduce you to the main technical topics, including MapReduce, YARN, and HDFS. The book begins with an overview of big data and Apache Hadoop. Then, you will set up a pseudo Hadoop development environment and a multi-node enterprise Hadoop cluster. You will see how the parallel programming paradigm, such as MapReduce, can solve many complex data processing problems. The book also covers the important aspects of the big data software development lifecycle, including quality assurance and control, performance, administration, and monitoring. You will then learn about the Hadoop ecosystem, and tools such as Kafka, Sqoop, Flume, Pig, Hive, and HBase. Finally, you will look at advanced topics, including real time streaming using Apache Storm, and data analytics using Apache Spark. By the end of the book, you will be well versed with different configurations of the Hadoop 3 cluster.
Table of Contents (10 chapters)

Hadoop 3.0 - Background and Introduction

"There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every two days."
– Eric Schmidt of Google, 2010

The world is evolving day by day, from automated call assistance to smart devices taking intelligent decisions, to self-driven decision-making cars to humanoid robots, all driven by processing large amount of data and analyzing it. We are rapidly approaching to the new era of data age. The IDC whitepaper (https://www.seagate.com/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf) on data evolution published in 2017 predicts data volumes to reach 163 zettabytes (1 zettabyte = 1 trillion terabytes) by the year 2025. This will involve digitization of all the analog data that we see between now and then. This flood of data will come from a broad variety of different device types, including IoT devices (sensor data) from industrial plants as well as home devices, smart meters, social media, wearables, mobile phones, and so on.

In our day-to-day life, we have seen ourselves participating in this evolution. For example, I started using a mobile phone in 2000 and, at that time, it had basic functions such as calls, torch, radio, and SMS. My phone could barely generate any data as such. Today, I use a 4G LTE smartphone capable of transmitting GBs of data including my photos, navigation history, and my health parameters from my smartwatch, on different devices over the internet. This data is effectively being utilized to make smart decisions.

Let's look at some real-world examples of big data:

  • Companies such as Facebook and Instagram are using face recognition tools to identify photos, classify them, and bring you friend suggestions by comparison
  • Companies such as Google and Amazon are looking at human behavior based on navigation patterns and location data, providing automated recommendations for shopping
  • Many government organizations are analyzing information from CCTV cameras, social media feeds, network traffic, phone data, and bookings to trace criminals and predict potential threats and terrorist attacks
  • Companies are using sentiment analysis from message posts and tweets to improve the quality of their products, as well as brand equities, and have targeted business growth
  • Every minute, we send 204 million emails, view 20 million photos on Flickr, perform 2 million searches on Google, and generate 1.8 million likes on Facebook (Source)

With this data growth, the demands to process, store, and analyze data in a faster and scalable manner will arise. So, the question is: are we ready to accommodate these demands? Year after year, computer systems have evolved and so has storage media in terms of capacities; however, the capability to read-write byte data is yet to catch up with these demands. Similarly, data coming from various sources and various forms needs to be correlated together to make meaningful information. For example, with a combination of my mobile phone location information, billing information, and credit card details, someone can derive my interests in food, social status, and financial strength. The good part is that we see a lot of potential of working with big data. Today, companies are barely scratching the surface; however, we are still struggling to deal with storage and processing problems unfortunately.

This chapter is intended to provide the necessary background for you to get started on Apache Hadoop. It will cover the following key topics:

  • How it all started
  • What Apache Hadoop is and why it is important
  • How Apache Hadoop works
  • Hadoop 3.0 releases and new features
  • Choosing the right Hadoop distribution