Big Data Analytics with Hadoop 3

By: Sridhar Alla


Preface

Apache Hadoop is the most popular platform for big data processing, and can be combined with a host of other big data tools to build powerful analytics solutions. Big Data Analytics with Hadoop 3 shows you how to do just that, by providing insights into the software as well as its benefits with the help of practical examples.

Once you have taken a tour of Hadoop 3's latest features, you will get an overview of HDFS, MapReduce, and YARN, and how they enable faster, more efficient big data processing. You will then move on to learning how to integrate Hadoop with open source tools, such as Python and R, to analyze and visualize data and perform statistical computing on big data. As you become acquainted with all of this, you will explore how to use Hadoop 3 with Apache Spark and Apache Flink for real-time data analytics and stream processing. In addition to this, you will understand how to use Hadoop to build analytics solutions in the cloud and an end-to-end pipeline to perform big data analysis using practical use cases.

By the end of this book, you will be well versed in the analytical capabilities of the Hadoop ecosystem. You will be able to build powerful solutions to perform big data analytics and get insights effortlessly.

Who this book is for

Big Data Analytics with Hadoop 3 is for you if you are looking to build high-performance analytics solutions for your enterprise or business using Hadoop 3's powerful features, or if you’re new to big data analytics. A basic understanding of the Java programming language is required.

What this book covers

Chapter 1, Introduction to Hadoop, introduces you to the world of Hadoop and its core components, namely, HDFS and MapReduce.

Chapter 2, Overview of Big Data Analytics, introduces the process of examining large datasets to uncover patterns in data, generate reports, and gather valuable insights.

Chapter 3, Big Data Processing with MapReduce, introduces MapReduce, the fundamental concept behind most big data computing and processing systems.

Chapter 4, Scientific Computing and Big Data Analysis with Python and Hadoop, provides an introduction to Python and an analysis of big data using Hadoop with the aid of Python packages.

Chapter 5, Statistical Big Data Computing with R and Hadoop, provides an introduction to R and demonstrates how to use R to perform statistical computing on big data using Hadoop.

Chapter 6, Batch Analytics with Apache Spark, introduces you to Apache Spark and demonstrates how to use Spark for big data analytics based on a batch processing model.

Chapter 7, Real-Time Analytics with Apache Spark, introduces the stream processing model of Apache Spark and demonstrates how to build streaming-based, real-time analytical applications.

Chapter 8, Batch Analytics with Apache Flink, covers Apache Flink and how to use it for big data analytics based on a batch processing model.

Chapter 9, Stream Processing with Apache Flink, introduces you to DataStream APIs and stream processing using Flink. Flink will be used to receive and process real-time event streams and store the aggregates and results in a Hadoop cluster.

Chapter 10, Visualizing Big Data, introduces you to the world of data visualization using various tools and technologies such as Tableau.

Chapter 11, Introduction to Cloud Computing, introduces cloud computing and various concepts such as IaaS, PaaS, and SaaS. You will also get a glimpse into the top cloud providers.

Chapter 12, Using Amazon Web Services, introduces you to AWS and the AWS services that are useful for big data analytics, using Elastic MapReduce (EMR) to set up a Hadoop cluster in the AWS cloud.

To get the most out of this book

The examples have been implemented using Scala, Java, R, and Python on 64-bit Linux. You will also need, or be prepared to install, the following on your machine (preferably the latest version):

  • Spark 2.3.0 (or higher)
  • Hadoop 3.1 (or higher)
  • Flink 1.4
  • Java (JDK and JRE) 1.8+
  • Scala 2.11.x (or higher)
  • Python 2.7+/3.4+
  • R 3.1+ and RStudio 1.0.143 (or higher)
  • Eclipse Mars or IntelliJ IDEA (latest)
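
If you want to compare the versions you already have against the minimums in this list, a small Python sketch along the following lines may help. It is only an illustration: the `MINIMUMS` table and the helper names are assumptions made for this example and are not part of the book's code bundle, and the installed versions are supplied by hand rather than detected automatically.

```python
import re

# Minimum versions taken from the list above (the keys are illustrative labels).
MINIMUMS = {
    "spark": "2.3.0",
    "hadoop": "3.1",
    "flink": "1.4",
    "java": "1.8",
    "scala": "2.11",
    "r": "3.1",
}


def parse_version(text):
    """Turn a dotted version such as '2.3.0' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so that '3.1' compares equal to '3.1.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_all(installed):
    """Map each known tool to True/False; tools missing from `installed` are False."""
    return {
        tool: tool in installed and meets_minimum(installed[tool], minimum)
        for tool, minimum in MINIMUMS.items()
    }
```

For example, `check_all({"spark": "2.3.1", "hadoop": "3.0.0"})` marks Spark as satisfied and Hadoop as too old, and reports every tool you did not list as not satisfied.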

Regarding the operating system: Linux distributions are preferable (including Debian, Ubuntu, Fedora, RHEL, and CentOS). For Ubuntu specifically, it is recommended to have a complete 14.04 (LTS) 64-bit (or later) installation, VMware Player 12, or VirtualBox. You can also run the code on Windows (XP/7/8/10) or Mac OS X (10.4.7+).

Regarding hardware configuration: a Core i3 or Core i5 processor (recommended), or a Core i7 for the best results; multicore processing provides faster data processing and scalability. You will need at least 8 GB of RAM (recommended) for standalone mode, at least 32 GB of RAM for a single VM, and more for a cluster. You will also need enough storage to run heavy jobs, depending on the size of the datasets you will be handling, preferably at least 50 GB of free disk space (for standalone mode and the SQL warehouse).
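
As a rough way to check a machine against the disk figure above, the following Python sketch reports the number of CPU cores and the free disk space using only the standard library. The function names and the `MIN_FREE_DISK_GB` constant are assumptions introduced for this example; RAM is left out because the standard library has no portable call for it.

```python
import os
import shutil

# "At least 50 GB of free disk storage", from the hardware notes above.
MIN_FREE_DISK_GB = 50


def free_disk_gb(path="."):
    """Free space, in GB (10**9 bytes), on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 10**9


def hardware_report(path="."):
    """Collect the core count and free disk space, and flag whether the disk is sufficient."""
    free_gb = free_disk_gb(path)
    return {
        "cpu_cores": os.cpu_count(),  # may be None if it cannot be determined
        "free_disk_gb": round(free_gb, 1),
        "disk_ok": free_gb >= MIN_FREE_DISK_GB,
    }
```

Calling `hardware_report("/")` on Linux checks the root filesystem; pass the directory where you plan to keep your datasets if it lives on a different disk.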

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

  1. Log in or register at www.packtpub.com.
  2. Select the SUPPORT tab.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR/7-Zip for Windows
  • Zipeg/iZip/UnRarX for Mac
  • 7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Big-Data-Analytics-with-Hadoop-3. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/BigDataAnalyticswithHadoop3_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "This file, temperatures.csv, is available as a download and once downloaded, you can move it into hdfs by running the command, as shown in the following code."

A block of code is set as follows:

hdfs dfs -copyFromLocal temperatures.csv /user/normal

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

Map-Reduce Framework -- output average temperature per city name
    Map input records=35
    Map output records=33
    Map output bytes=208
    Map output materialized bytes=286

Any command-line input or output is written as follows:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Clicking on the Datanodes tab shows all the nodes."

Note

Warnings or important notes appear like this.

Tip

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.