
Understanding Hadoop features


Hadoop is designed around two core components: HDFS and MapReduce, both of which relate to distributed computation. MapReduce is regarded as the heart of Hadoop, as it performs parallel processing over distributed data.

Let us look at Hadoop's core features in more detail:

  • HDFS

  • MapReduce

Understanding HDFS

HDFS is Hadoop's own rack-aware filesystem, a UNIX-based data storage layer derived from concepts of the Google File System. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel, close to their data. On HDFS, data files are stored as sequences of blocks that are replicated across the cluster. A Hadoop cluster scales its computation capacity, storage capacity, and I/O bandwidth simply by adding commodity servers. HDFS can be accessed from applications in many different ways; natively, it provides a Java API for applications to use.
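
As a rough illustration of the Java API mentioned above, the following minimal sketch writes a small file to HDFS and reads it back. The filesystem URI, the file path, and the class name are placeholder assumptions rather than values from this book, and the configuration key differs between versions (fs.default.name on Hadoop 1.x, fs.defaultFS on Hadoop 2.x).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client sketch; the NameNode URI and file path are placeholders.
public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the client at the NameNode (on Hadoop 1.x the key is fs.default.name).
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/sample.txt");

    // Write a small file; HDFS splits it into blocks and replicates them.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the same file back through the same API.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }

    fs.close();
  }
}

Running such a class only requires the Hadoop client libraries on the classpath; HDFS itself takes care of block placement and replication across the DataNodes.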

The Hadoop clusters at Yahoo! span 40,000 servers and store 40 petabytes of application data, with the largest Hadoop cluster being 4,000 servers. Also, one hundred other organizations worldwide are known to use Hadoop.

Understanding the characteristics of HDFS

Let us now look at the characteristics of HDFS:

  • Fault tolerant

  • Runs on commodity hardware

  • Able to handle large datasets

  • Master-slave paradigm

  • Write-once file access only

Understanding MapReduce

MapReduce is a programming model for processing large datasets that are distributed across a large cluster, and it is the heart of Hadoop. Its programming paradigm allows massive data processing to be performed across the thousands of servers that make up a Hadoop cluster. It is derived from Google's MapReduce.

Hadoop MapReduce is a software framework that makes it easy to write applications that process large amounts of data (multiterabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The MapReduce paradigm is divided into two phases, Map and Reduce, which mainly deal with key-value pairs of data. The Map and Reduce tasks run sequentially in a cluster: the output of the Map phase becomes the input of the Reduce phase. These phases are explained as follows, with a minimal code sketch of both after the list:

  • Map phase: Once the input dataset has been divided, the splits are assigned to task trackers to perform the Map phase. The map operation is applied to each record, emitting the mapped key-value pairs as the output of the Map phase.

  • Reduce phase: The master node then collects the answers to all the subproblems and combines them to form the final output, which is the answer to the problem it was originally trying to solve.
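
To make the two phases concrete, here is a minimal word-count sketch written against Hadoop's Java MapReduce API: the mapper emits a (word, 1) pair for every token it reads, and the reducer sums the values grouped under each word. The class names TokenizerMapper and IntSumReducer are illustrative, not taken from this book; in practice each class would usually live in its own file or be nested inside the job class.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for each input line, emit (word, 1) key-value pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);               // (k2, v2) = (word, 1)
    }
  }
}

// Reduce phase: sum the values grouped under each word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum)); // (k3, v3) = (word, count)
  }
}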

The five common steps of parallel computing are as follows (a sketch of the job driver that ties these steps together appears after the list):

  1. Preparing the Map() input: This takes the input data row by row and emits one key-value pair per row, although this behavior can be changed explicitly as required.

    • Map input: list (k1, v1)

  2. Run the user-provided Map() code

    • Map output: list (k2, v2)

  3. Shuffle the Map output to the Reduce processors: Pairs with the same key are grouped together and sent to the same reducer.

  4. Run the user-provided Reduce() code: This phase runs the custom reducer code written by the developer over the shuffled data and emits new key-value pairs.

    • Reduce input: (k2, list(v2))

    • Reduce output: (k3, v3)

  5. Produce the final output: Finally, the master node collects all the reducer outputs, combines them, and writes them to a text file.
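
Assuming the TokenizerMapper and IntSumReducer classes sketched earlier, the following hypothetical driver shows roughly how these five steps map onto a Hadoop job: splitting the input (step 1) and shuffling by key (step 3) are handled by the framework, while steps 2 and 4 are plugged in as the user-provided classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver sketch: wires the Map and Reduce phases together into one job.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // On Hadoop 1.x, use new Job(conf, "word count") instead.
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(TokenizerMapper.class);  // step 2: user-provided Map() code
    job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);   // step 4: user-provided Reduce() code

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // HDFS input and output paths are passed on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // step 5: final output

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The job would typically be packaged into a JAR and submitted with the hadoop jar command, passing the input and output HDFS paths as arguments.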

    Tip

    The reference paper on the Google File System can be found at http://research.google.com/archive/gfs.html, and the Google MapReduce paper can be found at http://research.google.com/archive/mapreduce.html.