
Understanding Hadoop


Before we learn how to use Hadoop (for more information refer to http://hadoop.apache.org/) and related tools in R, we need to understand the basics of Hadoop. For our purposes, it suffices to know that Hadoop comprises two key components: the Hadoop Distributed File System (HDFS) and the MapReduce framework to execute data processing tasks. Hadoop includes many other components for task scheduling, job management, and other functions, but we shall not concern ourselves with them in this book.
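Before going further, it can help to verify that a Hadoop installation is reachable from R. The following is a minimal sketch, assuming Hadoop's command-line client is installed and on the PATH; it simply shells out to the standard hadoop fs command rather than using any R-specific Hadoop package:

  # List the contents of the HDFS root directory from R by shelling
  # out to the Hadoop command-line client. This assumes a working
  # Hadoop installation whose 'hadoop' binary is on the PATH.
  listing <- system2("hadoop", args = c("fs", "-ls", "/"),
                     stdout = TRUE)
  cat(listing, sep = "\n")

If this prints a directory listing rather than an error, HDFS is accessible from your R session.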

HDFS, as the name suggests, is a virtual filesystem that is distributed across a cluster of servers. HDFS stores files in blocks, with a default block size of 128 MB. For example, a 1 GB file is split into eight blocks of 128 MB, which are distributed to different servers in the cluster. Furthermore, to prevent data loss due to server failure, the blocks are replicated. By default, they are replicated three times—there are three copies of each block of data in the cluster, and each copy...
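To make the block arithmetic concrete, the following sketch computes how many blocks a file occupies and how many replicated block copies the cluster stores in total. The file size, block size, and replication factor are the defaults described above, hard-coded for illustration rather than queried from a real cluster:

  # Number of HDFS blocks needed for a 1 GB file with the default
  # 128 MB block size, and the total block copies stored under the
  # default replication factor of 3. Illustrative arithmetic only.
  file_size_mb  <- 1024   # 1 GB
  block_size_mb <- 128    # default HDFS block size
  replication   <- 3      # default HDFS replication factor

  num_blocks <- ceiling(file_size_mb / block_size_mb)  # 8 blocks
  num_copies <- num_blocks * replication               # 24 block copies
  cat(num_blocks, "blocks,", num_copies, "replicated copies\n")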