Understanding Hadoop subprojects


Mahout is a popular, scalable machine-learning library for data mining. It brings together scalable implementations of the most popular machine-learning algorithms for clustering, classification, regression, and statistical modeling, which can be used to build intelligent applications.

Apache Mahout is distributed under a commercially friendly Apache software license. The goal of Apache Mahout is to build a vibrant, responsive, and diverse community to facilitate discussions not only on the project itself but also on potential use cases.

The following are some companies that are using Mahout:

  • Amazon: This is a shopping portal that uses Mahout for personalized recommendations

  • AOL: This is a web portal that uses Mahout for shopping recommendations

  • Drupal: This is a PHP content management system that uses Mahout for open source, content-based recommendations

  • iOffer: This is a shopping portal that uses Mahout's Frequent Pattern Set Mining and collaborative filtering to recommend items to users

  • LucidWorks Big Data: This is a popular analytics firm that uses Mahout for clustering, duplicate document detection, phrase extraction, and classification

  • Radoop: This provides a drag-and-drop interface for Big Data analytics, including Mahout clustering and classification algorithms

  • Twitter: This is a social networking site that uses Mahout's Latent Dirichlet Allocation (LDA) implementation for user-interest modeling and maintains a fork of Mahout on GitHub

  • Yahoo!: This is the world's most popular web service provider, which uses Mahout's Frequent Pattern Set Mining for Yahoo! Mail

    Tip

    The reference links on the Hadoop ecosystem can be found at http://www.revelytix.com/?q=content/hadoop-ecosystem.
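To make the recommendation use cases above concrete, the following is a minimal sketch of a user-based collaborative-filtering recommender built with Mahout's Taste API. The ratings.csv file, the neighborhood size of 10, and user ID 42 are illustrative assumptions, not values from this book:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class UserRecommenderSketch {
        public static void main(String[] args) throws Exception {
            // Each line of ratings.csv: userID,itemID,rating (hypothetical file)
            DataModel model = new FileDataModel(new File("ratings.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            // Treat the 10 most similar users as the neighborhood
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
            // Top 3 item recommendations for user 42
            List<RecommendedItem> items = recommender.recommend(42L, 3);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " : " + item.getValue());
            }
        }
    }

Swapping PearsonCorrelationSimilarity for another UserSimilarity implementation changes the notion of user closeness without touching the rest of the pipeline.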

Apache HBase is a distributed Big Data store for Hadoop. It allows random, real-time read/write access to Big Data. It is designed as a column-oriented data storage model, inspired by Google BigTable.

The following are some companies that use HBase:

  • Yahoo!: This is the world's most popular web service provider, which uses HBase for near-duplicate document detection

  • Twitter: This is a social networking site, which uses HBase for version control storage and retrieval

  • Mahalo: This is a knowledge sharing service, which uses HBase for similar-content recommendation

  • NING: This is a social network service provider, which uses HBase for real-time analytics and reporting

  • StumbleUpon: This is a universal, personalized recommender system that uses HBase for real-time data storage and as a data analytics platform

  • Veoh: This is an online multimedia content sharing platform, which uses HBase for its user profiling system

    Tip

    For Google's BigTable, a distributed storage system for structured data, refer to http://research.google.com/archive/bigtable.html.
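The following is a minimal sketch of HBase's random, real-time read/write access from Java, using the Connection/Table client API of HBase 1.0 and later (earlier releases used HTable instead); the users table, profile column family, and cell values are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Random write: one cell in the 'profile' column family
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
                table.put(put);
                // Random read of the same row
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));
            }
        }
    }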

Hive is a Hadoop-based data warehousing framework developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are translated into Hadoop MapReduce jobs. This allows SQL programmers with no MapReduce experience to use the warehouse, and makes it easier to integrate with business intelligence and visualization tools for real-time query processing.
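As a minimal sketch of how a SQL programmer can use the warehouse, the following submits HiveQL from Java through the HiveServer2 JDBC driver; the localhost:10000 endpoint and the weblogs table are illustrative assumptions:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuerySketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement();
                 // HiveQL is compiled into MapReduce jobs behind the scenes
                 ResultSet rs = stmt.executeQuery(
                         "SELECT url, COUNT(*) AS hits FROM weblogs GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }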

Pig is a Hadoop-based open source platform for analyzing large-scale datasets via its own SQL-like language, Pig Latin. It provides a simple operation and programming interface for massive, complex, data-parallel computation. It is also easier to develop with, and is more optimized and extensible. Apache Pig was developed by Yahoo!, and currently Yahoo! and Twitter are the primary Pig users.
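The following is a minimal sketch of embedding Pig Latin in Java through Pig's PigServer API; the input path, field schema, and output path are illustrative assumptions:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigCountSketch {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE); // run on the Hadoop cluster
            // Pig Latin statements, compiled into MapReduce jobs
            pig.registerQuery("records = LOAD '/data/weblogs' USING PigStorage('\\t') "
                    + "AS (user:chararray, url:chararray);");
            pig.registerQuery("grouped = GROUP records BY user;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(records);");
            pig.store("counts", "/output/user_counts"); // store() triggers the execution
        }
    }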

For developers, the direct use of the Java MapReduce API can be tedious and error-prone, and it also limits flexibility. So, Hadoop provides two solutions, Pig and Hive, that make dataset management and dataset analysis with MapReduce easier; the two are often confused with each other.

Apache Sqoop is a tool for quickly transferring large amounts of data between the Hadoop data processing platform and relational databases, data warehouses, and other non-relational data stores. Apache Sqoop is a bidirectional data tool: it imports data from relational databases into Hadoop HDFS and exports data from HDFS back to relational databases.

It works with most modern relational databases, such as MySQL, PostgreSQL, Oracle, Microsoft SQL Server, and IBM DB2, as well as enterprise data warehouses. The Sqoop extension API provides a way to create new connectors for other database systems, and the Sqoop distribution also ships with connectors for several popular databases. To perform a transfer, Sqoop generates a Hadoop MapReduce job, deriving the necessary schema-creation and transformation logic from the database schema.
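The following is a minimal sketch of a Sqoop 1.x import driven from Java via the Sqoop.runTool entry point (the same arguments can be passed to the sqoop command-line client); the MySQL connection string, credentials, table, and target directory are illustrative assumptions:

    import org.apache.sqoop.Sqoop;

    public class SqoopImportSketch {
        public static void main(String[] args) {
            // Import the 'orders' table from MySQL into HDFS using 4 parallel mappers
            String[] importArgs = {
                "import",
                "--connect", "jdbc:mysql://localhost/sales",
                "--table", "orders",
                "--username", "dbuser",
                "--password", "dbpass",
                "--target-dir", "/user/hadoop/orders",
                "--num-mappers", "4"
            };
            int exitCode = Sqoop.runTool(importArgs);
            System.exit(exitCode);
        }
    }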

Apache ZooKeeper is also a Hadoop subproject, used for managing Hadoop, Hive, Pig, HBase, Solr, and other projects. ZooKeeper is an open source coordination service for distributed applications; it provides synchronization, configuration maintenance, and naming services for distributed applications, built on an atomic broadcast protocol (Zab). Its programming model is deliberately simple: its data model resembles a filesystem directory tree.

ZooKeeper is divided into two parts: the server and the client. In a cluster of ZooKeeper servers, only one acts as the leader, which accepts and coordinates all writes. The rest of the servers are read-only replicas of the leader. If the leader goes down, any other server can be elected to start serving requests. ZooKeeper clients connect to one server in the ZooKeeper service. The client sends requests, receives responses, receives watch events, and sends heartbeats over a TCP connection to its server.

ZooKeeper provides a high-performance coordination service for distributed applications: it is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. All of these kinds of services are used in some form or another by distributed applications, and each time they are reimplemented, a lot of work goes into fixing the inevitable bugs and race conditions. Implementing such services ad hoc also leads to management complexity when the applications are deployed.
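The following is a minimal sketch of the ZooKeeper Java client working with the directory-tree data model described above; the localhost:2181 connection string and the /app/config znode are illustrative assumptions:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperConfigSketch {
        public static void main(String[] args) throws Exception {
            // Connect to a ZooKeeper server; the watcher receives session and znode events.
            // (A production client would wait for the connected event before proceeding.)
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000,
                    event -> System.out.println("event: " + event));
            // Create a znode holding a piece of shared configuration
            zk.create("/app/config", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            // Read it back, setting a watch so we are notified when it changes
            byte[] data = zk.getData("/app/config", true, null);
            System.out.println(new String(data));
            zk.close();
        }
    }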

Apache Solr is an open source enterprise search platform from the Apache Software Foundation. Apache Solr is a highly scalable search engine, supporting distributed search and index replication. It allows you to build web applications with powerful text search, faceted search, real-time indexing, dynamic clustering, database integration, and rich document handling.

Apache Solr is written in Java and runs as a standalone server, serving search results via REST-like HTTP APIs with XML and JSON responses. So, a Solr server can easily be integrated with applications written in other programming languages. Owing to all these features, this search server is used by Netflix, AOL, CNET, and Zappos.
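The following is a minimal sketch of querying Solr's REST-like HTTP API from Java, assuming a Solr 4.x-style core named collection1 on the default port 8983; the query itself is hypothetical:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class SolrQuerySketch {
        public static void main(String[] args) throws Exception {
            String q = URLEncoder.encode("title:hadoop", "UTF-8");
            // The select handler returns results as JSON when wt=json is passed
            URL url = new URL("http://localhost:8983/solr/collection1/select?q=" + q + "&wt=json");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }

Because the interface is plain HTTP, the same request works from any language with an HTTP client; Java applications can alternatively use the SolrJ client library.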

Ambari is very specific to Hortonworks. Apache Ambari is a web-based tool that supports the provisioning, management, and monitoring of Apache Hadoop clusters. Through centralized management, Ambari handles most of the Hadoop components, including HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Sqoop, and HCatalog.

In addition, Ambari can install Kerberos authentication-based security across the Hadoop cluster. It also provides role-based user authentication, authorization, and auditing, and lets administrators manage users through integration with LDAP and Active Directory.
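Ambari also exposes its monitoring and management functions through a REST API. The following is a minimal sketch of listing clusters with it, assuming an Ambari server on the default port 8080 and the default admin:admin credentials; the host name and credentials are assumptions to adapt to a real deployment:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;

    public class AmbariClustersSketch {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://ambari-host:8080/api/v1/clusters");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Ambari's REST API uses HTTP basic authentication
            String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes("UTF-8"));
            conn.setRequestProperty("Authorization", "Basic " + auth);
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }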