Big Data Analytics with R and Hadoop

By : Vignesh Prajapati

Overview of this book

Big data analytics is the process of examining large amounts of data of a variety of types to uncover hidden patterns, unknown correlations, and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue. New methods of working with big data, such as Hadoop and MapReduce, offer alternatives to traditional data warehousing. Big Data Analytics with R and Hadoop focuses on techniques for integrating R and Hadoop using tools such as RHIPE and RHadoop. With these tools, you can build a powerful data analytics engine that runs analytics algorithms over large-scale datasets in a scalable manner, combining the data analytics operations of R with the MapReduce and HDFS components of Hadoop. You will start with the installation and configuration of R and Hadoop. Next, you will work through various practical data analytics examples with R and Hadoop. Finally, you will learn how to import and export data between various data sources and R. Big Data Analytics with R and Hadoop will also give you an easy understanding of the R and Hadoop connectors RHIPE, RHadoop, and Hadoop streaming.

Credits

About the Author

Acknowledgment

About the Reviewers

www.PacktPub.com

Preface

Getting Ready to Use R and Hadoop

Installing R

Installing RStudio

Understanding the features of R language

Installing Hadoop

Understanding Hadoop features

Learning the HDFS and MapReduce architecture

Understanding Hadoop subprojects

Summary

Writing Hadoop MapReduce Programs

Understanding the basics of MapReduce

Introducing Hadoop MapReduce

Understanding the Hadoop MapReduce fundamentals

Writing a Hadoop MapReduce example

Learning the different ways to write Hadoop MapReduce in R

Summary

Integrating R and Hadoop

Introducing RHIPE

Introducing RHadoop

Summary

Using Hadoop Streaming with R

Understanding the basics of Hadoop streaming

Understanding how to run Hadoop streaming with R

Exploring the HadoopStreaming R package

Summary

Learning Data Analytics with R and Hadoop

Understanding the data analytics project life cycle

Understanding data analytics problems

Summary

Understanding Big Data Analysis with Machine Learning

Introduction to machine learning

Supervised machine-learning algorithms

Unsupervised machine learning algorithm

Recommendation algorithms

Summary

Importing and Exporting Data from Various DBs

Learning about data files as database

Understanding MySQL

Understanding Excel

Understanding MongoDB

Understanding SQLite

Understanding PostgreSQL

Understanding Hive

Understanding HBase

Summary

References

R + Hadoop help materials

R groups

Hadoop groups

R + Hadoop groups

Popular R contributors

Popular Hadoop contributors

Index

Understanding the features of R language

At the time of writing, there are over 3,000 R packages, and the list grows day by day. Explaining all of them is beyond the scope of any book, so this book focuses only on the key features of R and its most frequently used and popular packages.

Using R packages

R packages are self-contained units of R functionality whose contents are invoked as functions. A good analogy would be a .jar file in Java. There is a vast library of R packages available for a very wide range of operations, from statistical operations and machine learning to rich graphic visualization and plotting. Every package consists of one or more R functions. An R package is a reusable entity that can be shared and used by others. R users can install the package that contains the functionality they are looking for and start calling its functions. A comprehensive list of these packages can be found on the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/.
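The install-then-call workflow described above can be sketched roughly as follows. This is a minimal illustration, not taken from the book: the sample vector heights is made up, the install.packages() call is left commented out because it needs network access, and the stats package is used only because it ships with R.

```r
# Installing a package from CRAN (needs network access, so it is commented out;
# "sqldf" is only an example package name).
# install.packages("sqldf")

# Attach a package so its functions can be called by name.
# The stats package ships with R, so no installation is needed.
library(stats)

# Call a function from the attached package on some made-up sample data.
heights <- c(150, 160, 170, 180)
middle <- median(heights)
print(middle)

# A function can also be called without attaching, using package::function.
spread <- stats::sd(heights)
print(spread)

# List the names of the packages installed in the current R libraries.
installed <- rownames(installed.packages())
print(head(installed))
```

The package::function form is handy when two attached packages export functions with the same name.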

Performing data operations

R enables a wide range of operations: statistical operations, such as mean, min, max, probability, distribution, and regression; and machine learning operations, such as linear regression, logistic regression, classification, and clustering. Universal data processing operations are as follows:

Data cleaning: This operation cleans massive datasets
Data exploration: This operation explores all the possible values of datasets
Data analysis: This operation performs descriptive and predictive analytics on data
Data visualization: This operation visualizes the output of the analysis
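As a rough sketch of how these operations look in base R, the following example cleans, explores, summarizes, and plots a tiny dataset. The sales data frame and its column names are invented for illustration; a real project would read its data from a file or database instead.

```r
# A small, made-up data frame with one missing value and one impossible value.
sales <- data.frame(
  region = c("north", "south", "east", "west", "north"),
  units  = c(12, NA, 7, -3, 20),
  stringsAsFactors = FALSE
)

# Data cleaning: drop rows with missing or negative unit counts.
clean <- sales[!is.na(sales$units) & sales$units >= 0, ]

# Data exploration: inspect the structure and the distinct values.
str(clean)
regions <- unique(clean$region)
print(regions)

# Data analysis: simple descriptive statistics.
avg_units <- mean(clean$units)
min_units <- min(clean$units)
max_units <- max(clean$units)
totals <- aggregate(units ~ region, data = clean, FUN = sum)
print(totals)

# Data visualization: bar chart of total units per region.
# In a non-interactive session, R writes the plot to Rplots.pdf.
barplot(totals$units, names.arg = totals$region, main = "Units by region")
```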

To build an effective analytics application, we sometimes need to use an online Application Programming Interface (API) to retrieve the data, analyze it with suitable services, and visualize it with third-party services. Programming is also the most useful feature for automating the data analysis process.

R is itself a programming language for operating on data, and the available packages help to integrate R with other programming features. R supports object-oriented programming concepts. It is also capable of integrating with other programming languages, such as Java, PHP, C, and C++. Several packages act as middle-layer programming features to aid in data analytics, such as sqldf, httr, RMongo, RgoogleMaps, RGoogleAnalytics, and google-prediction-api-r-client.
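To illustrate the object-oriented support mentioned above, here is a minimal sketch using S3, one of the object systems built into R. The account class, its fields, and the deposit generic are hypothetical names chosen for this example only.

```r
# Constructor for a hypothetical "account" class.
# An S3 object is an ordinary list carrying a class attribute.
new_account <- function(owner, balance = 0) {
  structure(list(owner = owner, balance = balance), class = "account")
}

# A generic function that dispatches on the class of its first argument.
deposit <- function(x, amount) UseMethod("deposit")

# The method used when deposit() is called on an "account" object.
deposit.account <- function(x, amount) {
  x$balance <- x$balance + amount
  x  # the object is modified as a copy, so return the updated copy
}

# A method for the built-in print generic.
print.account <- function(x, ...) {
  cat("Account of", x$owner, "with balance", x$balance, "\n")
  invisible(x)
}

acct <- new_account("Asha", 100)
acct <- deposit(acct, 50)
print(acct)
```

Because the method returns an updated copy rather than changing the object in place, the result has to be assigned back to acct.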

Increasing community support

As the number of R users grows, the number of R-related groups is also increasing. R learners and developers can therefore easily connect and get their questions answered with the help of several R groups and communities.

The following are some popular and useful sources:

R mailing list: This is an official R group created by the R project owners.
R blogs: Countless bloggers write about R applications. One of the most popular blog websites is http://www.r-bloggers.com/, which collects posts contributed by many R bloggers.
Stack Overflow: This is a great technical knowledge-sharing platform where programmers can post their technical queries and enthusiastic programmers suggest solutions. For statistics-oriented questions, visit the related Stack Exchange site at http://stats.stackexchange.com/.
Groups: There are many other groups on LinkedIn and Meetup where professionals across the world meet to discuss their problems and innovative ideas.
Books: There are also a lot of books about R. Some of the popular ones are R in Action, by Rob Kabacoff, Manning Publications; R in a Nutshell, by Joseph Adler, O'Reilly Media; R and Data Mining, by Yanchang Zhao, Academic Press; and R Graphs Cookbook, by Hrishi Mittal, Packt Publishing.

Performing data modeling in R

Data modeling is a machine learning technique that identifies hidden patterns in a historical dataset; these patterns then help to predict future values over the same kind of data. These techniques focus heavily on past user actions and learn users' tastes. Many popular organizations have adopted data modeling techniques to understand the behavior of their customers based on their past transactions. The techniques analyze the data and predict what customers are looking for. Amazon, Google, Facebook, eBay, LinkedIn, Twitter, and many other organizations are using data mining to redefine their applications.

The most common data mining techniques are as follows:

Regression: In statistics, regression is a classic technique to identify the scalar relationship between two or more variables by fitting a straight line to the variable values. That relationship helps to predict the variable value for future events. For example, any variable y can be modeled as a linear function of another variable x with the formula y = mx + c. Here, x is the predictor variable, y is the response variable, m is the slope of the line, and c is the intercept. Sales forecasting for products or services and predicting the price of stocks can be achieved through regression. R provides regression via the lm function, which is available by default in R.
Classification: This is a machine learning technique that assigns labels to observations based on a set of labeled training examples. With this, we can classify observations into one or more labels. Predicting the likelihood of sales, online fraud detection, and cancer classification (for medical science) are common applications of classification. Google Mail uses this technique to classify e-mails as spam or not. Classification can be performed with glm, glmnet, ksvm, svm, and randomForest in R.
Clustering: This technique is all about organizing similar items from a given collection into groups. User segmentation and image compression are the most common applications of clustering; others include market segmentation, social network analysis, organizing computer clusters, and astronomical data analysis. Google News uses these techniques to group similar news items into the same category. Clustering can be achieved through the kmeans, dist, pvclust, and Mclust functions in R. The knn function, which is sometimes listed with these, is a nearest-neighbor classifier rather than a clustering method.
Recommendation: Recommendation algorithms are used in recommender systems, which are among the most immediately recognizable applications of machine learning in use today. Web content recommendations may include similar websites, blogs, videos, or related content. Recommendation of online items can also be helpful for cross-selling and up-selling. We have all seen online shopping portals that attempt to recommend books, mobiles, or any items that can be sold on the Web based on the user's past behavior. Amazon is a well-known e-commerce portal that generates 29 percent of sales through recommendation systems. Recommender systems can be implemented via Recommender() from the recommenderlab package in R.
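The first three techniques above can be tried with functions that ship with R. The following is a minimal sketch rather than a recommended modeling workflow: the choice of the built-in mtcars and iris datasets, the predictor columns, and the 0.5 probability cut-off are all assumptions made for illustration. Recommendation is left out because it needs the separately installed recommenderlab package.

```r
# Regression: fit y = m*x + c with lm(), using car weight (wt) to predict
# fuel efficiency (mpg) in the built-in mtcars dataset.
fit <- lm(mpg ~ wt, data = mtcars)
coefs <- coef(fit)  # "(Intercept)" is c and "wt" is the slope m
predicted <- predict(fit, newdata = data.frame(wt = 3))
print(coefs)
print(predicted)

# Classification: logistic regression with glm(), predicting the transmission
# type (am: 0 = automatic, 1 = manual) from car weight.
clf <- glm(am ~ wt, data = mtcars, family = binomial)
prob_manual <- predict(clf, newdata = data.frame(wt = c(2, 4)),
                       type = "response")
labels <- ifelse(prob_manual > 0.5, "manual", "automatic")
print(labels)

# Clustering: group the iris flowers into three clusters with kmeans(),
# using only the four numeric measurement columns.
set.seed(42)  # kmeans starts from random centers, so fix the seed
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

# Compare the discovered clusters with the known species labels.
print(table(clusters$cluster, iris$Species))
```

The species column is not passed to kmeans, so the final table only shows how closely the unsupervised grouping happens to match the known labels.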
