Big Data Analytics with R

By: Simon Walkowiak

Overview of this book

Big Data analytics is the process of examining large and complex data sets that often exceed available computational capabilities. R is a leading programming language for data science, with powerful functions for tackling a wide range of problems related to Big Data processing. The book begins with a brief introduction to the Big Data world and its current industry standards, followed by an introduction to the R language that presents its development, structure, real-world applications, and shortcomings. It then progresses to a revision of major R functions for data management and transformation. Readers are introduced to cloud-based Big Data solutions (e.g. Amazon EC2 instances and Amazon RDS, Microsoft Azure and its HDInsight clusters) and given guidance on R connectivity with relational and non-relational databases such as MongoDB and HBase. The book then expands to cover Big Data tools such as the Apache Hadoop ecosystem, HDFS, and the MapReduce framework, as well as other R-compatible tools such as Apache Spark, its machine learning library Spark MLlib, and H2O.

Naive Bayes with H2O on Hadoop with R

The growing number of machine learning applications in data science has led to the development of several Big Data predictive analytics tools, as described in the first part of this chapter. Even more exciting for R users, some of these tools connect well with the R language, allowing data analysts to use R to deploy and evaluate machine learning algorithms on massive datasets. One such Big Data machine learning platform is H2O: open-source, highly scalable, and fast data exploration and machine learning software developed and maintained by the California-based start-up H2O.ai (formerly known as 0xdata). As H2O is designed to integrate easily with cloud computing platforms such as Amazon EC2 or Microsoft Azure, it has become an obvious choice for large businesses and organisations wanting to implement powerful machine and statistical learning models on massively scalable in-house or cloud-based infrastructures.

Running an H2O instance on Hadoop...