Spark for Data Science

By: Srinivas Duvvuri, Bikramaditya Singhal

Overview of this book

This is the era of Big Data. The words ‘Big Data’ imply big innovation and enable a competitive advantage for businesses. Apache Spark was designed to perform Big Data analytics at scale, so it is equipped with the necessary algorithms and supports multiple programming languages. Whether you are a technologist, a data scientist, or a beginner to Big Data analytics, this book will provide you with the skills necessary to perform statistical data analysis, data visualization, and predictive modeling, and to build scalable data products or solutions using Python, Scala, and R. With ample case studies and real-world examples, Spark for Data Science will help you ensure the successful execution of your data science projects.

Machine learning


SparkR provides wrappers around existing MLlib functions. R formulas are implemented as MLlib feature transformers. A transformer is an ML pipeline (spark.ml) stage that takes a DataFrame as input and produces another DataFrame as output, generally with some columns appended. Feature transformers are a type of transformer that converts input columns to feature vectors and appends these vectors to the source DataFrame. For example, in linear regression, string input columns are one-hot encoded and numeric values are converted to doubles. A label column is appended (if not already present in the DataFrame) as a replica of the response variable.
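
To make the transformer idea concrete, here is a minimal plain-Python sketch of the behavior described above. It is not Spark's implementation: the function name formula_transform, the sample rows, and the column names are hypothetical, and details such as the ordering of categories or whether one level is dropped during one-hot encoding are simplifying assumptions. It only shows the shape of the operation: rows go in, and the same rows come out with a features column and a label column appended.

```python
# Plain-Python illustration of a feature transformer (hypothetical helper,
# NOT Spark's RFormula code): input rows -> same rows plus appended columns.

def formula_transform(rows, response, predictors):
    """Return copies of the rows with 'features' and 'label' columns appended."""
    # Collect the distinct levels of every string-valued predictor.
    # Assumption: levels are ordered alphabetically and none is dropped.
    levels = {}
    for col in predictors:
        if isinstance(rows[0][col], str):
            levels[col] = sorted({row[col] for row in rows})

    out = []
    for row in rows:
        features = []
        for col in predictors:
            if col in levels:
                # One-hot encode: 1.0 at the position of this row's level.
                features.extend(
                    1.0 if row[col] == lvl else 0.0 for lvl in levels[col]
                )
            else:
                # Numeric predictors are converted to doubles (Python floats).
                features.append(float(row[col]))
        new_row = dict(row)             # the source columns are kept
        new_row["features"] = features  # the appended feature vector
        # The label replicates the response, unless a label already exists.
        new_row.setdefault("label", float(row[response]))
        out.append(new_row)
    return out


# Hypothetical sample data, loosely analogous to the R formula: mpg ~ cyl + gear
rows = [
    {"mpg": 21, "cyl": 6, "gear": "four"},
    {"mpg": 33, "cyl": 4, "gear": "five"},
]
transformed = formula_transform(rows, response="mpg", predictors=["cyl", "gear"])
for row in transformed:
    print(row)
```

The input rows are left unmodified and each output row carries the original columns plus the new ones, which mirrors the "DataFrame in, DataFrame with appended columns out" contract of a pipeline stage.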

In this section, we cover example code for the Naive Bayes and Gaussian GLM models. We do not explain the models themselves or the summaries they produce. Instead, we go straight to how to use them in SparkR.

The Naive Bayes model

The Naïve Bayes model is an intuitively simple model that works with categorical...