Apache Spark for Data Science Cookbook
In this recipe, we will see how to analyze the distribution of the variables in a dataset. Typically, we would plot a histogram or boxplot of each variable to understand its distribution and to identify outliers. However, Spark currently has no support for plotting data, so let's see how to perform the analysis by generating frequency tables instead.
To step through this recipe, you need a machine running Ubuntu 14.04 (a Linux flavor) with Apache Hadoop 2.6 and Apache Spark 1.6.0 installed.
Let's take loan prediction data as an example. Here is what the sample data looks like:

Download the data from the following location: https://github.com/ChitturiPadma/datasets/blob/master/Loan_Prediction_Data.csv.
The total record count is 614.
Here is the code to generate the frequency distribution of a set of variables such as Loan_Status and Credit_History...
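The recipe's own listing is cut off here, so the following is not the book's code. It is a minimal, Spark-independent sketch in plain Python of what a frequency table computes: a one-way count of each distinct value in a column, and a two-way count across two columns. The rows are hypothetical and made up for illustration (they are not taken from the loan dataset), and the helper names `frequency_table` and `cross_table` are assumptions. In Spark, the one-way count corresponds roughly to grouping a DataFrame by the column and counting each group.

```python
from collections import Counter

# Hypothetical rows for illustration only; NOT taken from the real dataset.
rows = [
    {"Loan_Status": "Y", "Credit_History": "1"},
    {"Loan_Status": "N", "Credit_History": "0"},
    {"Loan_Status": "Y", "Credit_History": "1"},
    {"Loan_Status": "N", "Credit_History": "1"},
    {"Loan_Status": "Y", "Credit_History": ""},  # empty string = missing value
]


def frequency_table(records, column):
    """Count how often each distinct value of `column` occurs.

    Empty values are grouped under the label "missing".
    """
    return Counter(r[column] or "missing" for r in records)


def cross_table(records, row_col, col_col):
    """Count co-occurrences of values in two columns (two-way frequency table)."""
    return Counter(
        (r[row_col] or "missing", r[col_col] or "missing") for r in records
    )


loan_status_freq = frequency_table(rows, "Loan_Status")
credit_by_status = cross_table(rows, "Credit_History", "Loan_Status")

# Print the one-way table, one value per line.
for value, count in sorted(loan_status_freq.items()):
    print(value, count)

# Print the two-way table as (Credit_History, Loan_Status) pairs.
for (credit, status), count in sorted(credit_by_status.items()):
    print(credit, status, count)
```

Grouping missing values under their own label keeps them visible in the table, which matters when the goal is to spot data-quality issues as well as the distribution itself.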