In this recipe, we use the famous Iris dataset and the Spark NaiveBayes() API to classify/predict which of the three classes of flower a given set of observations belongs to. This is an example of a multi-class classifier, which requires multi-class metrics to measure the fit. The previous recipe used a binary classifier and a binary metric to measure the fit.
- For the Naive Bayes exercise, we use a famous dataset called iris.data, which can be obtained from UCI. The dataset was originally introduced in the 1930s by R. Fisher. The set is a multivariate dataset with flower attribute measurements classified into three groups.
In short, using four measured attributes (columns), we attempt to classify a flower into one of the three classes of Iris (that is, Iris Setosa, Iris Versicolor, or Iris Virginica).
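To make the record layout concrete, here is a minimal sketch in plain Python (not the Spark code used by the recipe) that turns one comma-separated line of iris.data into four numeric features and a numeric class label. The function name parse_iris_line and the 0/1/2 label numbering are assumptions chosen for illustration; the species strings follow the spelling used in the UCI file.

```python
# Assumed mapping from the species strings in iris.data to numeric labels.
# The 0.0/1.0/2.0 numbering is an illustrative choice, not taken from the recipe.
LABELS = {
    "Iris-setosa": 0.0,
    "Iris-versicolor": 1.0,
    "Iris-virginica": 2.0,
}

def parse_iris_line(line):
    """Split one comma-separated record into (label, [four features])."""
    parts = line.strip().split(",")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    # The first four fields are the measurements in cm.
    features = [float(x) for x in parts[:4]]
    # The fifth field is the species name; unknown names raise KeyError.
    label = LABELS[parts[4]]
    return label, features

# Example record in the format used by the file.
label, features = parse_iris_line("5.1,3.5,1.4,0.2,Iris-setosa")
print(label, features)
```

A multi-class classifier such as Naive Bayes would then be trained on pairs of this shape, with the label taking one of three values rather than two.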
We can download the data from here:
https://archive.ics.uci.edu/ml/datasets/Iris/
The column definition is as follows:
- Sepal length in cm
- Sepal width...