First, we will create the boilerplate code for Spark configuration and the Spark session:
SparkConf conf = new SparkConf().setAppName("OnlineRetailApp").setMaster("local[*]"); // app name and master are placeholders
SparkSession session = SparkSession.builder().config(conf).getOrCreate();
Next, we will load the dataset and find the number of rows in it:
Dataset<Row> rawData = session.read().csv("data/retail/Online_Retail.csv");
System.out.println("Number of rows --> " + rawData.count());
This will print the number of rows in the dataset as:
Number of rows --> 541909
As you can see, this is not a very small dataset, but it is not big data either; big data can run into terabytes. Now that we have seen the number of rows, let's look at the first few rows.
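Spark's standard `show` method prints the first rows of a Dataset as an ASCII table; a minimal sketch continuing from the `rawData` dataset loaded above (the row limit of 5 is an arbitrary choice):

```java
// Print the first five rows of the dataset.
// show(n) renders the rows as an ASCII table and truncates wide cells by default.
rawData.show(5);
```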
This will print the result as:
As you can see, this dataset is a list of transactions, including the country from which each transaction was made. If you look at the columns of the table, however, Spark has assigned default names to the dataset columns. In order to provide a schema and better structure...