One of Spark's advantages is its ability to read data from a variety of data sources. However, the way data is read is not consistent across sources and has changed between Spark versions. This section of the chapter explains how to read CSV and JSON files.
To read CSV data, call spark.read.csv() with the path to the .csv file. Here, we are reading the bank data that was used in the earlier chapters.
We have to make sure that the sep option matches the delimiter that separates the fields in the source data.
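As a minimal sketch of the call described above, the helper below wraps spark.read.csv() and passes the delimiter through the sep option. The function name read_delimited_csv is hypothetical, and the semicolon delimiter and the header row are assumptions about bank.csv that should be checked against the actual file.

```python
def read_delimited_csv(spark, path, sep=","):
    """Read a delimited text file into a Spark DataFrame.

    `spark` is expected to be an existing SparkSession.
    `sep` must match the delimiter used in the source file.
    """
    return spark.read.csv(
        path,
        sep=sep,           # field delimiter in the source data
        header=True,       # assumption: the first row holds column names
        inferSchema=True,  # let Spark guess column types (costs an extra pass)
    )
```

In a notebook where a SparkSession is already available as spark, the bank data could then be loaded with read_delimited_csv(spark, "bank.csv", sep=";"), assuming the file is semicolon-separated. If the columns all end up in a single string column, the sep value most likely does not match the file.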
Now let's perform the following steps to read the data from the bank.csv file:
First, let's import the required packages into the Jupyter notebook:
import os
import pandas as pd
import numpy as np
import collections
from sklearn.base import TransformerMixin
import random
import pandas_profiling
Next, import all the required libraries, as illustrated...