Download OnlineRetail.csv from the link provided with the book, then load the file using pandas.
The following is a simple way of reading a local file with pandas:
import pandas as pd
path = '/Users/sridharalla/Documents/OnlineRetail.csv'
df = pd.read_csv(path)
However, since we are analyzing data in a Hadoop cluster, we should read the file from HDFS rather than from the local file system. The following is an example of how a file in HDFS can be loaded into a pandas DataFrame:
import pandas as pd
from hdfs import InsecureClient
client_hdfs = InsecureClient('http://localhost:9870')
with client_hdfs.read('/user/normal/OnlineRetail.csv', encoding='utf-8') as reader:
    df = pd.read_csv(reader, index_col=0)
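The HDFS example works because pd.read_csv accepts any file-like object, not only a path. The following minimal sketch illustrates this, and the effect of index_col=0, without needing a cluster. It assumes an in-memory io.StringIO buffer as a stand-in for the HDFS reader, and the rows and column names (order_id, item, quantity) are made up for illustration; they are not taken from OnlineRetail.csv.

```python
import io

import pandas as pd

# Made-up sample rows for illustration; not the contents of OnlineRetail.csv.
csv_text = (
    "order_id,item,quantity\n"
    "1001,mug,6\n"
    "1002,lamp,2\n"
    "1003,clock,1\n"
)

# io.StringIO stands in for the reader object that the HDFS client provides.
with io.StringIO(csv_text) as reader:
    # index_col=0 turns the first CSV column into the DataFrame index.
    sample_df = pd.read_csv(reader, index_col=0)

print(sample_df.index.name)     # order_id
print(list(sample_df.columns))  # ['item', 'quantity']
```

Because the first column becomes the index, it no longer appears among the regular columns. Omit index_col if the first column should remain ordinary data.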
Next, consider what the following line of code does:
df.head(3)
You will get the following result:
It displays the first three rows of the DataFrame.
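To see the behavior of head in isolation, here is a small sketch on a hypothetical five-row DataFrame. The data and the names demo_df, top_three, and default_head are invented for illustration and are unrelated to the retail dataset.

```python
import pandas as pd

# Hypothetical data for illustration only.
demo_df = pd.DataFrame(
    {
        "item": ["mug", "lamp", "clock", "vase", "tray"],
        "quantity": [6, 2, 1, 4, 3],
    }
)

# head(3) returns a new DataFrame holding the first three rows.
top_three = demo_df.head(3)
print(top_three)

# Called without an argument, head() returns the first five rows.
default_head = demo_df.head()
print(len(default_head))
```

Note that head does not modify the original DataFrame; it only returns a view of its first rows for inspection.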
We can now experiment with the data. Enter the following:
len(df)
That should output this:
65499
That is the length of the DataFrame, that is, the number of rows it contains...
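Since "length" and "size" are easy to confuse in pandas, the following sketch contrasts len(df) with the shape and size attributes. It uses a hypothetical three-row DataFrame rather than the retail data, so the numbers shown apply only to this toy example.

```python
import pandas as pd

# Hypothetical data for illustration only.
demo_df = pd.DataFrame(
    {
        "item": ["mug", "lamp", "clock"],
        "quantity": [6, 2, 1],
    }
)

n_rows = len(demo_df)             # number of rows
shape_rows, n_cols = demo_df.shape  # (rows, columns) tuple
n_cells = demo_df.size            # rows * columns

print(n_rows)      # 3
print(n_cols)      # 2
print(n_cells)     # 6
```

In other words, len(df) counts rows only, while the size attribute counts every cell in the DataFrame.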