We currently have four separate datasets that we are working with, but ultimately we would like to consolidate them into a single dataset. This chapter will focus on paring our datasets down to one.
This section does not require importing any additional PySpark libraries, but a background in SQL joins will come in handy, as we will explore multiple approaches to joining dataframes.
This section will walk through the following steps for joining dataframes in PySpark:
- Execute the following script to rename all field names in `ratings` by appending `_1` to the end of each name:

  ```python
  for i in ratings.columns:
      ratings = ratings.withColumnRenamed(i, i + '_1')
  ```
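To see what the rename loop accomplishes, here is a minimal sketch in plain Python (no Spark session needed): every column name gets a `_1` suffix, so that after the joins below the `ratings` columns cannot collide with identically named columns from the other datasets. The column list here is an assumption based on the standard MovieLens ratings schema.

```python
# Assumed MovieLens ratings schema; the real column list may differ.
columns = ["userId", "movieId", "rating", "timestamp"]

# Same transformation the withColumnRenamed loop performs on the dataframe:
# append "_1" to every column name.
renamed = [c + "_1" for c in columns]
print(renamed)  # ['userId_1', 'movieId_1', 'rating_1', 'timestamp_1']
```

Suffixing every column up front is a common way to avoid ambiguous-column errors when joining dataframes that share field names such as `movieId`.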
- Execute the following script to inner join the `movies` dataset to the `ratings` dataset, creating a new dataframe called `temp1`:

  ```python
  temp1 = ratings.join(movies, ratings.movieId_1 == movies.movieId, how='inner')
  ```
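As a quick sanity check of the join semantics, here is a plain-Python sketch (with hypothetical toy rows, not the real MovieLens files) of what an inner join on `movieId_1 == movieId` keeps: only rows whose key appears on both sides, with the columns of the matching rows merged.

```python
# Hypothetical toy rows standing in for the renamed ratings and the movies data.
ratings_rows = [
    {"movieId_1": 1, "rating_1": 4.0},
    {"movieId_1": 99, "rating_1": 3.0},  # no matching movie: dropped by inner join
]
movies_rows = [{"movieId": 1, "title": "Toy Story"}]

# Inner join: keep only pairs whose keys match, merging their columns.
temp1_rows = [
    {**r, **m}
    for r in ratings_rows
    for m in movies_rows
    if r["movieId_1"] == m["movieId"]
]
print(temp1_rows)  # one merged row; movieId_1 == 99 has no match and is excluded
```

This is why renaming the `ratings` columns first matters: after the merge, `movieId_1` and `movieId` coexist in the result without clashing.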
- Execute the following script to inner join the `temp1` dataset to the `links` dataset, creating...