A SQL join combines two datasets based on a common column. Joins come in handy for deriving additional values by combining multiple tables.
We are going to use Yelp data in this recipe, which Yelp provides for the Yelp Data Challenge. The data is divided into the following six files:
yelp_academic_dataset_business.json
yelp_academic_dataset_review.json
yelp_academic_dataset_user.json
yelp_academic_dataset_checkin.json
yelp_academic_dataset_tip.json
photos (from the photos auxiliary file)
We are going to use this data for multiple purposes across the book. It suits this recipe well because the files are related to one another and lend themselves to joins.
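To make the idea of joining on a common column concrete before we touch the real data, here is a minimal sketch using Python's built-in sqlite3 module rather than Spark. The table names (business, review), the column names (business_id, name, stars), and all of the rows are assumptions made up for illustration; they are not taken from the Yelp files.

```python
import sqlite3

# In-memory database with two tiny tables standing in for two related datasets.
# Table and column names are hypothetical, chosen only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (business_id TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE review (review_id TEXT, business_id TEXT, stars INTEGER)")

# Made-up rows; not real Yelp data.
conn.executemany(
    "INSERT INTO business VALUES (?, ?)",
    [("b1", "Cafe A"), ("b2", "Diner B")],
)
conn.executemany(
    "INSERT INTO review VALUES (?, ?, ?)",
    [("r1", "b1", 5), ("r2", "b1", 3), ("r3", "b2", 4)],
)

# Inner join: pair each review with its business via the common column.
rows = conn.execute(
    """
    SELECT b.name, r.stars
    FROM review r
    JOIN business b ON r.business_id = b.business_id
    ORDER BY r.review_id
    """
).fetchall()

for name, stars in rows:
    print(name, stars)
```

Each review row picks up the business name, a value it does not carry on its own. This is the kind of extra value a join extracts.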
Note
This data is already loaded in the s3a://sparkcookbook/yelpdata Amazon S3 bucket for your convenience. Spark provides a convenient way to access S3 using the s3a prefix. This is not the standard way to access S3 buckets, though. S3 buckets are normally accessed using an HTTP URL. There are a few ways to specify the URL. For...