At this point, we are ready to begin cleaning the JSON file, extracting the details of each tweet that we want to keep in our long-term storage.
Since our motivating question only asks about URLs, we really only need to extract those, along with the tweet IDs. However, for the sake of practice in cleaning, and so that we can compare this exercise to what we did earlier in Chapter 7, RDBMS Cleaning Techniques, with the sentiment140 data set, let's design a small set of database tables as follows:
A tweet table, which only holds information about the tweets
A hashtag table, which holds information about which tweets referenced which hashtags
A URL table, which holds information about which tweets referenced which URLs
A mentions table, which holds information about which tweets mentioned which users
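As a rough illustration of how these four tables might fit together, here is a minimal sketch that uses Python's built-in sqlite3 module with an in-memory database. The table names, column names, and column types below are assumptions made for this example only, and SQLite is used purely so that the sketch is self-contained; the same layout could be created in any other relational database.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an
# illustrative assumption, not a definitive design.
SCHEMA = """
CREATE TABLE tweet (
    tweet_id   INTEGER PRIMARY KEY,
    tweet_text TEXT
);
CREATE TABLE tweet_hashtag (
    tweet_id INTEGER NOT NULL REFERENCES tweet(tweet_id),
    hashtag  TEXT NOT NULL
);
CREATE TABLE tweet_url (
    tweet_id INTEGER NOT NULL REFERENCES tweet(tweet_id),
    url      TEXT NOT NULL
);
CREATE TABLE tweet_mention (
    tweet_id       INTEGER NOT NULL REFERENCES tweet(tweet_id),
    mentioned_user TEXT NOT NULL
);
"""


def create_schema(conn):
    """Create the four tables on the given connection."""
    conn.executescript(SCHEMA)
    conn.commit()


def list_tables(conn):
    """Return the names of all tables in the database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


# An in-memory database keeps the sketch free of side effects.
conn = sqlite3.connect(":memory:")
create_schema(conn)
```

Each of the three linking tables stores one row per tweet and item pair, so a tweet that references two URLs would produce two rows in the hypothetical tweet_url table, both carrying the same tweet_id.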
This is similar to the structure we designed in Chapter 7, RDBMS Cleaning Techniques, except that in that case we had to parse out our own list of hashtags...