Suppose we want to learn which web domains were linked the most in the Ferguson dataset. We can answer this question by extracting just the domain portion of the URL stored in the tdisplay
column in our ferguson_tweets_urls
table. For our purposes, we will consider everything before the first slash (/
) as the interesting part of the URL.
The following SQL query gives us the domain and count of posts that reference that domain:
SELECT left(tdisplay,locate('/',tdisplay)-1) as url, count(tid) as num FROM ferguson_tweets_urls GROUP BY 1 ORDER BY 2 DESC;
The result of this query is a dataset that looks something like the following (run on the sample set of 1,000 rows):
url |
num |
---|---|
bit.ly |
47 |
wp.me |
32 |
dlvr.it |
18 |
huff.to |
13 |
usat.ly |
9 |
ijreview.com |
8 |
latimes.com |
7 |
gu.com |
7 |
ift.tt |
7 |
This snippet of the dataset shows just the first few rows, but we can already see some of the more popular results are URL-shortening services, such as bit.ly
. We can also...