In this chapter, we will build a system that takes a live feed of news articles and groups them together, where the groups have similar topics. You could run the system over several weeks (or longer) to see how trends change over that time.
Our system will start with the popular link aggregation website reddit, which stores lists of links to other websites, as well as a comments section for discussion. Links on reddit are broken into several categories of links, called subreddits. There are subreddits devoted to particular TV shows, funny images, and many other things. What we are interested in is the subreddits for news. We will use the /r/worldnews
subreddit in this chapter, but the code should work with any other subreddit.
In this chapter, our goal is to download popular stories, and then cluster them to see any major themes or concepts that occur. This will give us an insight into the popular focus, without having to manually analyze hundreds of individual stories...