The second main constraint of using Twitter is noise. Whereas most classification models are trained against dozens of different classes, we will be working against hundreds of thousands of distinct hashtags per day. We will therefore focus on popular topics only, meaning the trending topics occurring within a defined batch window. However, because a 15-minute batch on Twitter is not sufficient to detect trends, we will apply a 24-hour moving window in which all hashtags are observed and counted, and only the most popular ones are kept.
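To make the moving window concrete, here is a minimal sketch in plain Python rather than in any particular streaming framework. It assumes hashtags arrive in small timestamped batches; the class name `TrendingHashtags` and its methods `add_batch` and `top` are hypothetical names chosen for illustration, and eviction happens at batch granularity, not per tweet.

```python
from collections import Counter, deque
from datetime import datetime, timedelta


class TrendingHashtags:
    """Hypothetical helper: count hashtags over a moving time window."""

    def __init__(self, window=timedelta(hours=24)):
        self.window = window
        self.batches = deque()   # (batch_time, Counter) pairs, oldest first
        self.totals = Counter()  # aggregated counts over the current window

    def add_batch(self, batch_time, hashtags):
        """Add one batch (e.g. 15 minutes of hashtags) and evict expired ones."""
        counts = Counter(tag.lower() for tag in hashtags)
        self.batches.append((batch_time, counts))
        self.totals.update(counts)
        # Evict every batch that is a full window (or more) older than this one.
        while self.batches and self.batches[0][0] <= batch_time - self.window:
            _, expired = self.batches.popleft()
            self.totals.subtract(expired)
        # Unary plus drops hashtags whose count fell to zero.
        self.totals = +self.totals

    def top(self, n):
        """Return the n most popular hashtags currently in the window."""
        return self.totals.most_common(n)


if __name__ == "__main__":
    start = datetime(2020, 1, 1)
    tracker = TrendingHashtags()
    tracker.add_batch(start, ["#spark", "#spark", "#ai"])
    tracker.add_batch(start + timedelta(minutes=15), ["#ai", "#ai", "#ml"])
    print(tracker.top(2))
```

Keeping one counter per batch means that expiring old data is a single subtraction per batch, so the cost of maintaining the window does not grow with the number of tweets it contains.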
Using this approach, we filter out the noise of unpopular hashtags, which makes our classifier more accurate and scalable, and we significantly reduce the number of articles to fetch, since we only focus on trending URLs mentioned alongside popular topics. This saves a lot of time and resources that would otherwise be spent analyzing irrelevant data (with regards...