Fetching HTML content
We introduced web scrapers in a previous chapter, using the Goose library recompiled for Scala 2.11. Here we create a method that takes a DStream as input instead of an RDD and keeps only valid text content of at least 500 words. It returns a stream of text alongside the associated hashtags (the popular ones):
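import java.text.SimpleDateFormat
import org.apache.spark.streaming.dstream.DStream

def fetchHtmlContent(tStream: DStream[(String, Array[String])]) = {
  tStream
    // Merge the hashtags seen for the same URL and drop duplicates
    // (the original `_++_.distinct` only de-duplicated the right-hand array)
    .reduceByKey((a, b) => (a ++ b).distinct)
    .mapPartitions { it =>
      // Build the scraper once per partition rather than once per record
      val htmlFetcher = new HtmlHandler()
      val goose = htmlFetcher.getGooseScraper
      val sdf = new SimpleDateFormat("yyyyMMdd")
      it.map { case (url, tags) =>
          val content = htmlFetcher.fetchUrl(goose, url, sdf)
          (content, tags)
        }
        .filter { case (contentOpt, tags) =>
          // Keep only pages whose body exists and has at least 500 words
          contentOpt.isDefined &&
            contentOpt.get.body.isDefined &&
            contentOpt.get.body.get.split("\\s+").length >= 500
        }
        .map { case (contentOpt...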
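The filtering step of this method can be tried out without Spark or Goose. The following is only a minimal sketch under stated assumptions: `Content`, `ContentFilter`, `wordCount`, and `validBodies` are hypothetical names introduced for illustration, and `Content` models nothing but an optional `body`. It replaces the chained `isDefined`/`get` calls with a for-comprehension over `Option`, and trims the text before splitting so that leading whitespace is not counted as an extra word.

```scala
// Hypothetical stand-in for the scraper's result; only the optional body is modelled.
case class Content(body: Option[String])

object ContentFilter {
  // Minimum number of words for a page to be kept (500, as in the text).
  val MinWords = 500

  // Count whitespace-separated tokens, ignoring leading and trailing whitespace.
  def wordCount(text: String): Int = {
    val trimmed = text.trim
    if (trimmed.isEmpty) 0 else trimmed.split("\\s+").length
  }

  // Keep only records whose content and body both exist and are long enough,
  // returning the body text alongside its hashtags.
  def validBodies(
      records: Iterator[(Option[Content], Array[String])]
  ): Iterator[(String, Array[String])] =
    records.flatMap { case (contentOpt, tags) =>
      for {
        content <- contentOpt
        body    <- content.body
        if wordCount(body) >= MinWords
      } yield (body, tags)
    }
}
```

Because `validBodies` works on a plain `Iterator`, a function of this shape could be applied to the iterator received inside `mapPartitions`, which keeps the word-count rule testable in isolation.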