Most of the time, free-form text is found in text files; in this recipe, we will not cover how to read such files, as we have already presented many ways of doing so. (Refer to the set of recipes in Chapter 1, Preparing the Data.)
Often, however, we need to read data straight from the web: we might want to analyze a blog post, scrape an article, or analyze Facebook or Twitter posts. While Facebook and Twitter offer Application Programming Interfaces (APIs) that normally return responses in XML or JSON format, processing HTML files is not as straightforward.
In this recipe, you will learn how to access a web page, read its content, and process it.
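The three steps named above (access, read, process) can be sketched with the standard library alone. This is only a minimal illustration and not the recipe's own code: it uses Python's built-in html.parser in place of html5lib and Beautiful Soup so that it runs without extra installs, and the names TextExtractor, fetch_html, and extract_text are hypothetical helpers introduced here.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def fetch_html(url, timeout=10):
    """Access the page and read its content as a string (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_text(html):
    """Process the HTML: return its visible text joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    # An inline sample is used so the sketch runs without network access.
    sample = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    print(extract_text(sample))  # Title Hello world
```

To try it on a live page, pass the address to fetch_html and hand the result to extract_text. Real-world HTML is often malformed, which is one reason a more forgiving parser can be a better choice than this bare-bones approach.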
To execute this recipe, you will need urllib, html5lib, and Beautiful Soup.
Urllib comes with Python 3 (https://docs.python.org/3/library/urllib.html). If, however, your configuration does not have...