This chapter covers parsing specific kinds of data, focusing primarily on dates, times, and HTML. Luckily, there are a number of useful libraries to accomplish this, so we don't have to delve into tricky and overly complicated regular expressions.
These libraries complement NLTK well: they can be used to preprocess text before passing it to an NLTK object, or to postprocess text that has been processed and extracted using NLTK. Coming up is an example that ties many of these tools together.
Let's say you need to parse a blog article about a restaurant. You can use lxml or BeautifulSoup to extract the article text, outbound links, and the date and time when the article was written. The date and time can then be parsed to a Python datetime object with dateutil. Once...
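
The extraction steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the HTML snippet, the `published` class name, the `parse_article` helper, and the rule that an outbound link is any `href` starting with `http` are all made up for this example, and BeautifulSoup is used with Python's built-in `html.parser` backend rather than lxml.

```python
from bs4 import BeautifulSoup
from dateutil import parser as date_parser

# Hypothetical blog markup, invented for illustration only.
HTML = """
<html><body>
  <article>
    <h1>Dinner downtown</h1>
    <span class="published">March 3, 2014 7:30 PM</span>
    <p>The pasta was excellent. See the
    <a href="http://example.com/menu">menu</a> and a
    <a href="/about">note about this blog</a>.</p>
  </article>
</body></html>
"""

def parse_article(html):
    """Return the article text, outbound links, and publication datetime."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article")

    # Join the text of every paragraph inside the article.
    text = " ".join(p.get_text(" ", strip=True) for p in article.find_all("p"))

    # Assumption: an outbound link is any absolute http(s) URL.
    links = [a["href"] for a in article.find_all("a", href=True)
             if a["href"].startswith("http")]

    # dateutil turns the free-form date string into a datetime object.
    date_string = article.find("span", class_="published").get_text(strip=True)
    published = date_parser.parse(date_string)

    return {"text": text, "links": links, "published": published}

result = parse_article(HTML)
print(result["links"])
print(result["published"].isoformat())
```

The returned `text` is a plain string, so it could be handed to an NLTK tokenizer or tagger from here. Real blog markup will need different selectors than the ones assumed in this sketch.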