One of the most common and useful kitchen tools is the strainer, also called a sieve, colander, or chinois, whose purpose is to separate solids from liquids during cooking. In this chapter, we will build strainers for the data we find on the Web. We will learn how to create several types of programs that help us find and keep the data we want while discarding the parts we do not want.
In this chapter, we will:
Understand two ways to envision the structure of an HTML page: either (a) as a collection of lines in which we can look for patterns, or (b) as a tree structure containing nodes whose values we can identify and collect.
Try out three methods to parse web pages: one that uses the line-by-line approach (regular-expression-based HTML parsing) and two that use the tree structure approach (Python's BeautifulSoup library and the Chrome browser tool called Scraper).
Implement all three of these techniques on some real...
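The two ways of viewing an HTML page listed above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not one of the chapter's own programs: the sample page, the pattern, and the function names items_by_lines and items_by_tree are hypothetical, and the standard library's xml.etree.ElementTree stands in for a real HTML parser such as BeautifulSoup, which works here only because the sample markup is well formed.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not taken from the chapter).
PAGE = """<html>
<body>
<h1>Fruit prices</h1>
<ul>
<li class="item">apple: 3</li>
<li class="item">pear: 5</li>
</ul>
</body>
</html>"""

# (a) Line-by-line view: treat the page as text and scan each line for a pattern.
ITEM_PATTERN = re.compile(r'<li class="item">(.*?)</li>')


def items_by_lines(page):
    """Return the text captured from every line that matches ITEM_PATTERN."""
    found = []
    for line in page.splitlines():
        match = ITEM_PATTERN.search(line)
        if match:
            found.append(match.group(1))
    return found


# (b) Tree view: parse the page into nodes, then collect values from matching nodes.
# Assumption: ElementTree is an XML parser, so this only works on well-formed markup.
def items_by_tree(page):
    """Return the text of every <li> node whose class attribute is 'item'."""
    root = ET.fromstring(page)
    return [li.text for li in root.iter("li") if li.get("class") == "item"]


lines_result = items_by_lines(PAGE)
tree_result = items_by_tree(PAGE)
print(lines_result)
print(tree_result)
```

Both functions return the same two strings for this sample page. The line-based version depends on each item sitting on its own line in exactly this form, while the tree-based version depends only on the element name and its attribute.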