Corporate websites are usually built by teams or departments using specialized tools and templates. Much of the content is generated on the fly and consists largely of JavaScript and CSS. This means that even if we download the content, we still have to, at the very least, evaluate the JavaScript code. One way to do this from a Python program is with the Selenium API. Selenium's main purpose is actually testing websites, but nothing stops us from using it to scrape them.
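The following sketch shows the general shape of such a scrape: load a page in a browser so its JavaScript runs, then query the rendered DOM. It assumes Selenium 4+ with a locally installed Chrome driver; the URL and the XPath expression are illustrative placeholders, not taken from the book's code bundle.

```python
# Sketch: render a JavaScript-heavy page with a headless browser, then read
# the resulting DOM. Assumes Selenium 4+ and Chrome are installed; the URL
# and XPath passed in main are hypothetical examples.

def fetch_rendered_text(url, xpath):
    """Load url in a headless browser and return the text of matching nodes."""
    # Imported lazily so the module loads even where Selenium is absent.
    from selenium import webdriver            # third-party: pip install selenium
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")    # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)                       # the browser evaluates the page's JavaScript
        return [el.text for el in driver.find_elements(By.XPATH, xpath)]
    finally:
        driver.quit()                         # always release the browser process

if __name__ == "__main__":
    print(fetch_rendered_text("http://localhost:8888/example",
                              "//div[@class='output']"))
```

Because the driver controls a real browser, this is much slower than a plain HTTP request, so it is worth reserving for pages that cannot be scraped statically.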
Instead of scraping a website, we will scrape an IPython Notebook—the test_widget.ipynb file in this book's code bundle. To simulate browsing this web page, we provide a unit test class in test_simulating_browsing.py. In case you were wondering, this is not the recommended way to test IPython Notebooks.
For historical reasons, I prefer using XPath to find HTML elements. XPath is a query language that also works with HTML. It is not the only method; you can also use CSS selectors, tag names, and so on.
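To get a feel for XPath without a browser, here is a minimal, standard-library-only illustration. Python's `xml.etree.ElementTree` supports a limited XPath subset (enough for descendant searches and attribute predicates); Selenium and lxml accept full XPath expressions. The markup and class name below are made up for the example.

```python
# XPath-style querying of HTML-like markup with the standard library.
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div class="output">first result</div>
  <p>an ignored paragraph</p>
  <div class="output">second result</div>
</body></html>
"""

root = ET.fromstring(html)
# ".//div[@class='output']" means: any <div> descendant whose class is "output"
matches = [div.text for div in root.findall(".//div[@class='output']")]
print(matches)  # → ['first result', 'second result']
```

With Selenium, the equivalent query would be passed to `find_elements` with `By.XPATH`, and the expression could use the full XPath language rather than this subset.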