A common task when parsing HTML is extracting links. This is one of the core functions of every general web crawler. There are a number of Python libraries for parsing HTML, and lxml is one of the best. As you'll see, it comes with some great helper functions geared specifically towards link extraction.
lxml is a Python binding for the C libraries libxml2 and libxslt. This makes it a very fast XML and HTML parsing library, while still being Pythonic. However, it also means the C libraries must be installed for it to work. Installation instructions are available at http://lxml.de/installation.html. If you're running Ubuntu Linux, installation is as easy as sudo apt-get install python-lxml. You can also try pip install lxml. The latest version as of this writing is 3.3.5.
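After installing, it can help to confirm that lxml imports cleanly and to see which versions of the underlying C libraries it is using. The following is a minimal sketch; the helper name lxml_versions is hypothetical, chosen only for illustration, while the version constants come from lxml.etree.

```python
def lxml_versions():
    """Return the lxml, libxml2, and libxslt versions as tuples.

    Returns None if lxml cannot be imported, which usually means the
    package or the C libraries it depends on are not installed.
    (Hypothetical helper name, used only for this illustration.)
    """
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "lxml": etree.LXML_VERSION,
        "libxml2": etree.LIBXML_VERSION,
        "libxslt": etree.LIBXSLT_VERSION,
    }


if __name__ == "__main__":
    versions = lxml_versions()
    if versions is None:
        print("lxml is not installed")
    else:
        # Each version is a tuple of integers, so join it with dots.
        for name, version in versions.items():
            print(name, ".".join(str(part) for part in version))
```

If the script reports that lxml is not installed, revisit the installation instructions linked above before continuing.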