Cleaning and stripping HTML
Cleaning up text is one of the unfortunate but entirely necessary aspects of text processing. When it comes to parsing HTML, you probably don't want to deal with any embedded JavaScript or CSS, and are only interested in the tags and text.
Getting ready
You'll need to install lxml. See the previous recipe or http://lxml.de/installation.html for installation instructions.
How to do it...
We can use the clean_html() function in the lxml.html.clean module to remove unnecessary HTML tags and embedded JavaScript from an HTML string:
>>> import lxml.html.clean
>>> lxml.html.clean.clean_html('<html><head></head><body onload=loadfunc()>my text</body></html>')
'<div><body>my text</body></div>'
The result is much cleaner and easier to deal with.
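For a rough sense of what this kind of cleaning involves, here is a minimal sketch that uses only the standard library's html.parser module. It is not how lxml implements clean_html(), and it covers far fewer cases; the ScriptStripper class and strip_active_content() function are hypothetical names made up for this illustration. The sketch drops script and style blocks along with any attribute whose name starts with "on", and passes everything else through:

```python
from html import escape
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Hypothetical helper: re-emits HTML without script/style blocks or on* attributes."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()  # convert_charrefs=True, so text arrives unescaped
        self.parts = []
        self.skip_depth = 0  # greater than zero while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Drop event-handler attributes such as onload and onclick.
        kept = [(k, v) for k, v in attrs if not k.startswith("on")]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape text because the parser has already decoded entities.
            self.parts.append(escape(data, quote=False))


def strip_active_content(html_text):
    """Return html_text without script/style content and on* attributes."""
    parser = ScriptStripper()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(strip_active_content(
    '<html><head><script>alert(1)</script></head>'
    '<body onload="loadfunc()">my text</body></html>'
))
# prints: <html><head></head><body>my text</body></html>
```

Unlike clean_html(), this sketch does not repair malformed markup, wrap the result in a div, or handle other risky constructs such as javascript: links, so it should be read as an illustration of the idea only.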
How it works...
The lxml.html.clean.clean_html() function parses the HTML string into a tree, then iterates over the tree and removes every node that should be removed. It...