Cleaning up text is one of the unfortunate but entirely necessary aspects of text processing. When it comes to parsing HTML, you probably don't want to deal with any embedded JavaScript or CSS, and are only interested in the tags and text.
You'll need to install lxml. See the previous recipe or http://lxml.de/installation.html for installation instructions.
We can use the clean_html() function in the lxml.html.clean module to remove unnecessary HTML tags and embedded JavaScript from an HTML string:
>>> import lxml.html.clean
>>> lxml.html.clean.clean_html('<html><head></head><body onload=loadfunc()>my text</body></html>')
'<div><body>my text</body></div>'
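The result is much cleaner and easier to deal with: the onload handler and the empty head element are gone, leaving only the text inside minimal markup.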
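If you also want to drop embedded CSS, one option is to configure a Cleaner instance from the same module instead of calling clean_html() directly. The following is a minimal sketch, not the recipe's own code: the helper name strip_active_content and the sample markup are hypothetical, and the import is guarded because some newer lxml releases ship the cleaner as a separate package (lxml_html_clean), in which case the import may fail.

```python
# Sketch only: strip_active_content and the sample markup are hypothetical.
try:
    from lxml.html.clean import Cleaner
except ImportError:
    # lxml is not installed, or this lxml release ships the cleaner
    # separately (assumption: the lxml_html_clean package is then needed).
    Cleaner = None


def strip_active_content(html):
    """Return html without scripts, style elements, or event-handler attributes.

    Returns None when lxml's cleaner is unavailable.
    """
    if Cleaner is None:
        return None
    # scripts/javascript are on by default; style=True additionally
    # removes <style> elements, which the defaults leave in place.
    cleaner = Cleaner(scripts=True, javascript=True, style=True)
    return cleaner.clean_html(html)


sample = (
    '<html><head><script>alert("hi")</script>'
    '<style>p {color: red}</style></head>'
    '<body onload="loadfunc()"><p>my text</p></body></html>'
)

cleaned = strip_active_content(sample)
print(cleaned)
```

The exact wrapper tags in the output can vary between lxml versions, so it is safer to check for the absence of the unwanted pieces (script, style, onload) than to compare against a fixed string.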