Being able to read HTML and XML files lets us extract content from web pages and load documents or configuration files written in XML.
Python has a built-in XML parser, the xml.etree.ElementTree module, which is well suited to parsing XML files. When HTML is involved, however, it chokes quickly, because HTML allows constructs that are not well-formed XML.
Consider trying to parse the following HTML:
<html>
  <body class="main-body">
    <p>hi</p>
    <img><br>
    <input type="text" />
  </body>
</html>
You will quickly face errors:
xml.etree.ElementTree.ParseError: mismatched tag: line 7, column 6
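The failure is easy to reproduce. The following is a minimal sketch that feeds the markup to ElementTree as a single-line string; the variable names `html` and `error` are illustrative, and the line and column in the message depend on how the markup is laid out, so they will not necessarily match the numbers quoted above.

```python
import xml.etree.ElementTree as ET

# The same markup as above, written on one line.
html = '<html> <body class="main-body"> <p>hi</p> <img><br> <input type="text" /> </body> </html>'

error = None
try:
    ET.fromstring(html)
except ET.ParseError as exc:
    # <img> and <br> are never closed, so the parser still considers <br>
    # open when it reaches </body> and reports a mismatched tag.
    error = exc

print(f"ParseError: {error}")
```

The `<input type="text" />` element is not the problem here, since it is explicitly self-closed; the unclosed `<img>` and `<br>` are what break XML well-formedness.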
Luckily, it's not too hard to adapt the parser so that it handles at least the most common HTML quirks, such as void tags like <img> and <br>, which are never closed.
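One possible way to do this, shown here only as a sketch and not necessarily the approach this article has in mind, is to let the standard library's `html.parser.HTMLParser` do the tokenizing and forward its events to an `ET.TreeBuilder`, closing void elements immediately. The class name `VoidAwareTreeParser` and the `VOID_TAGS` set are hypothetical names introduced for illustration; the set is assumed to follow the void elements listed in the HTML standard.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumption: void elements as listed in the HTML standard.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VoidAwareTreeParser(HTMLParser):  # hypothetical helper, not a stdlib class
    """Build an ElementTree from HTML, auto-closing void elements."""

    def __init__(self):
        super().__init__()
        self._builder = ET.TreeBuilder()

    @staticmethod
    def _attrs_to_dict(attrs):
        # HTMLParser reports valueless attributes (e.g. "disabled") as None.
        return {name: (value if value is not None else "") for name, value in attrs}

    def handle_starttag(self, tag, attrs):
        self._builder.start(tag, self._attrs_to_dict(attrs))
        if tag in VOID_TAGS:
            # Void elements never get an end tag, so close them right away.
            self._builder.end(tag)

    def handle_startendtag(self, tag, attrs):
        # Explicitly self-closed tags such as <input />.
        self._builder.start(tag, self._attrs_to_dict(attrs))
        self._builder.end(tag)

    def handle_endtag(self, tag):
        # Ignore stray end tags for void elements (e.g. </br>).
        if tag not in VOID_TAGS:
            self._builder.end(tag)

    def handle_data(self, data):
        self._builder.data(data)

    def close(self):
        super().close()
        return self._builder.close()  # returns the root Element


html = '<html> <body class="main-body"> <p>hi</p> <img><br> <input type="text" /> </body> </html>'

parser = VoidAwareTreeParser()
parser.feed(html)
root = parser.close()

body = root.find("body")
print([child.tag for child in body])  # ['p', 'img', 'br', 'input']
```

This sketch only deals with void and self-closed tags. Other HTML quirks, such as omitted `</p>` or `</li>` end tags or badly nested elements, would still produce a malformed tree or an error and need additional handling.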