Html5lib

Html5lib is a library for parsing and serializing HTML documents and fragments in Python, with ports to Dart, PHP, and Ruby.

Lxml

Lxml is a full-featured, high-performance Python library for processing XML and HTML.



Faster parser broken

Example

"The standard html.parser option handles broken HTML less well than the other options, while the html5lib option is closest to how a modern browser would handle broken HTML, albeit at a slower rate than lxml would handle HTML parsing."

from question  

Beautiful Soup Remove Tag Error

"Lxml is the faster parser and can handle broken HTML quite well; html5lib comes closest to how your browser would parse broken HTML but is a lot slower."

from question  

BeautifulSoup: how to ignore spurious end tags
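The trade-off described in the quotes above can be checked by feeding the same broken markup to each parser and comparing what comes back. The following is a minimal sketch, assuming Beautiful Soup (bs4) is installed; the sample markup and the helper name compare_parsers are made up for illustration, and parsers that are not installed are skipped.

```python
from importlib.util import find_spec

# Hypothetical sample of broken markup: unclosed <p>, stray </b>.
BROKEN = "<p>Unclosed paragraph<li>Stray item</b>"


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: repaired_markup} for every parser that is usable.

    Returns an empty dict when Beautiful Soup itself is not installed.
    """
    if find_spec("bs4") is None:
        return {}
    from bs4 import BeautifulSoup, FeatureNotFound  # third-party

    results = {}
    for name in parsers:
        try:
            # str(soup) shows how this parser chose to repair the markup.
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The parser library (lxml or html5lib) is not installed.
            continue
    return results


if __name__ == "__main__":
    for parser, repaired in compare_parsers(BROKEN).items():
        print(f"{parser:12} {repaired}")
```

The repaired output typically differs between parsers, so it is worth printing all of them for the actual document at hand rather than relying on a general ranking.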

Others

Example

The html5lib parser does a better job than lxml or html.parser at handling the debate element in this case.

from question  

How do I remove a spurious tag in BeautifulSoup

Try installing lxml, which is more lenient and much faster; if that doesn't work, html5lib is your best bet, as it is the most lenient but also the slowest.

from question  

BeautifulSoup drops text when fixing up broken markup
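The "try lxml first, then html5lib" advice can be partly automated. The sketch below covers only the availability half of it: it picks the first parser in a preference order whose library is importable. The names pick_parser, make_soup, and PARSER_MODULES are hypothetical helpers, not part of Beautiful Soup; whether a given parser's output is acceptable for a particular document still has to be judged by inspecting the result.

```python
from importlib.util import find_spec

# Beautiful Soup parser name -> module that must be importable for it to work.
PARSER_MODULES = {
    "lxml": "lxml",
    "html5lib": "html5lib",
    "html.parser": "html.parser",  # standard library, always present
}


def pick_parser(preferred=("lxml", "html5lib", "html.parser")):
    """Return the first parser name in `preferred` whose library is installed."""
    for name in preferred:
        if find_spec(PARSER_MODULES[name]) is not None:
            return name
    raise RuntimeError("none of the requested parsers is available")


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse `markup` with the best available parser (requires bs4)."""
    from bs4 import BeautifulSoup  # third-party, imported lazily

    return BeautifulSoup(markup, pick_parser(preferred))
```

To favour leniency over speed, the order can be reversed, for example make_soup(markup, preferred=("html5lib", "lxml", "html.parser")).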

The lxml parser is generally faster; html5lib is the most lenient one. This kind of difference is relevant if you have broken or non-well-formed HTML to parse.

from question  

Python beautifulsoup : lxml html.parser

Data comes from Stack Exchange with CC-BY-SA-4.0