Looking for an HTML parser

nuno <rails-mailing-list@...> writes:

Hello, I'm looking for an HTML parser that can handle badly formed input
(unclosed tags).

There's a pretty good HTML parser in RoR ActionPack, but it doesn't
handle badly formed documents.


Just a technical point: Unclosed tags are _not_ badly formed in HTML, they are
exactly the _right_ way to do things in HTML. HTML is not supposed to be an
XML-based language, and self-closing tags are invalid.

That said, I agree with the person who said it's better to just treat it as one
long string and regex it.
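
A minimal sketch of that "one long string" approach, for illustration only. The sample markup and the helper name extract_links are made up here, not taken from the thread, and the pattern only handles quoted href values:

```ruby
# Hypothetical sample input with unclosed <p> and <li> tags.
html = <<~HTML
  <p>See <a href="http://example.com/one">one</a>
  <ul>
  <li><a href='/two'>two</a>
  <li><A HREF="/three">three</A>
  </ul>
HTML

# Hypothetical helper: collect the href value of every anchor tag.
# The regex never looks at tag nesting, so unclosed tags do not matter.
def extract_links(html)
  html.scan(/<a\s[^>]*?href\s*=\s*["']([^"']*)["']/i).flatten
end

links = extract_links(html)
puts links
```

This works as long as you only need flat facts such as link targets. Anything that depends on which element contains which is where a regex starts to struggle.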



(BTW, the OP said 'unclosed tags' not 'self-closing tags' (by which I think you mean empty tags))

More importantly, this illustrates an ambiguity that makes dealing with ill-formed HTML difficult, even with a regex. What was meant? A nested list or two separate lists? Indentation suggests one thing, but a peek in a browser another. But surely the author looked at the page in the browser and saw that it was okay. Right, surely. But with a little CSS who knows what was seen.

Tools like Tidy will turn that example into:


which is probably how a browser would interpret it. Some of the other tools will do something similar when parsing it.
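
To make the kind of repair concrete, here is a rough Ruby sketch that fills in omitted </li> tags. It is not Tidy's algorithm and not the example from this thread. The helper name close_list_items and the sample input are hypothetical, and it only knows about ul, ol and li:

```ruby
# Hypothetical helper: insert the </li> end tags that HTML lets authors omit.
# Assumption: a new <li> or a closing </ul>/</ol> ends the innermost open <li>.
def close_list_items(html)
  out = []
  stack = [] # names of open ul/ol/li elements, innermost last

  # Split the input into tags and runs of text.
  html.scan(/<\/?[a-z]+[^>]*>|[^<]+/i) do |token|
    case token
    when /\A<li\b/i
      # A new item implicitly closes the previous item of the same list.
      if stack.last == "li"
        out << "</li>"
        stack.pop
      end
      stack.push("li")
    when /\A<(ul|ol)\b/i
      # A list opened inside an open <li> is treated as nested in that item.
      stack.push($1.downcase)
    when /\A<\/(ul|ol)>/i
      # Closing a list also closes its last open item.
      if stack.last == "li"
        out << "</li>"
        stack.pop
      end
      stack.pop if stack.last == $1.downcase
    when /\A<\/li>/i
      stack.pop if stack.last == "li"
    end
    out << token
  end

  out.join
end

fixed = close_list_items("<ul><li>a<ul><li>b</ul><li>c</ul>")
puts fixed
```

Note the choice baked into the second branch: the inner list always ends up nested inside the open item. That is one reading of the markup, which is exactly the ambiguity described above.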