Looking for an HTML parser

nuno <rails-mailing-list@...> writes:

Hello, I'm looking for an HTML parser that can handle badly formed input
(unclosed tags).

There's a pretty good HTML parser in RoR ActionPack, but it doesn't
handle badly formed documents.

Thanks

Just a technical point: Unclosed tags are _not_ badly formed in HTML; they are
exactly the _right_ way to do things in HTML. HTML is not supposed to be an
XML-based language, and self-closing tags are invalid.

That said, I agree with the person who said it's better to just treat it as one
long string and regex it.

Consider:

<ul>
   <li>a
   <li>b
   <li>c
   <li>d
<ul>
   <li>e
   <li>f
   <li>g
   <li>h

(BTW, the OP said 'unclosed tags' not 'self-closing tags' (by which I think you mean empty tags))

More importantly, this illustrates an ambiguity that makes dealing with ill-formed HTML difficult, even with a regex. What was meant: a nested list or two separate lists? The indentation suggests one thing, but a peek in a browser suggests another. But surely the author looked at the page in the browser and saw that it was okay. Right, surely. But with a little CSS, who knows what was seen.

Tools like Tidy will turn that example into:

<ul>
<li>a</li>
<li>b</li>
<li>c</li>
<li>d
<ul>
<li>e</li>
<li>f</li>
<li>g</li>
<li>h</li>
</ul>
</li>
</ul>

which is probably how a browser would interpret it. Some of the other tools will do something similar when parsing it.
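
For anyone curious how a tolerant parser might arrive at that nesting, here is a rough sketch in Python using only the standard library's html.parser. It is not how Tidy or any browser is actually implemented; the class name ListCloser and the helper close_lists are made up for illustration, and the only rule it applies is "a new <li> closes the previous <li> in the same list, and whatever is still open at the end of input gets closed". Attributes are dropped and only the list structure is handled.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ul", "ol"}


class ListCloser(HTMLParser):
    """Hypothetical helper: re-emits markup with inferred end tags."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, innermost last
        self.out = []    # pieces of the normalized output

    def _close_through(self, tag):
        # Pop and close open elements until `tag` itself has been closed.
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # Look upward for an open <li> in the *same* list. Hitting a
            # <ul>/<ol> first means this item starts a fresh (nested) list.
            for open_tag in reversed(self.stack):
                if open_tag in LIST_TAGS:
                    break
                if open_tag == "li":
                    self._close_through("li")
                    break
        self.stack.append(tag)
        self.out.append(f"<{tag}>")  # attributes are ignored in this sketch

    def handle_endtag(self, tag):
        # Honor explicit end tags; ignore stray ones that match nothing open.
        if tag in self.stack:
            self._close_through(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(text)

    def finish(self):
        self.close()  # flush any text still buffered by HTMLParser
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)


def close_lists(html):
    parser = ListCloser()
    parser.feed(html)
    return parser.finish()
```

Feeding it the unclosed example above puts the second <ul> inside the still-open "d" item, which is the same nesting Tidy chose. Note that the sketch never looks at indentation, so it cannot tell you whether two separate lists were intended.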

Cheers,
Bob