nuno <rails-mailing-list@...> writes:
Hello, I'm looking for an HTML parser that can handle badly formed input
(unclosed tags).
There's a pretty good HTML parser in RoR ActionPack but it doesn't
handle badly formed documents
Thanks
Just a technical point: Unclosed tags are _not_ badly formed in HTML, they are
exactly the _right_ way to do things in HTML. HTML is not supposed to be an
XML-based language, and self-closing tags are invalid.
That said, I agree with the person who said it's better to just treat it as one
long string and regex it.
Consider:
<ul>
<li>a
<li>b
<li>c
<li>d
<ul>
<li>e
<li>f
<li>g
<li>h
(BTW, the OP said 'unclosed tags' not 'self-closing tags' (by which I think you mean empty tags))
More importantly, this illustrates an ambiguity that makes dealing with ill-formed HTML difficult, even with a regex. What was meant? A nested list or two separate lists? Indentation suggests one thing, but a peek in a browser suggests another. But surely the author looked at the page in the browser and saw that it was okay. Right, surely. But with a little CSS, who knows what was seen.
Tools like Tidy will turn that example into:
<ul>
<li>a</li>
<li>b</li>
<li>c</li>
<li>d
<ul>
<li>e</li>
<li>f</li>
<li>g</li>
<li>h</li>
</ul>
</li>
</ul>
which is probably how a browser would interpret it. Some of the other tools will do something similar when parsing it.
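
To make that interpretation concrete, here is a minimal sketch in plain Ruby of the rule involved: a new <li> closes the previous <li> of the same list, and the end of input closes whatever is still open. The name close_list_items is hypothetical, and this is neither Tidy's algorithm nor a real parser. It assumes only <ul>, <ol> and <li> matter, and it ignores attributes' meaning, comments, void elements and stray "<" characters.

```ruby
# Hypothetical helper, for illustration only: closes <li> elements the way
# the nested-list reading implies. Not a general HTML parser.
def close_list_items(html)
  open_tags = []   # stack of currently open element names
  out = []

  # Split the input into tags and runs of text.
  html.scan(/<\/?[a-zA-Z][^>]*>|[^<]+/) do |token|
    if token =~ /\A<\/([a-zA-Z][a-zA-Z0-9]*)/
      name = $1.downcase
      # Explicit end tag: close everything up to and including the match.
      if open_tags.include?(name)
        loop do
          closed = open_tags.pop
          out << "</#{closed}>"
          break if closed == name
        end
      end
    elsif token =~ /\A<([a-zA-Z][a-zA-Z0-9]*)/
      name = $1.downcase
      if name == "li"
        # A new <li> ends the previous <li> of the same list, if one is open.
        scope = open_tags.rindex { |t| %w[ul ol li].include?(t) }
        if scope && open_tags[scope] == "li"
          out << "</#{open_tags.pop}>" while open_tags.length > scope
        end
      end
      open_tags.push(name)
      out << token
    else
      # Text: drop surrounding whitespace so the output is compact.
      text = token.strip
      out << text unless text.empty?
    end
  end

  # End of input closes whatever is still open.
  out << "</#{open_tags.pop}>" until open_tags.empty?
  out.join
end

example = "<ul>\n<li>a\n<li>b\n<li>c\n<li>d\n<ul>\n<li>e\n<li>f\n<li>g\n<li>h\n"
puts close_list_items(example)
```

Note that the sketch hard-codes the nested reading: the second <ul> lands inside the "d" item because nothing closed that item first. If the author meant two separate lists, no amount of tag balancing will recover that.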
Cheers,
Bob