Looking for an HTML parser

nuno <rails-mailing-list@...> writes:

Hello, I'm looking for an HTML parser that can handle badly formed input (unclosed tags).

There's a pretty good HTML parser in RoR ActionPack, but it doesn't handle badly formed documents.

Thanks

Just a technical point: Unclosed tags are _not_ badly formed in HTML, they are exactly the _right_ way to do things in HTML. HTML is not supposed to be an XML-based language, and self-closing tags are invalid.

That said, I agree with the person who said it's better to just treat it as one long string and regex it.

Consider:

<ul>
    <li>a
    <li>b
    <li>c
    <li>d
<ul>
    <li>e
    <li>f
    <li>g
    <li>h

(BTW, the OP said 'unclosed tags' not 'self-closing tags' (by which I think you mean empty tags))

More importantly, this illustrates an ambiguity that makes dealing with ill-formed HTML difficult, even with a regex. What was meant: a nested list or two separate lists? The indentation suggests one thing, but a peek in a browser suggests another. But surely the author looked at the page in the browser and saw that it was okay. Right, surely. But with a little CSS, who knows what was seen.

Tools like Tidy will turn that example into:

<ul>
    <li>a</li>
    <li>b</li>
    <li>c</li>
    <li>d
        <ul>
            <li>e</li>
            <li>f</li>
            <li>g</li>
            <li>h</li>
        </ul>
    </li>
</ul>

which is probably how a browser would interpret it. Some of the other tools will do something similar when parsing it.
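
For what it's worth, the implied-end-tag rule behind that output can be sketched in a few lines of Ruby. This is only an illustration, not how Tidy or any browser is actually implemented. The method name close_list_items is made up for this example, it tokenizes with a regex, it drops attributes, and it knows about nothing except simple tags and text. The assumed rule: a new <li> closes an <li> that is still open in the same list, a <ul> nests inside the open item, and whatever is still open at the end gets closed.

```ruby
# Hypothetical helper: insert the end tags that the unclosed list markup implies.
# A sketch only; it is not a general HTML parser and it discards attributes.
def close_list_items(html)
  stack = []  # names of the elements that are currently open, innermost last
  out = []    # output fragments, joined at the end

  # Split the input into tags (<...>) and runs of text between them.
  html.scan(/<[^>]+>|[^<]+/) do |token|
    if (m = token.match(/\A<(\/?)(\w+)[^>]*>\z/))
      closing = !m[1].empty?
      name = m[2].downcase

      if closing
        # Explicit end tag: close everything down to the matching element.
        # A stray end tag with no matching open element is ignored.
        if stack.include?(name)
          loop do
            popped = stack.pop
            out << "</#{popped}>"
            break if popped == name
          end
        end
      else
        # A new <li> implicitly ends an <li> that is open in the same list.
        out << "</#{stack.pop}>" if name == "li" && stack.last == "li"
        stack.push(name)
        out << "<#{name}>"
      end
    else
      # Plain text: keep it, minus surrounding whitespace.
      text = token.strip
      out << text unless text.empty?
    end
  end

  # End of input: close whatever is still open, innermost first.
  out << "</#{stack.pop}>" until stack.empty?
  out.join
end

input = "<ul> <li>a <li>b <li>c <li>d <ul> <li>e <li>f <li>g <li>h"
puts close_list_items(input)
# prints <ul><li>a</li><li>b</li><li>c</li><li>d<ul><li>e</li><li>f</li><li>g</li><li>h</li></ul></li></ul>
```

Note that the sketch settles the ambiguity the same way Tidy does (the second list ends up nested inside item d), simply because nothing in the markup closes the first list before the second one opens.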

Cheers, Bob