The problem is that the HTML on Froogle is seriously broken.
So I guess the question is: how do I make Hpricot cope with this markup?
The page obviously renders fine in the browser. Are there any tools that will
convert a string of HTML to a valid XML or DOM equivalent? It must be
possible, because web browsers handle it all the time.
What I need to be able to do:
require 'open-uri'
require 'hpricot'
html = open('http://foo.com/').read
html = html.clean_markup  # hypothetical method: this is the step I'm missing
html = Hpricot(html)
I had a similar problem last week and ended up doing exactly what you are proposing, i.e. a pre-processing step to clean up the HTML before feeding it to Hpricot.
Here is an oversimplified example of Froogle's malformed markup:
I believe there are Ruby libraries for cleaning up HTML, though I'm not familiar with them. Perhaps you could just treat the page as one long string and walk over it doing the following:
1. Scan forward until you find a tag (either opening or closing).
2. If the tag is a known potentially-broken one ('<tr>', '<th>', '<td>', etc.), set a flag for that tag to indicate it is open (or push it onto a per-tag stack somewhere). Clear the flag (or pop the stack) if/when you see the matching closing tag.
3. When you see that tag again, if it hasn't been closed in the meantime, insert the closing tag yourself and clear your flag (pop your stack).
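For what it's worth, here is a rough sketch of those three steps in Ruby. The names close_open_tags and AUTO_CLOSE are made up for illustration, and the tag list is just the one from step 2. It uses the simple one-flag-per-tag version, not a stack, so treat it as a starting point and not a real HTML cleaner.

```ruby
# Hypothetical list of tags that tend to be left unclosed (from step 2).
AUTO_CLOSE = %w[tr th td].freeze

# Walks over every tag in the string and inserts a closing tag whenever
# a tracked tag is opened again before it has been closed.
def close_open_tags(html, tags = AUTO_CLOSE)
  is_open = Hash.new(false) # one "currently open" flag per tag name

  # Step 1: gsub visits each opening or closing tag in order.
  html.gsub(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>}) do |tag|
    closing = !$1.empty?
    name = $2.downcase
    next tag unless tags.include?(name) # leave untracked tags alone

    if closing
      # Step 2: a matching closing tag clears the flag.
      is_open[name] = false
      tag
    elsif is_open[name]
      # Step 3: same tag seen again while still open, so close it first.
      # The flag stays set because the new tag is now the open one.
      "</#{name}>#{tag}"
    else
      # Step 2: first time we see this tag, mark it as open.
      is_open[name] = true
      tag
    end
  end
end

puts close_open_tags('<table><tr><td>a<td>b</td></tr></table>')
```

Because each tag only has its own flag, this will not cope with nested tables, and it does not close a dangling '<td>' when the next '<tr>' starts; you would need the per-tag stack (or some knowledge of which tags contain which) for that.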
I think it will be easier to do than it sounds.
Hope that helps,