Hpricot help - parsing malformed HTML

The problem is that the HTML on Froogle is seriously broken.


So I guess the question is: how do I make Hpricot cope with this markup? It obviously works fine in the browser. Are there any tools that will convert a string of HTML to a valid XML or DOM equivalent? It must be possible, because web browsers handle it all the time.

What I need to be able to do:

  html = open('http://foo.com/').read
  html = html.clean_markup
  html = Hpricot(html)

I had a similar problem last week and ended up doing exactly what you are proposing, i.e. a pre-processing step to clean up the HTML before feeding it to Hpricot.

Here is an oversimplified example of Froogle's malformed markup:

  <table>
  <tr>
  <td>foo
  <td>bar
  <tr>
  <td>baz
  <td>boo
  </table>

I believe there are Ruby libraries for cleaning up HTML though I'm not familiar with them. Perhaps you could just treat it as a long string and walk over it doing the following:

1. Scan forward until you find a tag (either opening or closing).
2. If the tag is a known potentially-broken one ('<tr>', '<th>', '<td>', etc.), set a flag for that tag to indicate it is open (or push it onto a per-tag stack somewhere). Clear the flag (or pop the stack) if/when you see the matching closing tag.
3. When you see that tag again, if it hasn't been closed in the meantime, insert the closing tag yourself and clear your flag (pop your stack).
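The steps above could be sketched in Ruby roughly as follows. This is only a sketch under some assumptions: `close_table_tags` and `IMPLICITLY_CLOSES` are made-up names (nothing from Hpricot), it uses one shared stack rather than per-tag flags so that a new '<tr>' also closes an open cell, it only knows about '<tr>', '<td>' and '<th>', and it assumes tables are not nested.

```ruby
# Hypothetical helper, not part of Hpricot: inserts the end tags that
# were left out for <tr>, <td> and <th>. Assumes no nested tables.

# For each tracked tag, the open elements that a new start tag of that
# kind implicitly closes (an assumption modelled on how HTML tables nest).
IMPLICITLY_CLOSES = {
  'td' => %w[td th],
  'th' => %w[td th],
  'tr' => %w[td th tr]
}.freeze

def close_table_tags(html)
  open_tags = [] # stack of tr/td/th elements that are currently open

  # Step 1: visit every opening or closing tag in the string.
  html.gsub(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>}) do |tag|
    closing = !$1.empty?
    name = $2.downcase
    inserted = [] # end tags to put in front of the current tag

    if !closing && IMPLICITLY_CLOSES.key?(name)
      # Step 3: a tracked tag opens again, so close whatever it ends.
      while IMPLICITLY_CLOSES[name].include?(open_tags.last)
        inserted << "</#{open_tags.pop}>"
      end
      # Step 2: remember that this tag is now open.
      open_tags.push(name)
    elsif closing && open_tags.include?(name)
      # An explicit end tag: close anything still open inside it first.
      inserted << "</#{open_tags.pop}>" until open_tags.last == name
      open_tags.pop
    elsif closing && name == 'table'
      # End of the table: close every row and cell left open.
      inserted << "</#{open_tags.pop}>" until open_tags.empty?
    end

    inserted.join + tag
  end
end
```

You would then call something like `Hpricot(close_table_tags(html))`. Markup that already has all its end tags should pass through unchanged.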

I think it will be easier to do than it sounds ;-)

Hope that helps, Andy

Andrew Stewart wrote:

> The problem is that the HTML on Froogle is seriously broken.



The example given is not malformed. It's perfectly acceptable HTML 4.01. The end tags for <tr> and <td> can be omitted.

Unless the DOCTYPE declaration claims it to be something newer than HTML 4.01 (XHTML, for example, which requires the end tags), it is fine.

I would say this is a bug in Hpricot.

- Mark.

You can use RubyfulSoup to deal with HTML even when it isn't completely correct. It is packaged as a gem, but I unpacked it into the plugin directory and it's working for me. (Hpricot didn't exist at the time or I might have tried it.)

  #Rubyful Soup
  #Elixir and Tonic
  #"The Screen-Scraper's Friend"
  #v1.0.4
  #Rubyful Soup: "The brush has got entangled in it!"