Hpricot help - parsing malformed HTML

Andrew_Stewart · November 17, 2006, 8:37am

The problem is that the HTML on Froogle is seriously broken.

Agreed!

So I guess the question is how do I make Hpricot cope with this markup? It obviously works great in the browser. Are there any tools that will convert a string of HTML to a valid XML or DOM equivalent? It must be possible because web browsers handle it all the time.

What I need to be able to do:

    html = open('http://foo.com/').read
    html = html.clean_markup
    html = Hpricot(html)

I had a similar problem last week and ended up doing exactly what you are proposing, i.e. a pre-processing step to clean up the HTML before feeding it to Hpricot.

Here is an oversimplified example of Froogle's malformed markup:

    <table>
      <tr>
        <td>foo
        <td>bar
      <tr>
        <td>baz
        <td>boo
    </table>

I believe there are Ruby libraries for cleaning up HTML though I'm not familiar with them. Perhaps you could just treat it as a long string and walk over it doing the following:

1. Scan forward until you find a tag (either opening or closing).
2. If the tag is a known potentially-broken one ('<tr>', '<th>', '<td>', etc) set a flag for that tag to indicate it is open (or push it onto a per-tag stack somewhere). Clear the flag (or pop the stack) if/when you see the matching closing tag.
3. When you see that tag again, if it hasn't been closed in the meantime, insert the closing tag yourself and clear your flag (pop your stack).
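A rough sketch of those steps in Ruby might look like the following. It is only an illustration, not tested against Froogle itself: the method name close_table_tags and the IMPLIED_CLOSE table are made up for this example, it uses a single stack rather than per-tag flags so that a new '<tr>' also closes an open '<td>', and the regexp is naive (it will be fooled by tags inside comments or scripts and by attribute values containing '>').

```ruby
# Hypothetical helper, not part of Hpricot: inserts the end tags that
# HTML 4.01 allows authors to omit for <tr>, <td> and <th>.

# For each tracked tag, the open tags that a new occurrence implicitly closes.
IMPLIED_CLOSE = {
  'tr' => %w[td th tr],
  'td' => %w[td th],
  'th' => %w[td th]
}.freeze

def close_table_tags(html)
  stack = [] # currently open table/tr/td/th elements, innermost last

  # Step 1: visit every opening or closing tag in the string.
  html.gsub(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>}) do |tag|
    closing  = ($1 == '/')
    name     = $2.downcase
    inserted = String.new

    if !closing && IMPLIED_CLOSE.key?(name)
      # Step 3: the tag is seen again while an earlier one is still open,
      # so emit the missing end tags before it.
      inserted << "</#{stack.pop}>" while IMPLIED_CLOSE[name].include?(stack.last)
      # Step 2: remember that this tag is now open.
      stack << name
    elsif !closing && name == 'table'
      # A <table> entry acts as a barrier so nested tables are left alone.
      stack << name
    elsif closing && stack.include?(name)
      # An explicit end tag: close anything still open inside it, then pop it.
      inserted << "</#{stack.pop}>" while stack.last != name
      stack.pop
    end

    inserted + tag
  end
end
```

Run over the oversimplified table above, this should put a '</td>' before each following '<td>' and a '</td></tr>' before the second '<tr>' and before '</table>', while markup that already has its end tags comes back unchanged.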

I think it will be easier to do than it sounds.

Hope that helps, Andy

Thomas_Mark_BLS_CTR · November 17, 2006, 2:51pm

Andrew Stewart wrote:

> The problem is that the HTML on Froogle is seriously broken.

> Agreed!

Disagree!

The example given is not malformed. It's perfectly acceptable HTML 4.01. The end tags for <tr> and <td> can be omitted.

Unless the DOCTYPE declaration claims the document is something newer than HTML 4.01, such as XHTML, it is fine.

I would say this is a bug in Hpricot.

- Mark.

rab · November 17, 2006, 3:06pm

You can use RubyfulSoup to deal with HTML even when it isn't completely correct. It is packaged as a gem, but I unpacked it into the plugin directory and it's working for me. (Hpricot didn't exist at the time or I might have tried it.)

    #Rubyful Soup
    #Elixir and Tonic
    #"The Screen-Scraper's Friend"
    #v1.0.4
    #Rubyful Soup: "The brush has got entangled in it!"
