Hpricot help - parsing malformed HTML

Take a look at scrapi, if not to actually use it then to steal Assaf's ideas. =) I think he has some way to pre-process HTML with Tidy in there; you might want to crib that approach.

We also use Tidy to clean up invalid XHTML in the MasterView project.

You can get the Ruby tidy wrapper here:

http://rubyforge.org/projects/tidy

http://tidy.rubyforge.org/ (for usage info)

Note that it also requires the tidy library itself to be available on the server. The library is available for both Windows and *nix.

It works well at cleaning up invalid XHTML, and the Ruby tidy wrapper is simple to use. The only disadvantage is that you need to have the lib installed and you need to set the path to it so that the wrapper can load it. I wish that could be automated somehow, because right now it is a manual setup step.
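For what it's worth, here is a rough sketch of how the path lookup might be semi-automated. The candidate paths, the TIDY_LIB environment variable, and the helper names find_tidy_lib and tidy_html are all my own assumptions, not part of the wrapper. The Tidy.path / Tidy.open / tidy.clean calls follow the wrapper's usage page, and output_xhtml is assumed to map to tidy's output-xhtml option, so check it against your version.

```ruby
# Candidate locations for the tidy shared library. These paths are
# guesses for illustration only; adjust them for your own servers.
# TIDY_LIB is a hypothetical environment variable override.
TIDY_LIB_CANDIDATES = [
  ENV['TIDY_LIB'],
  '/usr/lib/libtidy.so',
  '/usr/local/lib/libtidy.so',
  '/usr/lib/libtidy.dylib',
  'C:/tidy/tidy.dll'
].compact

# Return the first candidate that exists as a regular file, or nil.
def find_tidy_lib(candidates = TIDY_LIB_CANDIDATES)
  candidates.find { |path| File.file?(path) }
end

# Clean up markup with the Ruby tidy wrapper and return the result.
# The require lives inside the method so the lookup helper above can
# be used (and tested) even where the wrapper is not installed.
def tidy_html(html)
  require 'tidy'
  lib = find_tidy_lib
  raise 'libtidy not found; set TIDY_LIB to its full path' if lib.nil?

  Tidy.path = lib
  Tidy.open(:show_warnings => true) do |tidy|
    tidy.options.output_xhtml = true # assumed option name, see note above
    tidy.clean(html)
  end
end
```

With something like that in place you could hand the cleaned string to Hpricot, e.g. Hpricot(tidy_html(raw_html)). It still needs the lib installed, but at least the path no longer has to be hard-coded per machine.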

Jeff