Michael wrote:
All,
If anyone is thinking about using either of these packages for screen-scraping, I think you should consider Mechanize as an option over RubyfulSoup.
I was using RubyfulSoup to scrape HTML pages in a batch process where performance didn't matter very much. I then needed to port the functionality into a user-facing process where performance did become an issue. RubyfulSoup was taking about 30 seconds to initialize/load the page before any processing was done on it. This was unacceptable for the user process.
I started looking into other options. SCRAPI was one option that seemed really promising but I couldn't find enough documentation on it to make much headway. It may be a good option for others who are more familiar with CSS Selectors, but that person isn't me.
I then looked into WWW::Mechanize. Most of what I found on the internet was about using it to fill out forms and post data. It was hard to find good examples of parsing out text values, etc., but it turned out to be a great option. WWW::Mechanize uses Hpricot for querying the HTML document with XPath or CSS selectors.
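Since examples of pulling out text values were hard to find, here is a rough sketch of the kind of XPath query involved. It is not Mechanize or Hpricot code: to keep it runnable with nothing but a stock Ruby install, it uses REXML as a stand-in, which (unlike Hpricot) needs well-formed markup. The sample page, the class names and the variable names are all made up for illustration.

```ruby
require 'rexml/document'

# Made-up sample page. REXML is an XML parser, so the markup must be
# well-formed; real-world HTML usually needs a more forgiving parser.
html = <<~HTML
  <html>
    <body>
      <table>
        <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
        <tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
      </table>
    </body>
  </html>
HTML

doc = REXML::Document.new(html)

# XPath: every <td> whose class attribute is "name" (or "price"),
# anywhere in the document. Element#text returns the cell's text.
names  = REXML::XPath.match(doc, "//td[@class='name']").map(&:text)
prices = REXML::XPath.match(doc, "//td[@class='price']").map(&:text)

# Pair each name with its price and print one line per row.
names.zip(prices).each { |name, price| puts "#{name}: #{price}" }
```

The XPath expressions are the part that carries over; how you hand them to the parser depends on the library and version you are using.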
In my opinion, RubyfulSoup is much easier to learn and use initially. However, WWW::Mechanize is MUCH faster, at least for my needs. The page that was taking over 30 seconds to load into RubyfulSoup takes just a few seconds to load into Mechanize (and that is just the time it takes to pull it down from the source URL). Parsing/searching/extracting is extremely fast and solved my performance problems. I already knew XPath query statements so it was pretty easy.
Hopefully someone else can benefit from this before investing a lot of time in RubyfulSoup just to find that it may have performance issues.
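If you want to check where the time goes on your own pages, one way is to time the download and the parse separately. This is only a sketch using Ruby's standard Benchmark module; time_phases is a hypothetical helper, and the two lambdas are offline stand-ins that you would replace with a real fetch and a real parser.

```ruby
require 'benchmark'

# Hypothetical helper: times the fetch step and the parse step separately.
# `fetch` returns the raw HTML; `parse` takes that HTML and returns
# whatever the parser produces.
def time_phases(fetch, parse)
  html = nil
  doc = nil
  fetch_seconds = Benchmark.realtime { html = fetch.call }
  parse_seconds = Benchmark.realtime { doc = parse.call(html) }
  { fetch: fetch_seconds, parse: parse_seconds, doc: doc }
end

# Stand-ins so the sketch runs offline; swap in a real download and parser.
fetch = -> { "<html><body><p>hello</p></body></html>" }
parse = ->(html) { html.scan(/<p>(.*?)<\/p>/).flatten }

result = time_phases(fetch, parse)
printf("fetch: %.4fs, parse: %.4fs\n", result[:fetch], result[:parse])
```

Comparing the two numbers for each library shows whether the parser or the network is the bottleneck.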
I was using regular expressions for some page-scraping, then found out about RubyfulSoup. It seemed like the "proper" way to do things, but I had to abandon it because, for my application, it was intolerably slow. I have to deal with hundreds or thousands of pages, and if the parsing takes much longer than the fetching (over a 0.5 Mbit/s connection), that's no good for me.
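For anyone wondering what the regular-expression approach looks like, here is a minimal sketch, not the code from my application. extract_links and HREF_PATTERN are hypothetical names, the sample markup is made up, and the pattern only handles double-quoted href attributes, so it is far less robust than a real parser.

```ruby
# Hypothetical pattern: an opening <a ...> tag with a double-quoted href.
# Case-insensitive so that <A HREF="..."> is matched as well.
HREF_PATTERN = /<a\s[^>]*?href="([^"]*)"/i

# Returns the href value of every anchor tag the pattern finds.
def extract_links(html)
  html.scan(HREF_PATTERN).flatten
end

sample = '<p><a href="/one">One</a> and <A class="x" HREF="/two">Two</A></p>'
puts extract_links(sample)
```

This is quick because it never builds a document tree, but it breaks easily on single-quoted or unquoted attributes.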
regards
Justin Forder