RubyfulSoup vs Mechanize - Surprising Performance...

Michael wrote:

All,

If anyone is thinking about using either of these packages to
screen-scrape then I think you should consider mechanize as an option
over rubyfulsoup.

I was using RubyfulSoup to scrape HTML pages in a batch process where
performance didn't matter very much. I then needed to port the
functionality into a user-facing process, where performance did become
an issue. RubyfulSoup was taking about 30 seconds to initialize/load the
page before any processing was done on it. That was unacceptable for the
user-facing process.

I started looking into other options. SCRAPI was one option that seemed
really promising, but I couldn't find enough documentation on it to make
much headway. It may be a good option for people who are more familiar
with CSS selectors, but that person isn't me.

I then looked into WWW::Mechanize. Most of what I found on the Internet
was about using it to fill out forms and post data, and it was hard to
find good examples of parsing out text values, but it turned out to be a
great option. WWW::Mechanize uses Hpricot for querying the HTML document
with XPath or CSS selectors.
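For anyone who hasn't seen it, the usage described above looks roughly like this. A minimal sketch, assuming the mechanize gem of that era is installed; the URL and the "//h1" XPath are placeholders, not values from this thread:

```ruby
require 'rubygems'
require 'mechanize'   # defines WWW::Mechanize in older releases

agent = WWW::Mechanize.new
page  = agent.get('http://example.com/')   # placeholder URL

# Page#search delegates to Hpricot, so XPath or CSS selectors both work:
page.search('//h1').each { |node| puts node.inner_text }
```

The same `page.search` call accepts CSS-style selectors (e.g. `'div.price'`) as well, which is the Hpricot behavior Michael refers to.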

In my opinion, RubyfulSoup is much easier to learn and use initially.
However, WWW::Mechanize is MUCH faster - at least for my needs. The
page that was taking over 30 seconds to load into RubyfulSoup takes just
a few seconds to load into Mechanize (and that is essentially the time
it takes to pull the page down from the source URL).
Parsing/searching/extracting is extremely fast and solved my performance
problems. I already knew XPath query syntax, so it was pretty easy.
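The thread doesn't quote the actual Hpricot queries Michael used, so here is a stand-in sketch of the same style of XPath extraction using Ruby's bundled REXML library; the element names and the `prices` id are invented for illustration:

```ruby
require 'rexml/document'

html = <<-HTML
<html><body>
  <div id="prices"><span>12.50</span><span>8.99</span></div>
</body></html>
HTML

doc    = REXML::Document.new(html)
# Pull the text of every span under the prices div, converting to floats:
prices = REXML::XPath.match(doc, "//div[@id='prices']/span").map { |n| n.text.to_f }
puts prices.inspect   # => [12.5, 8.99]
```

With Mechanize the equivalent query would go through `page.search` instead, but the XPath expression itself carries over unchanged.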

Hopefully someone else can benefit from this before investing a lot of
time in rubyfulsoup just to find that it may have performance issues.

I was using regular expressions for some page-scraping, then found out about RubyfulSoup. It seemed like the "proper" way to do things, but I had to abandon it because, for my application, it was intolerably slow. I have to deal with hundreds or thousands of pages, and if the parsing takes much longer than the fetching (over a 0.5Mbit/s connection) that's no good for me.
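For comparison, the regular-expression approach Justin started with needs no parser at all. A minimal sketch (the anchor-tag pattern and sample HTML are invented for illustration); it is fragile against markup changes, but fast, which is exactly the trade-off under discussion:

```ruby
# Regex-based scraping: no parser gem, no DOM build step.
html = '<a href="/a.html">First</a> <a href="/b.html">Second</a>'

# Capture each link's href and text in one pass:
links = html.scan(/<a\s+href="([^"]+)">([^<]+)<\/a>/)
links.each { |href, text| puts "#{text} -> #{href}" }
# => First -> /a.html
#    Second -> /b.html
```

`String#scan` with capture groups returns an array of `[href, text]` pairs, which is often all a bulk-scraping job needs.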

regards

   Justin Forder

Michael wrote:


Justin,

The parsing with mechanize is extremely fast!

Michael

Thanks, I'll take a look.

   Justin

Is there a Ruby solution for spidering and scraping JavaScript-generated
pages, e.g. when a form and its options are built with JavaScript? I
have a job where I have to spider and scrape JavaScript-built pages, and
I wish I could do it with a Ruby solution. Any suggestions?

I've never used it, but I've seen a Ruby extension:
    http://raa.ruby-lang.org/project/ruby-js/