net/http vs . . . curl? anything else? what's fastest

Hi,

I'm grabbing xml feeds with net/http, and I'm wondering if there is
anything else out there that's faster. Any suggestions?

Charlie

Someone suggested that I use Hpricot.

  HTH,

I think Hpricot uses open-uri to grab XML. I believe that open-uri is
a wrapper around net/http, so I don't think it will be faster than
net/http.

I'm looking for the grabbing part. I wonder if there is anything out
there faster than net/http.

Charlie

When you say faster, what do you mean?

Is it a throughput issue (i.e. # of docs/sec)?
Is it a latency issue (i.e. from the start of retrieving the doc to the time the doc gets back is too long)?
Is it an XML parsing issue (i.e. parsing that many documents is slow and loading the server)?

One way to fix the first kind of problem might be to spawn a process to download each document instead of downloading the documents in series. But that's not really a net/http issue.

Ruby is a nicely extensible language: you can plug in C modules that may be faster than the equivalent Ruby code. Sometimes you can find modules that rely on C code instead of Ruby code to perform a given task, and they may be faster.
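To make the parallel-download idea concrete, here is a minimal sketch. Note that it uses threads rather than separate processes. That is a swap on my part, since the work is mostly waiting on the network and threads are simpler to show. `fetch_all` and its `fetcher` parameter are names invented for this example, and the URLs in the comment are placeholders.

```ruby
require "net/http"
require "uri"

# Default fetcher: one plain GET per URL via net/http.
DEFAULT_FETCHER = ->(url) { Net::HTTP.get(URI(url)) }

# Download every URL in its own thread and return { url => body }.
# `fetcher` is injectable so the download step can be swapped or stubbed.
def fetch_all(urls, fetcher = DEFAULT_FETCHER)
  threads = urls.map do |url|
    Thread.new { [url, fetcher.call(url)] }
  end
  # Thread#value waits for the thread and returns the block's result.
  threads.map(&:value).to_h
end

# Hypothetical usage (placeholder URLs):
#   feeds = %w[http://example.com/a.xml http://example.com/b.xml]
#   fetch_all(feeds).each { |url, body| puts "#{url}: #{body.bytesize} bytes" }
```

This should help throughput (docs/sec) when the feeds are independent, but it does nothing for the latency of any single request.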

One problem with TCP performance, especially with SSL, may be paying for the connection handshake. If you retrieve all your docs from the same address(es), you might want to engineer a solution that avoids having to set up and tear down a connection for each doc.

There are some good ideas here. Thanks. Here's what I know so far:

-- xmlparser (~0.2 sec) is many times faster than rexml (~1.9
sec) or hpricot (1.3 sec), at least the way I'm kludging it so far
-- now that I've decided on xmlparser, for now, the biggest time lag
is getting the content over the 'net via net/http (~1.0 - 1.8 sec),
measured from the beginning of the request to the time the content
is completely retrieved
-- I don't know where the time lag is coming from
-- it would be terrific to reuse a connection when grabbing many
feeds from the same source; any hints on that?
-- I don't know whether throughput is an issue, and I don't know how
to break that down with net/http
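One rough way to see where the lag comes from is to time the connection setup separately from the request itself. The sketch below is only that, a sketch: `time_fetch` is a made-up helper name, it assumes plain HTTP (no SSL), and the host and path in the comment are placeholders.

```ruby
require "net/http"
require "benchmark"

# Time the connection setup and the request/response separately.
# Returns [connect_seconds, request_seconds, body].
def time_fetch(host, path, port = 80)
  http = Net::HTTP.new(host, port)
  body = nil
  # Opening the connection: name lookup plus TCP connect.
  connect = Benchmark.realtime { http.start }
  # Sending the GET and reading the complete response body.
  request = Benchmark.realtime { body = http.get(path).body }
  [connect, request, body]
ensure
  http.finish if http && http.started?
end

# Hypothetical usage (placeholder host and path):
#   connect, request, body = time_fetch("feeds.example.com", "/news.xml")
#   puts "connect: #{connect}s, request: #{request}s, #{body.bytesize} bytes"
```

If most of the time lands in the connect step, reusing connections is worth pursuing; if it lands in the request step, the server or the transfer itself is the more likely bottleneck.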

Is there some kind of tutorial or guide you know about where I can
learn how to extend, say, curl to work with ruby, so I can grab
content with curl?

Charlie

http://curb.rubyforge.org/

This is just a suggestion:

If you're grabbing documents from the same place, you might want to
cobble together your own server specific to sending XML documents (a
one-trick pony that's very good and fast at its trick). On the
"client" side you have a local service that establishes a connection
to your remote service. A process on your client contacts the local
server, which uses a connection from a pool of connections between
the local server and the remote server. When the document is
retrieved from the remote server, the connection is put back into the
pool and the document is returned to the process that requested it
from the local server. Because the local server (a proxy, if you
will) pools its connections, you only pay for the SSL/TCP connection
once. However, this may require more work than you're willing to do,
depending on the degree of performance you need.

HTTP is also supposed to be able to re-use a connection, so just
keeping your http connections around, or pooling them, might help as
well.

Local Process ----Get doc---> Local server ---Get doc---> Remote server.
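For the simpler "just keep the HTTP connection around" option, net/http can already do this: requests made inside one `Net::HTTP.start` block share a single connection, whereas separate `Net::HTTP.get` calls each open a new one. A minimal sketch, assuming all feeds live on the same host and that the server honours keep-alive; `fetch_feeds` is a name made up for this example, and the host and paths in the comment are placeholders.

```ruby
require "net/http"

# Fetch several paths from one host over a single connection.
# Net::HTTP.start keeps the socket open for the whole block, so the
# TCP (and, with use_ssl, SSL) handshake is paid once, provided the
# server does not close the connection between requests.
def fetch_feeds(host, paths, port: 80, use_ssl: false)
  Net::HTTP.start(host, port, use_ssl: use_ssl) do |http|
    paths.map { |path| http.get(path).body }
  end
end

# Hypothetical usage (placeholder host and paths):
#   bodies = fetch_feeds("feeds.example.com", ["/news.xml", "/sports.xml"])
```

This covers the single-process case only; the pooled local proxy described above would still be needed to share connections across processes.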

Thanks again. Your suggestions are a little beyond me, so it's going
to take me some time to figure them out.

I tried curb vs. net/http, and the results are almost identical.

From what I can tell, net/http uses HTTP 1.1 by default. I tried
looping through and grabbing three different XML documents, with both
curb and net/http, but the second and third tries were just as slow as
the first. It would be nice if there were some "built-in" way to
reuse connections with curb or net/http, and I'm going to investigate
that.

Also, here are corrected times for the various parsers. My earlier
numbers mistakenly included the time it took to print to STDOUT:

rexml: 1.6 sec
hpricot: .25 sec
xmlparser: .02 sec

Here are the config options I used to install curb, in case some other
newbie using FreeBSD 6.2, or some other config-needing OS, runs into
the installation problems I did.

1. after the gem install curb fails, chdir to ./ext
2. ruby extconf.rb --with-curl-lib=/usr/local/lib --with-curl-include=/usr/local/include/

(or your path/to/lib or path/to/include)

Charlie

charlie caroff wrote:

Hi,

I'm grabbing xml feeds with net/http, and I'm wondering if there is
anything else out there that's faster. Any suggestions?

Charlie

Hi Charlie

It appears that you are doing quite a few things similar to me! In my case, I'm using curl to grab XML files over HTTP. I'm not sure which is faster but in my case, I have to get a few megabytes of data every 5 minutes, so speed is not that critical. For what it's worth, Ruby + curl has served me well.

Cheers,
Mohit.
8/16/2007 | 11:27 AM.