net/http vs . . . curl? anything else? what's fastest

charlie_caroff · August 15, 2007, 8:24pm

Hi,

I'm grabbing xml feeds with net/http, and I'm wondering if there is anything else out there that's faster. Any suggestions?

Charlie

Davi · August 15, 2007, 8:51pm

Someone suggested that I use Hpricot.

HTH,

charlie_caroff · August 15, 2007, 9:08pm

I think hpricot uses open-uri to grab xml. I believe that open-uri is a wrapper around net/http, so I don't think it will be faster than net/http.

I'm looking for the grabbing part. I wonder if there is anything out there faster than net/http.

Charlie

Paul_Hoehne · August 15, 2007, 9:09pm

When you say faster, what do you mean?

Is it a throughput issue (i.e. # of docs/sec)? Is it a latency issue (i.e. from the start of retrieving the doc to the time the doc gets back is too long)? Is it an xml parsing issue (i.e. parsing that many documents is slow and loading the server)?

One way to fix the first kind of problem might be to spawn a process to download each document instead of downloading the documents in series. But that's not really a net/http issue. Ruby is a nicely extensible language: you can plug in C modules that may be faster than the equivalent Ruby code. Sometimes you can find modules that rely on C code instead of Ruby code to perform a given task, and they may be faster.
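As a rough illustration of downloading in parallel rather than in series, here is a minimal sketch. It uses one thread per document instead of one process per document, on the assumption that the work is mostly waiting on the network. The method name fetch_concurrently is made up for this example.

```ruby
require 'net/http'
require 'uri'

# Fetch every URL in its own thread so the network waits overlap.
# Returns a hash mapping each URL string to its response body.
# (fetch_concurrently is a hypothetical helper, not part of net/http.)
def fetch_concurrently(urls)
  threads = urls.map do |url|
    Thread.new { [url, Net::HTTP.get(URI(url))] }
  end
  # Thread#value waits for the thread and returns the block's result.
  threads.map(&:value).to_h
end

# Usage (placeholder URLs):
#   bodies = fetch_concurrently(['http://example.com/a.xml',
#                                'http://example.com/b.xml'])
```

This helps throughput (docs/sec) only; each individual document still takes as long as before.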

One problem with TCP performance, especially with SSL, may be paying for the connection handshake. If you retrieve all your docs from the same address(es), you might want to engineer a solution that avoids having to set up and tear down connections for each doc.

charlie_caroff · August 15, 2007, 9:39pm

There are some good ideas here. Thanks. Here's what I know so far:

-- xmlparser (~.2 sec) is many many times faster than rexml (~1.9 sec) or hpricot (1.3 sec), at least the way I'm kludging it so far
-- now that i've decided on xmlparser, for now, the biggest time lag is getting the content over the 'net via net/http (~1.0 - 1.8 sec)
-- that is from the beginning of the request, to the time the content is completely retrieved
-- i don't know where the time lag is coming from
-- it would be terrific to reuse a connection that is grabbing many feeds from the same source, any hints on that?
-- i don't know whether throughput is an issue, and I don't know how to break that down with net/http
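One possible way to see where the net/http time goes is to time the connection setup separately from the request itself. The sketch below does that with Benchmark from the standard library. The method name time_fetch and the returned hash keys are made up for this example, and the connect figure includes the DNS lookup as well as the TCP handshake.

```ruby
require 'net/http'
require 'benchmark'

# Time the connection setup and the GET request separately.
# Returns a hash with :connect and :request in seconds, plus the :body.
# (time_fetch is a hypothetical helper, not part of net/http.)
def time_fetch(host, port, path)
  http = Net::HTTP.new(host, port)
  body = nil
  # Opening the session covers name lookup and the TCP handshake.
  connect = Benchmark.realtime { http.start }
  # The GET covers sending the request and reading the whole response.
  request = Benchmark.realtime { body = http.get(path).body }
  { connect: connect, request: request, body: body }
ensure
  http.finish if http && http.started?
end

# Usage (placeholder host and path):
#   t = time_fetch('example.com', 80, '/feed.xml')
#   puts "connect: #{t[:connect]}s, request: #{t[:request]}s"
```

If most of the time is in the request part, reusing connections is unlikely to help much, and the delay is probably on the server or in the transfer.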

Is there some kind of tutorial or guide you know about where I can learn how to extend, say, curl to work with ruby, so I can grab content with curl?

Charlie

Paul_Hoehne · August 15, 2007, 9:43pm

http://curb.rubyforge.org/

Paul_Hoehne · August 15, 2007, 9:54pm

This is just a suggestion:

If you're grabbing documents from the same place, you might want to
cobble your own server specific to sending xml documents. (A one-trick pony that's very good and fast at its trick). On the "client"
side you have a local service that establishes a connection to your
remote service. A process on your client contacts the local server
which uses a connection in a pool of connections between the local
server and the remote server. When the document is retrieved from
the remote server, the connection is put back into the pool and the
document is returned to the process that requested it from the local
server. Because the local server, proxy if you will, pools its
connections you only pay for the ssl/tcp connection once. However,
this may require more work than you're willing to do - depending on
the degree of performance you need.

HTTP is also supposed to be able to re-use a connection, so just
keeping your http connections around, or pooling them, might help as
well.

Local Process ----Get doc---> Local server ---Get doc---> Remote server.
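For the simpler option of keeping an http connection around, a minimal sketch with net/http might look like the following. It assumes all the documents live on the same host and that the server allows persistent connections. The method name fetch_over_one_connection is made up for this example.

```ruby
require 'net/http'

# Open one connection to host:port and request every path over it.
# Returns the response bodies in the same order as paths.
# (fetch_over_one_connection is a hypothetical helper.)
def fetch_over_one_connection(host, port, paths)
  # The session stays open for the whole block, so the TCP handshake
  # is paid once as long as the server keeps the connection alive.
  Net::HTTP.start(host, port) do |http|
    paths.map { |path| http.get(path).body }
  end
end

# Usage (placeholder host and paths):
#   bodies = fetch_over_one_connection('example.com', 80,
#                                      ['/a.xml', '/b.xml', '/c.xml'])
```

Calling a one-shot helper such as Net::HTTP.get in a loop opens a fresh connection for each document, which would explain later requests being no faster than the first.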

charlie_caroff · August 15, 2007, 10:33pm

Thanks again. Your suggestions are a little beyond me, so it's going to take me some time to figure them out.

I tried curb vs. net/http, and the results are almost identical.

From what I can tell, net/http uses http 1.1 by default. I tried looping through and grabbing three different xml documents, with both curb and net/http, but the second and third tries were just as slow as the first. It would be nice if there were some "built-in" way to reuse connections with curb or net/http, and I'm going to investigate that.

Also, I would like to point out new times for the various parsers. I was including the times it took to print to STDOUT in my parsing times:

rexml: 1.6 sec
hpricot: .25 sec
xmlparser: .02 sec

Here are the config options I used to install curb, in case some other newbie using FreeBSD 6.2, or some other config-needing OS, runs into the installation problems I did.

1. after the gem install curb fails, chdir to ./ext
2. ruby extconf.rb --with-curl-lib=/usr/local/lib --with-curl-include=/usr/local/include/

(or your path/to/lib or path/to/include)

Charlie

Mohit_Sindhwani · August 16, 2007, 3:27am

charlie caroff wrote:

Hi,

I'm grabbing xml feeds with net/http, and I'm wondering if there is anything else out there that's faster. Any suggestions?

Charlie

Hi Charlie

It appears that you are doing quite a few things similar to me! In my case, I'm using curl to grab XML files over HTTP. I'm not sure which is faster but in my case, I have to get a few megabytes of data every 5 minutes, so speed is not that critical. For what it's worth, Ruby + curl has served me well.

Cheers, Mohit. 8/16/2007 | 11:27 AM.
