Threads and net/http: am I missing something?

Hi,

I have a backgroundrb worker that gets triggered every second. When
it's triggered, it's supposed to make 2 - 15 http-requests using
Net::HTTP. My idea was to put every execution into a thread so the
next execution doesn't have to wait for the last one. So basically:

def http_requests
  hosts.each do |host|
    Thread.new do
      begin
        client = Net::HTTP.start(host)
      rescue
        #store host as inactive
      ensure
        client.finish if client.active?
      end
    end
  end
end

Of course that's not all it does, but I hope you understand what I'm
trying to do here.

The thing is: this doesn't get done once a second. It appears that
every HTTP request is waiting for the last one to complete, which
clogs up Rails very quickly!

My question is: why is this? Does this have anything to do with Ruby
not being threadsafe (I doubt it, because that just means threads
aren't executed as precisely as with JRuby, right?) or is Net::HTTP
not able to make requests while another Net::HTTP request is still
running? And what can I do about it?

I hope you can help.

Update: I've tried doing it using EventMachine, which won't work
either:

EM.run do
  make_request(host)
end

def make_request(host)
  begin
    client = EventMachine::HttpRequest.new(host).get
    host.set_active
  rescue
    host.set_inactive
  ensure
    EM.stop
  end
end

Now it's only executing the set_active or set_inactive methods every
five seconds, even though I run this every second for 5 hosts. It
should query 5 hosts per second. What am I missing? Is this EM-
related? Should I have EventMachine check for responses more often?

The Global Interpreter Lock (GIL) prevents threads from executing in
parallel when using Ruby 1.8.6 aka MRI, 1.8.7, and 1.9.1 aka YARV.
However, JRuby 1.3.x/1.4.x, MacRuby 0.5 Beta 2, Maglev and several
other upcoming Ruby VMs are not constrained by the GIL. Thus, they
can execute threads in parallel.

Good luck,

-Conrad

Emm just because the threads aren't all executing *simultaneously*
doesn't mean that they aren't running in parallel (due to all the thread
switching etc).

Regardless, I can't seem to reproduce the OP's behaviour:

require 'net/http'

def hosts
  %w[rubyforge.org www.scala-lang.org www.google.com www.gamefaqs.com
     allrecipes.com m2k2.taigaforum.com youtube.com gitorious.org
     everything2.com]
end

def http_requests
  hosts.each do |host|
    Thread.new do
      begin
        puts "fetching host #{host}"
        client = Net::HTTP.start(host)
      rescue => e
        #store host as inactive
      ensure
        puts "finished with host #{host}"
        # client is nil when Net::HTTP.start raised
        client.finish if client && client.active?
      end
    end
  end
end

irb(main):001:0> http_requests
fetching host rubyforge.orgfetching host www.scala-lang.orgfetching
host www.google.com
fetching host www.gamefaqs.com
fetching host allrecipes.com
fetching host m2k2.taigaforum.com
finished with host rubyforge.orgfetching host youtube.com
fetching host gitorious.org
finished with host www.google.com
fetching host everything2.com
=> ["rubyforge.org", "www.scala-lang.org", "www.google.com",
"www.gamefaqs.com", "allrecipes.com", "m2k2.taigaforum.com",
"youtube.com", "gitorious.org", "everything2.com"]
irb(main):002:0>

finished with host www.scala-lang.org
finished with host m2k2.taigaforum.comfinished with host www.gamefaqs.com
finished with host allrecipes.com

finished with host youtube.com
finished with host everything2.com
finished with host gitorious.org

Am I missing something or misunderstanding the question?

Also "threadsafe" doesn't have anything to do with language itself (or
even its implementations) but rather if code operates correctly and
predictably when run in parallel. But blah terminology.

pharrington wrote, on 11/18/2009 07:23 PM:

> Emm just because the threads aren't all executing *simultaneously*
> doesn't mean that they aren't running in parallel (due to all the thread
> switching etc).
>
> Regardless, I can't seem to reproduce the OP's behaviour:

I'm too lazy to check the details, but I'd look at the implementation
details of BackgroundRB and EventMachine. I suspect the way they use
select calls may interact badly with Net::HTTP.

For example, last time I checked, if you wrap an HTTP get in a timeout
block, the timeout doesn't work: the internal Net::HTTP timeouts take
precedence and disable the global timeout.
Timeout uses a thread which calls sleep, which is implemented with
select IIRC...
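
If the worry is a request hanging longer than intended, one option that does not depend on a Timeout block is to set Net::HTTP's own timeouts. This is only a minimal sketch, assuming plain HTTP on port 80; build_client and its default values are made up for illustration and are not from the code discussed in this thread:

```ruby
require 'net/http'

# Hypothetical helper: build an unopened client with explicit timeouts,
# so a slow host gives up after a bounded time.
def build_client(host, open_secs = 2, read_secs = 5)
  http = Net::HTTP.new(host, 80)   # no connection is made yet
  http.open_timeout = open_secs    # seconds to wait for the TCP connect
  http.read_timeout = read_secs    # seconds to wait for each read
  http
end
```

Calling start on the returned object opens the connection; with the block form, start { |http| ... }, the connection is also finished automatically.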

Lionel

> Emm just because the threads aren't all executing simultaneously
> doesn't mean that they aren't running in parallel (due to all the thread
> switching etc).

Each thread must acquire the lock before it can execute. Thus, it operates
similarly to a queue data structure (i.e. first in, first out (FIFO)), and
this is how it works today in Ruby 1.8.6, 1.8.7, and 1.9.1. I know the C
implementation of the Ruby VM very well.

-Conrad

Wow, thanks for all your help, greatly appreciated.

> The Global Interpreter Lock (GIL) prevents threads from executing in
> parallel when using Ruby 1.8.6 aka MRI, 1.8.7, and 1.9.1 aka YARV.
> However, JRuby 1.3.x/1.4.x, MacRuby 0.5 Beta 2, Maglev and several
> other upcoming Ruby VMs are not constrained by the GIL. Thus, they
> can execute threads in parallel.

The reason we didn't choose JRuby is that it uses too much
memory to run on a VPS. Is there any documentation
available on using JRuby on a low-memory (<256MB) system? I've looked
for it, but couldn't find any. Maybe there's an alternative workaround
for the GIL? Our application uses quite a lot of memory, so when
presented with the JRuby vs. Ruby (EE) question, I thought it was a
choice between thread safety and memory usage, so I chose the latter.
I didn't know there was more to think about.

> Regardless, I can't seem to reproduce the OP's behaviour:

Which Ruby implementation are you using? I'm very sure every thread in
my piece of code is waiting for the other thread to finish, because I
log the time at which the data is saved. Most of the time there's 10 -
40 seconds between them, even though the backgroundrb process should
save at least one object every second.

> Each thread must acquire the lock before it can execute. Thus, it operates
> similarly to a queue data structure (i.e. first in, first out (FIFO)), and
> this is how it works today in Ruby 1.8.6, 1.8.7, and 1.9.1. I know the C
> implementation of the Ruby VM very well.
>
> -Conrad

The C code will acquire the GIL, yes, and then release it when it's done
its bit of business. This will happen any number of times within a
given function. So yes, while the first thread created is the first to
run its bit of code, in no way does that mean it's the first thread to
finish, nor does it stop the interpreter from switching control to
another thread when the lock is given up in the middle of execution.
Saying Ruby threads don't run in parallel is even less true than
saying coroutines aren't a form of parallelism.
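
A small experiment can make this concrete. The sketch below is an illustration only (overlapped_wait is a made-up name): it starts several threads that each block in sleep and measures the wall-clock time for all of them to finish. If blocked threads really queued up one after another, the total would be close to the sum of the sleeps; if the interpreter switches away from a blocked thread, it should be close to a single sleep.

```ruby
require 'benchmark'

# Hypothetical helper: start `count` threads that each block for
# `seconds`, wait for all of them, and return the elapsed wall-clock time.
def overlapped_wait(count, seconds)
  Benchmark.realtime do
    threads = (1..count).map { Thread.new { sleep seconds } }
    threads.each { |t| t.join }
  end
end

elapsed = overlapped_wait(5, 0.2)
puts "5 threads sleeping 0.2 s each took #{'%.2f' % elapsed} s in total"
```

Sleeping is of course not the same as a network read, so this only shows that blocking in one thread does not by itself stall the others.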

> > Regardless, I can't seem to reproduce the OP's behaviour:
>
> Which Ruby implementation are you using? I'm very sure every thread in
> my piece of code is waiting for the other thread to finish, because I
> log the time at which the data is saved. Most of the time there's 10 -
> 40 seconds between them, even though the backgroundrb process should
> save at least one object every second.

xeno@Clover:~/projects$ ruby -v
ruby 1.8.7 (2009-06-12 patchlevel 174) [x86_64-linux]

Threading's not the issue. Try looking into Lionel's suggestion;
perhaps things will work as expected if you switch to delayed_job
instead of BackgroundRB, or posssssibly even just a different HTTP
client like curb. Or perhaps the problem's not related to the HTTP
fetching at all and we don't have the whole story.

If the problem is in BackgroundRB I'm screwed. This whole system
depends on BackgroundRB. It's not just a long-running task, most tasks
will run daily for years and years to come. I'll check whether this is
related to BackgroundRB and report back.

Well, BackgroundRB does have a problem because workers can't overlap,
so a worker is put in the queue when it's started, which is a bit of a
pain because some HTTP-requests take longer than others. But look at
this:

  def schedule_queries
    i = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
         19, 20, 21, 22, 23, 24, 35]
    i.each do |j|
      Thread.new do
        logger.info "Hello #{j}"
        sleep 2
        logger.info "Hello Again #{j}"
      end
      sleep 1
    end
  end

This is executed every 10 seconds and my log shows this:

Hello 1
Hello 2
Hello Again 1
Hello 3
Hello Again 2
Hello 4
Hello Again 3
Hello 5
Hello Again 4
Hello 6
Hello Again 5

(...)

Hello Again 23
Hello 35
Hello Again 24
Hello Again 35
Hello 1
Hello 2
Hello Again 1
Hello 3

The expected behaviour is for the second "Hello 1" to appear after the
first "Hello 10", but it doesn't. That's not a real problem, though,
because I can easily work with bigger collections. However, when I
implement the script you presented earlier, it works! I don't know
why, and I certainly don't know how, but it does.

Could this have anything to do with ActiveRecord? Every time I call
set_active, I do a host.save. Could that be the problem? I'll work on
this some more in the morning.

> The C code will acquire the GIL, yes, and then release it when it's done
> its bit of business. This will happen any number of times within a
> given function. So yes, while the first thread created is the first to
> run its bit of code, in no way does that mean it's the first thread to
> finish, nor does it stop the interpreter from switching control to
> another thread when the lock is given up in the middle of execution.
> Saying Ruby threads don't run in parallel is even less true than
> saying coroutines aren't a form of parallelism.

Actually, the C code doesn't require a GIL because it's being executed
outside the VM within its own process. Thus, one can simulate very good
parallel execution using something like BackgroundRB because it's
implemented as a Ruby native extension.

-Conrad

I'm a little further on this. I've started logging the process instead
of writing to ActiveRecord. This is my code:

def schedule_queries
  t = Time.now
  hosts = get_hosts(30)
  logger.info "Starting request for #{hosts.count} domains at #{t}"
  hosts.each do |host|
    Thread.new do
      begin
        logger.info "Making request for #{host.identifier} at #{Time.now}"
        client = Net::HTTP.start(host.url)
        #set_active(host)
        logger.info "Finished request for #{host.identifier} at #{Time.now}"
      rescue
        #set_inactive(host)
        logger.info "Error in request for #{host.identifier} at #{Time.now}"
      ensure
        # client is nil when Net::HTTP.start raised
        client.finish if client && client.active?
      end
    end
  end
end

The log shows this:

Starting request for 30 domains at Thu Nov 19 11:50:01 +0100 2009
Making request for dym at Thu Nov 19 11:50:01 +0100 2009
Making request for nsn at Thu Nov 19 11:50:11 +0100 2009
Finished request for dym at Thu Nov 19 11:50:21 +0100 2009
Making request for ren at Thu Nov 19 11:50:21 +0100 2009
Finished request for nsn at Thu Nov 19 11:50:31 +0100 2009
Making request for ixf at Thu Nov 19 11:50:31 +0100 2009
Finished request for ren at Thu Nov 19 11:50:41 +0100 2009
Making request for phw at Thu Nov 19 11:50:41 +0100 2009
Finished request for ixf at Thu Nov 19 11:50:51 +0100 2009
Making request for frk at Thu Nov 19 11:50:51 +0100 2009
Finished request for phw at Thu Nov 19 11:51:01 +0100 2009
Making request for gyt at Thu Nov 19 11:51:01 +0100 2009
Finished request for frk at Thu Nov 19 11:51:11 +0100 2009
Making request for nlb at Thu Nov 19 11:51:11 +0100 2009
Finished request for gyt at Thu Nov 19 11:51:21 +0100 2009
Making request for tdz at Thu Nov 19 11:51:21 +0100 2009
Error in request for tdz at Thu Nov 19 11:51:39 +0100 2009
Finished request for nlb at Thu Nov 19 11:51:39 +0100 2009

As you can see, it does do -some- threading, but it finishes requests
only once every 10 seconds or so. What am I doing wrong? pharrington's
example works for me, but this one doesn't.

Some things can block the entire ruby VM - you may be falling foul of
one of them. In particular, domain name resolution can do that (there
is a pure ruby dns resolver which doesn't have that caveat)
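
Ruby's standard library ships a resolver written in Ruby, Resolv, which may be the one meant here. A rough sketch of using it to look up several hosts in separate threads follows; resolve_all is a hypothetical helper, and treating an unresolvable name as nil is an assumption made for the example:

```ruby
require 'resolv'   # resolver from the standard library, written in Ruby

# Hypothetical helper: look up every host in its own thread and return
# a hash of host => address, with nil for names that do not resolve.
def resolve_all(hosts, resolver = Resolv.new)
  threads = hosts.map do |host|
    Thread.new do
      begin
        [host, resolver.getaddress(host)]
      rescue Resolv::ResolvError
        [host, nil]
      end
    end
  end
  # Thread#value joins the thread and returns the block's result.
  Hash[threads.map { |t| t.value }]
end

# Example call (needs working DNS, so it is left commented out):
# p resolve_all(%w[rubyforge.org www.google.com])
```

The standard library also has resolv-replace (require 'resolv-replace'), which makes the socket classes used by Net::HTTP resolve names through Resolv. Whether that removes the stalls seen in this thread is untested here.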

Fred

Hi Fred,

Thanks for your response. I hope you can answer three questions about
this.

- Why is it that pharrington's example did work, even though the hosts
he used (I copied them) were never resolved before on my server? I
should have had the same problem, right?

- I've used net-dns before because what I really want to do is check a
domain for existence in the DNS-records of a certain server. However,
the problem with this is that because of the TTL, when I query a
domain every hour, a domain that has been deleted from the DNS doesn't
really get deleted from the "visible" records for 4 - 24 hours. That's
something I really need to work around, do you have an idea how I can
get that to work?

- Why is it that most of the logged messages are 10 seconds apart?
That should tell me something, but I'm unsure what.

Thanks again.

Jaap

> - Why is it that pharrington's example did work, even though the hosts
> he used (I copied them) were never resolved before on my server? I
> should have had the same problem, right?

maybe, maybe not - not sure what varies between different dns lookups.

> - I've used net-dns before because what I really want to do is check a
> domain for existence in the DNS-records of a certain server. However,
> the problem with this is that because of the TTL, when I query a
> domain every hour, a domain that has been deleted from the DNS doesn't
> really get deleted from the "visible" records for 4 - 24 hours. That's
> something I really need to work around, do you have an idea how I can
> get that to work?

Are you really working around it by using net/http ?

> - Why is it that most of the logged messages are 10 seconds apart?
> That should tell me something, but I'm unsure what.

Is this still inside backgroundrb or have you managed to reproduce
this outside background rb ?

Fred

Jaap, which Ruby VM are you using? Also, are you still using BackgroundRB?

-Conrad

Hi Fred,

> On Nov 19, 1:22 pm, jhaagmans <jaap.haagm...@gmail.com> wrote:
> > Hi Fred,
> >
> > Thanks for your response. I hope you can answer three questions about
> > this.
> >
> > - Why is it that pharrington's example did work, even though the hosts
> > he used (I copied them) were never resolved before on my server? I
> > should have had the same problem, right?
>
> maybe, maybe not - not sure what varies between different dns lookups.

Me neither. That's why I was wondering.

> > - I've used net-dns before because what I really want to do is check a
> > domain for existence in the DNS-records of a certain server. However,
> > the problem with this is that because of the TTL, when I query a
> > domain every hour, a domain that has been deleted from the DNS doesn't
> > really get deleted from the "visible" records for 4 - 24 hours. That's
> > something I really need to work around, do you have an idea how I can
> > get that to work?
>
> Are you really working around it by using net/http ?

Good point. The answer is no. I thought I'd work around it because
with a full HTTP-request you'd not only query the DNS, you'd also
query the webserver, but if someone doesn't actually delete the files
from the server, you'd still get a 200-response.

Now I need to work around that as well and I doubt it's possible as we
don't actually control the DNS servers we use. I can't think of a
workaround for this.

> > - Why is it that most of the logged messages are 10 seconds apart?
> > That should tell me something, but I'm unsure what.
>
> Is this still inside backgroundrb or have you managed to reproduce
> this outside background rb ?

You're right, it's still inside backgroundrb. However, I ran
pharrington's example from BRB as well.

@Conrad:

[jaap@server06 ~]$ ruby -v
ruby 1.8.7 (2009-06-12 patchlevel 174) [i686-linux], MBARI 0x8770,
Ruby Enterprise Edition 20090928

And yes, I'm still using BRB, I really can't think of a way to avoid
using BRB. I need to query a few thousand hosts every hour, every day.

I really regret choosing Ruby/Rails for this particular application at
this point.

jhaagmans wrote, on 11/19/2009 03:34 PM:

> And yes, I'm still using BRB, I really can't think of a way to avoid
> using BRB. I need to query a few thousand hosts every hour, every day.

This translates to a few hosts each second. If I had this kind of load I
wouldn't use a background job scheduler but a queue manager, several
processes picking the requests from the queue and a custom-built
scheduler. In fact I'm doing this for several projects and I use
ActiveMessaging.

> I really regret choosing Ruby/Rails for this particular application at
> this point.

Ruby/Rails doesn't have much to do with your problems. Choosing the right
tool for the job is the issue.
I have a server running a web spider hand-coded in Ruby using the tools I
described above, and it successfully makes tens of thousands of HTTP
queries per hour. The only limit I hit is the RAM available, as I use
several Ruby processes (which, by the way, internally use threads to
handle simultaneous HTTP HEAD/GET requests efficiently). If I had, say,
32 GB I could probably hit 100k requests per hour. Given that I use a queue
manager, I could also add a second server and run queue processors on it
to get more capacity...
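
As a rough illustration of the queue-plus-workers idea on a much smaller scale (threads inside one process rather than separate processes behind ActiveMessaging, and not the spider described above), a fixed pool of workers can pull hosts from a shared queue. check_hosts, the worker count, and the :active/:inactive labels are all invented for this sketch:

```ruby
require 'thread'   # Queue and Mutex on older Rubies; harmless on newer ones

# Hypothetical helper: run the given block for every host using a fixed
# number of worker threads that pull hosts from a shared queue.
def check_hosts(hosts, worker_count = 4, &checker)
  queue = Queue.new
  hosts.each { |host| queue << host }
  results = {}
  lock = Mutex.new

  workers = Array.new(worker_count) do
    Thread.new do
      loop do
        host = begin
          queue.pop(true)        # non-blocking pop, raises ThreadError when empty
        rescue ThreadError
          break                  # queue drained, this worker is done
        end
        status = begin
          checker.call(host) ? :active : :inactive
        rescue StandardError
          :inactive              # any error counts as an inactive host
        end
        lock.synchronize { results[host] = status }
      end
    end
  end
  workers.each { |w| w.join }
  results
end

# Example call (needs require 'net/http' and network access):
# check_hosts(%w[rubyforge.org www.google.com]) do |host|
#   Net::HTTP.start(host) { |http| http.head('/').code.to_i < 500 }
# end
```

Capping the number of workers keeps a batch of a few thousand hosts from spawning a few thousand threads at once.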

Lionel

Jaap, please tell me if the following works for you:

require 'net/http'

module Enumerable
  # Run the block for each item in its own thread, then wait for all of them.
  def concurrently
    map { |item| Thread.new { yield item } }.each { |t| t.join }
  end
end

def hosts
  %w[rubyforge.org www.scala-lang.org www.google.com www.gamefaqs.com
     allrecipes.com m2k2.taigaforum.com youtube.com gitorious.org
     everything2.com]
end

hosts.concurrently do |host|
  begin
    puts "\nfetching host #{host} - #{Time.now}\n"
    client = Net::HTTP.start(host)
  rescue => e
    #store host as inactive
  ensure
    puts "\nfinished with host #{host} - #{Time.now}\n"
    # client is nil when Net::HTTP.start raised
    client.finish if client && client.active?
  end
end

-Conrad