Threads and net/http: am I missing something?

Hi,

I have a backgroundrb worker that gets triggered every second. When it's triggered, it's supposed to make 2 - 15 http-requests using Net::HTTP. My idea was to put every execution into a thread so the next execution doesn't have to wait for the last one. So basically:

    def http_requests
      hosts.each do |host|
        Thread.new do
          begin
            client = Net::HTTP.start(host)
          rescue
            #store host as inactive
          ensure
            client.finish if client.active?
          end
        end
      end
    end

Of course that's not all it does, but I hope you understand what I'm trying to do here.

The thing is: this doesn't get done once a second. It appears that every HTTP-request is waiting for the last one to complete, which clots up Rails very fast!

My question is: why is this? Does this have anything to do with Ruby not being threadsafe (I doubt it, because that just means threads aren't executed as precisely as with jRuby, right?) or is Net::HTTP not able to make requests while another Net::HTTP request is still running? And what to do?

I hope you can help.

Update: I've tried doing it using EventMachine, which won't work either:

    EM.run do
      make_request(host)
    end

    def make_request
      begin
        client = EventMachine::HttpRequest.new(host).get
        host.set_active
      rescue
        host.set_inactive
      ensure
        EM.stop
      end
    end

Now it's only executing the set_active or set_inactive methods every five seconds, even though I run this every second for 5 hosts. It should query 5 hosts per second. What am I missing? Is this EM-related? Should I have EventMachine check for responses more often?


The Global Interpreter Lock (GIL) prevents threads from executing in parallel when using Ruby 1.8.6 aka MRI, 1.8.7, and 1.9.1 aka YARV. However, JRuby 1.3.x/1.4.x, MacRuby 0.5 Beta 2, Maglev and several other upcoming Ruby VMs are not constrained by the GIL. Thus, they can execute threads in parallel.

Good luck,

-Conrad

Emm just because the threads aren't all executing *simultaneously* doesn't mean that they aren't running in parallel (due to all the thread switching etc).

Regardless, I can't seem to reproduce the OP's behaviour:

    require 'net/http'

    def hosts
      %w[rubyforge.org www.scala-lang.org www.google.com www.gamefaqs.com
         allrecipes.com m2k2.taigaforum.com youtube.com gitorious.org
         everything2.com]
    end

    def http_requests
      hosts.each do |host|
        Thread.new do
          begin
            puts "fetching host #{host}"
            client = Net::HTTP.start(host)
          rescue => e
            # store host as inactive
          ensure
            puts "finished with host #{host}"
            # client is nil when Net::HTTP.start itself raised
            client.finish if client && client.active?
          end
        end
      end
    end

    irb(main):001:0> http_requests
    fetching host rubyforge.orgfetching host www.scala-lang.orgfetching host www.google.com
    fetching host www.gamefaqs.com
    fetching host allrecipes.com
    fetching host m2k2.taigaforum.com
    finished with host rubyforge.orgfetching host youtube.com
    fetching host gitorious.org
    finished with host www.google.com
    fetching host everything2.com
    => ["rubyforge.org", "www.scala-lang.org", "www.google.com", "www.gamefaqs.com", "allrecipes.com", "m2k2.taigaforum.com", "youtube.com", "gitorious.org", "everything2.com"]
    irb(main):002:0>
    finished with host www.scala-lang.org
    finished with host m2k2.taigaforum.comfinished with host www.gamefaqs.com
    finished with host allrecipes.com
    finished with host youtube.com
    finished with host everything2.com
    finished with host gitorious.org

Am I missing something or misunderstanding the question?

Also "threadsafe" doesn't have anything to do with language itself (or even its implementations) but rather if code operates correctly and predictably when run in parallel. But blah terminology.

pharrington wrote, on 11/18/2009 07:23 PM:

> Emm just because the threads aren't all executing *simultaneously* doesn't mean that they aren't running in parallel (due to all the thread switching etc).
>
> Regardless, I can't seem to reproduce the OP's behaviour:

I'm too lazy to check the details, but I'd look at the implementation details of BackgroundRB and EventMachine. I suspect the way they use select calls may interact badly with Net::HTTP.

For example, last time I checked, if you wrap an HTTP GET in a timeout block, the timeout doesn't work: the internal Net::HTTP timeouts take precedence and disable the global timeout. Timeout uses a thread which calls sleep, which is implemented with select, IIRC...
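For reference, the internal Net::HTTP timeouts mentioned here can be set explicitly rather than relying on an outer timeout block. The following is only a minimal sketch, assuming a current Ruby (keyword arguments are newer than the 1.8.7 used elsewhere in this thread); `build_client` and the timeout values are made up for illustration.

```ruby
require 'net/http'

# Hypothetical helper: build an unopened Net::HTTP client with explicit
# connect and read timeouts (in seconds). No connection is made here.
def build_client(host, open_timeout: 2, read_timeout: 5)
  http = Net::HTTP.new(host, 80)
  http.open_timeout = open_timeout   # max wait for the TCP connection
  http.read_timeout = read_timeout   # max wait for each read from the socket
  http
end

client = build_client('www.example.com')
puts client.open_timeout
puts client.read_timeout
```

Calling `client.start` afterwards opens the connection; with these settings a slow host should fail after a bounded wait instead of holding its thread indefinitely.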

Lionel


> Emm just because the threads aren't all executing simultaneously doesn't mean that they aren't running in parallel (due to all the thread switching etc).

Each thread must acquire the lock before it can execute. Thus, it operates similarly to a queue data structure (i.e. first in, first out (FIFO)), and this is how it works today in Ruby 1.8.6, 1.8.7, and 1.9.1. I know the C implementation of the Ruby VM very well.

-Conrad

Wow, thanks for all your help, greatly appreciated.

> The Global Interpreter Lock (GIL) prevents threads from executing in parallel when using Ruby 1.8.6 aka MRI, 1.8.7, and 1.9.1 aka YARV. However, JRuby 1.3.x/1.4.x, MacRuby 0.5 Beta 2, Maglev and several other upcoming Ruby VMs are not constrained by the GIL. Thus, they can execute threads in parallel.

The reason we didn't choose JRuby is that it uses too much memory to run on a VPS. Is there any documentation available on using JRuby on a low-memory (<256MB) system? I've looked for it, but couldn't find any. Maybe there's an alternative workaround for the GIL? Our application uses quite a lot of memory, so when presented with the JRuby vs. Ruby (EE) question, I thought it was a choice between thread safety and memory usage, and I chose the latter. I didn't know there was more to think about.

> Regardless, I can't seem to reproduce the OP's behaviour:

Which Ruby implementation are you using? I'm very sure every thread in my piece of code is waiting for the other thread to finish, because I log the time at which the data is saved. Most of the time there are 10 - 40 seconds between them, even though the backgroundrb process should save at least one object every second.

> Each thread must acquire the lock before it can execute. Thus, it operates similarly to a queue data structure (i.e. first in, first out (FIFO)), and this is how it works today in Ruby 1.8.6, 1.8.7, and 1.9.1. I know the C implementation of the Ruby VM very well.
>
> -Conrad

The C code will acquire the GIL, yes, and then release it when it's done its bit of business. This will happen any number of times within a given function. So yes, while the first thread created is the first to run its bit of code, in no way does that mean it's the first thread to finish, nor does it stop the interpreter from switching control to another thread when the lock is given up in the middle of execution. Saying Ruby threads don't run in parallel is even less true than saying coroutines aren't a form of parallelism.
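A small timing experiment illustrates the point about control switching while a thread is blocked. This is only a sketch: `sleep` stands in for a blocking network call, and the thread count and durations are arbitrary. If the threads really ran strictly one after another, five waits of 0.2 seconds would take about one second in total.

```ruby
# Five threads each block for 0.2 s; sleep stands in for waiting on a socket.
started = Time.now
threads = (1..5).map do |n|
  Thread.new do
    sleep 0.2
    n            # the block's value is available through Thread#value
  end
end
results = threads.map { |t| t.value }   # Thread#value joins the thread first
elapsed = Time.now - started

puts "results: #{results.inspect}"
puts format('elapsed: %.2f s', elapsed)
```

On MRI the elapsed time should come out close to 0.2 seconds rather than one second, because a sleeping or I/O-blocked thread lets the others run.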

> The reason we didn't choose JRuby is that it uses too much memory to run on a VPS. Is there any documentation available on using JRuby on a low-memory (<256MB) system? I've looked for it, but couldn't find any. Maybe there's an alternative workaround for the GIL? Our application uses quite a lot of memory, so when presented with the JRuby vs. Ruby (EE) question, I thought it was a choice between thread safety and memory usage, and I chose the latter. I didn't know there was more to think about.

> > Regardless, I can't seem to reproduce the OP's behaviour:

> Which Ruby implementation are you using? I'm very sure every thread in my piece of code is waiting for the other thread to finish, because I log the time at which the data is saved. Most of the time there are 10 - 40 seconds between them, even though the backgroundrb process should save at least one object every second.

    xeno@Clover:~/projects$ ruby -v
    ruby 1.8.7 (2009-06-12 patchlevel 174) [x86_64-linux]

Threading's not the issue. Try looking into Lionel's suggestion; perhaps things will work as expected if you switch to delayed_job instead of BackgroundRB, or posssssibly even just a different HTTP client like curb. Or perhaps the problem's not related to the HTTP fetching at all and we don't have the whole story.


If the problem is in BackgroundRB I'm screwed. This whole system depends on BackgroundRB. It's not just a long-running task, most tasks will run daily for years and years to come. I'll check whether this is related to BackgroundRB and report back.

Well, BackgroundRB does have a problem because workers can't overlap, so a worker is put in the queue when it's started, which is a bit of a pain because some HTTP-requests take longer than others. But look at this:

    def schedule_queries
      i = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,35]
      i.each do |j|
        Thread.new do
          logger.info "Hello #{j}"
          sleep 2
          logger.info "Hello Again #{j}"
        end
        sleep 1
      end
    end

This is executed every 10 seconds and my log shows this:

    Hello 1
    Hello 2
    Hello Again 1
    Hello 3
    Hello Again 2
    Hello 4
    Hello Again 3
    Hello 5
    Hello Again 4
    Hello 6
    Hello Again 5

    (...)

    Hello Again 23
    Hello 35
    Hello Again 24
    Hello Again 35
    Hello 1
    Hello 2
    Hello Again 1
    Hello 3

The expected behaviour is for the second "Hello 1" to appear after the first "Hello 10", but it doesn't. However, that's not a real problem, because I can easily work with bigger collections. However, when I implement the script you presented earlier, it works! I don't know why, I certainly don't know how, but it does.

Could this have anything to do with ActiveRecord? Every time I call set_active, I do a host.save. Could that be the problem? I'll work on this some more in the morning.

> > Each thread must acquire the lock before it can execute. Thus, it operates similarly to a queue data structure (i.e. first in, first out (FIFO)), and this is how it works today in Ruby 1.8.6, 1.8.7, and 1.9.1. I know the C implementation of the Ruby VM very well.
> >
> > -Conrad
>
> The C code will acquire the GIL, yes, and then release it when it's done its bit of business. This will happen any number of times within a given function. So yes, while the first thread created is the first to run its bit of code, in no way does that mean it's the first thread to finish, nor does it stop the interpreter from switching control to another thread when the lock is given up in the middle of execution. Saying Ruby threads don't run in parallel is even less true than saying coroutines aren't a form of parallelism.

Actually, the C code doesn't require a GIL because it's being executed outside the VM within its own process. Thus, one can simulate very good parallel execution using something like BackgroundRB because it's implemented as a Ruby native extension.

-Conrad

I'm a little further on this. I've started logging the process instead of writing to ActiveRecord. This is my code:

    def schedule_queries
      t = Time.now
      hosts = get_hosts(30)
      logger.info "Starting request for #{hosts.count} domains at #{t}"
      hosts.each do |host|
        Thread.new do
          begin
            logger.info "Making request for #{host.identifier} at #{Time.now}"
            client = Net::HTTP.start(host.url)
            #set_active(host)
            logger.info "Finished request for #{host.identifier} at #{Time.now}"
          rescue
            #set_inactive(host)
            logger.info "Error in request for #{host.identifier} at #{Time.now}"
          ensure
            # client is nil when Net::HTTP.start itself raised
            client.finish if client && client.active?
          end
        end
      end
    end

The log shows this:

    Starting request for 30 domains at Thu Nov 19 11:50:01 +0100 2009
    Making request for dym at Thu Nov 19 11:50:01 +0100 2009
    Making request for nsn at Thu Nov 19 11:50:11 +0100 2009
    Finished request for dym at Thu Nov 19 11:50:21 +0100 2009
    Making request for ren at Thu Nov 19 11:50:21 +0100 2009
    Finished request for nsn at Thu Nov 19 11:50:31 +0100 2009
    Making request for ixf at Thu Nov 19 11:50:31 +0100 2009
    Finished request for ren at Thu Nov 19 11:50:41 +0100 2009
    Making request for phw at Thu Nov 19 11:50:41 +0100 2009
    Finished request for ixf at Thu Nov 19 11:50:51 +0100 2009
    Making request for frk at Thu Nov 19 11:50:51 +0100 2009
    Finished request for phw at Thu Nov 19 11:51:01 +0100 2009
    Making request for gyt at Thu Nov 19 11:51:01 +0100 2009
    Finished request for frk at Thu Nov 19 11:51:11 +0100 2009
    Making request for nlb at Thu Nov 19 11:51:11 +0100 2009
    Finished request for gyt at Thu Nov 19 11:51:21 +0100 2009
    Making request for tdz at Thu Nov 19 11:51:21 +0100 2009
    Error in request for tdz at Thu Nov 19 11:51:39 +0100 2009
    Finished request for nlb at Thu Nov 19 11:51:39 +0100 2009

As you can see, it does do -some- threading, but it finishes requests only once every 10 seconds or so. What am I doing wrong? pharrington's example works for me, but this one doesn't.

Some things can block the entire ruby VM - you may be falling foul of one of them. In particular, domain name resolution can do that (there is a pure ruby dns resolver which doesn't have that caveat)

Fred
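The pure-Ruby resolver referred to above is presumably `Resolv` from the standard library; requiring `resolv-replace` additionally makes the socket classes use it for their lookups. A minimal sketch follows, in which the `resolve` helper, the injectable resolver argument and the two-second limit are illustrative assumptions rather than anything taken from this thread.

```ruby
require 'resolv'
require 'timeout'

# Hypothetical helper: look up a host with the pure-Ruby resolver and give up
# after `seconds`. Returns the address as a String, or nil on failure.
def resolve(host, resolver = Resolv::DefaultResolver, seconds = 2)
  Timeout.timeout(seconds) { resolver.getaddress(host).to_s }
rescue Resolv::ResolvError, Timeout::Error
  nil
end

# Usage (performs a real lookup, so it is left commented out):
#   address = resolve('rubyforge.org')
#   puts address ? "resolved to #{address}" : 'lookup failed'
```

Because the lookup runs in Ruby code rather than inside a blocking C call, other threads can keep running while one lookup is waiting for an answer.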

Hi Fred,

Thanks for your response. I hope you can answer three questions about this.

- Why is it that pharrington's example did work, even though the hosts he used (I copied them) were never resolved before on my server? I should have had the same problem, right?

- I've used net-dns before because what I really want to do is check a domain for existence in the DNS-records of a certain server. However, the problem with this is that because of the TTL, when I query a domain every hour, a domain that has been deleted from the DNS doesn't really get deleted from the "visible" records for 4 - 24 hours. That's something I really need to work around, do you have an idea how I can get that to work?

- Why is it that most of the logged messages are 10 seconds apart? That should tell me something, but I'm unsure what.

Thanks again.

Jaap

> Hi Fred,
>
> Thanks for your response. I hope you can answer three questions about this.
>
> - Why is it that pharrington's example did work, even though the hosts he used (I copied them) were never resolved before on my server? I should have had the same problem, right?

Maybe, maybe not. I'm not sure what varies between different DNS lookups.

> - I've used net-dns before because what I really want to do is check a domain for existence in the DNS-records of a certain server. However, the problem with this is that because of the TTL, when I query a domain every hour, a domain that has been deleted from the DNS doesn't really get deleted from the "visible" records for 4 - 24 hours. That's something I really need to work around, do you have an idea how I can get that to work?

Are you really working around it by using net/http?

> - Why is it that most of the logged messages are 10 seconds apart? That should tell me something, but I'm unsure what.

Is this still inside BackgroundRB, or have you managed to reproduce this outside BackgroundRB?

Fred


Jaap, which Ruby VM are you using? Also, are you still using BackgroundRB?

-Conrad

Hi Fred,

> On Nov 19, 1:22 pm, jhaagmans <jaap.haagm...@gmail.com> wrote:
> > Hi Fred,
> >
> > Thanks for your response. I hope you can answer three questions about this.
> >
> > - Why is it that pharrington's example did work, even though the hosts he used (I copied them) were never resolved before on my server? I should have had the same problem, right?
>
> maybe, maybe not - not sure what varies between different dns lookups.

Me neither. That's why I was wondering.

> > - I've used net-dns before because what I really want to do is check a domain for existence in the DNS-records of a certain server. However, the problem with this is that because of the TTL, when I query a domain every hour, a domain that has been deleted from the DNS doesn't really get deleted from the "visible" records for 4 - 24 hours. That's something I really need to work around, do you have an idea how I can get that to work?
>
> Are you really working around it by using net/http ?

Good point. The answer is no. I thought I'd work around it because with a full HTTP-request you'd not only query the DNS, you'd also query the webserver, but if someone doesn't actually delete the files from the server, you'd still get a 200-response.

Now I need to work around that as well and I doubt it's possible as we don't actually control the DNS servers we use. I can't think of a workaround for this.

> > - Why is it that most of the logged messages are 10 seconds apart? That should tell me something, but I'm unsure what.
>
> Is this still inside backgroundrb or have you managed to reproduce this outside background rb ?

You're right, it's still inside backgroundrb. However, I ran pharrington's example from BRB as well.

@Conrad:

    [jaap@server06 ~]$ ruby -v
    ruby 1.8.7 (2009-06-12 patchlevel 174) [i686-linux], MBARI 0x8770, Ruby Enterprise Edition 20090928

And yes, I'm still using BRB, I really can't think of a way to avoid using BRB. I need to query a few thousand hosts every hour, every day.

I really regret choosing Ruby/Rails for this particular application at this point.

jhaagmans wrote, on 11/19/2009 03:34 PM:

> And yes, I'm still using BRB, I really can't think of a way to avoid using BRB. I need to query a few thousand hosts every hour, every day.

This translates to a few hosts each second. If I had this kind of load I wouldn't use a background job scheduler but a queue manager, several processes picking the requests from the queue and a custom-built scheduler. In fact I'm doing this for several projects and I use ActiveMessaging.

> I really regret choosing Ruby/Rails for this particular application at this point.

Ruby/Rails has not much to do with your problems. Choosing the right tool for the job is the issue. I have a server running a web spider hand-coded in Ruby using the tools I described above, and it successfully makes tens of thousands of HTTP queries per hour. The only limit I hit is the RAM available, as I use several Ruby processes (which, by the way, internally use threads to handle simultaneous HTTP HEAD/GET requests efficiently). If I had, say, 32G I could probably hit 100k requests per hour. Given I use a queue manager, I could also add a second server and run queue processors on it to get more capacity...

Lionel
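Within a single process, the queue-plus-workers layout described above can be approximated with Ruby's built-in `Queue` and a fixed pool of threads. The sketch below is not the ActiveMessaging setup itself; `check_hosts`, the pool size and the `probe` callable are names invented for illustration, and the probe is injected so that any HTTP client can be plugged in. It assumes a current Ruby with keyword arguments.

```ruby
require 'thread'

# Hypothetical sketch: a fixed pool of worker threads draining a queue of hosts.
# `probe` is any callable that returns true/false for a host and may raise.
def check_hosts(hosts, pool_size: 4, probe:)
  queue = Queue.new
  hosts.each { |host| queue << host }

  results = {}
  lock = Mutex.new

  workers = Array.new(pool_size) do
    Thread.new do
      loop do
        host = begin
          queue.pop(true)      # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break                # queue drained, let this worker finish
        end
        active = begin
          probe.call(host)
        rescue StandardError
          false                # treat any failure as "inactive"
        end
        lock.synchronize { results[host] = active }
      end
    end
  end

  workers.each(&:join)
  results
end
```

A probe for real use could wrap a Net::HTTP HEAD request; the pool size then caps how many requests are in flight at once, so one slow host only ties up a single worker.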


Jaap, please tell me if the following works for you:

    require 'net/http'

    module Enumerable
      # Run the block for each item in its own thread and wait for all of them.
      def concurrently
        map { |item| Thread.new { yield item } }.each { |t| t.join }
      end
    end

    def hosts
      %w[rubyforge.org www.scala-lang.org www.google.com www.gamefaqs.com
         allrecipes.com m2k2.taigaforum.com youtube.com gitorious.org
         everything2.com]
    end

    hosts.concurrently do |host|
      begin
        puts "\nfetching host #{host} - #{Time.now}\n"
        client = Net::HTTP.start(host)
      rescue => e
        # store host as inactive
      ensure
        puts "\nfinished with host #{host} - #{Time.now}\n"
        # client is nil when Net::HTTP.start itself raised
        client.finish if client && client.active?
      end
    end

-Conrad