different code for each record, how to implement??

Hi,

I want to screen scrape information from some websites (I have permission to do it).

I am using the Mechanize plugin. The websites are different from each other, so I need to write different RoR code to screen scrape each website. There would be hundreds of different websites.

Ok, the problem is that I don't know how to implement this in an elegant and efficient way. My current quick and dirty solution is a model that I call when I want to screen scrape a website:

I call it like: Spider.crawl(website_id)

It looks like:

require 'mechanize'

class Spider < ActiveRecord::Base

  # Defined as a class method, since it is called as Spider.crawl(website_id)
  def self.crawl(website_id)

    if website_id == 1
      # Mechanize code for screen scraping website 1
    end

    if website_id == 2
      # Mechanize code for screen scraping website 2
    end

    # .....

  end

end

How can I improve that? Is there at least a way to put the code for each website in an external file, so then I can call just the code I need? That way I would avoid working with a model that has thousands of lines...

Thanks for your help!

Here are my off-the-top-of-my-head suggestions:

Different Thor scripts for each website, perhaps with a single script to call the rest of them.

I did something similar for scraping shopping cart information. Since I needed the same data on every page I wrote a generic crawler which would read the XPath string from the database for each item I wanted to scrape. Worked well.
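The data-driven approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the SITE_RULES hash stands in for the database table, extract_fields and the field names are hypothetical, and REXML (which ships with Ruby) is used only because the sample page is well-formed XML. Real HTML would normally need an HTML parser instead.

```ruby
require 'rexml/document'

# Hypothetical stand-in for a database table: one set of XPath strings per site.
SITE_RULES = {
  1 => { title: "//h1", price: "//span[@class='price']" },
  2 => { title: "//div[@id='name']", price: "//p[@class='cost']" }
}.freeze

# Generic extractor: the same code runs for every site, only the rules differ.
def extract_fields(markup, rules)
  doc = REXML::Document.new(markup)
  rules.each_with_object({}) do |(field, xpath), result|
    node = REXML::XPath.first(doc, xpath)
    # A missing node yields nil instead of raising.
    result[field] = node && node.text.to_s.strip
  end
end

# Sample page (assumed to be well-formed so REXML can parse it).
page = "<html><body><h1>Blue Widget</h1><span class='price'>19.99</span></body></html>"
fields = extract_fields(page, SITE_RULES[1])
puts fields.inspect
```

Adding a new site then means adding a row of XPath strings rather than writing new code, which only works when every site exposes the same kind of data.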

If you just want to split it up then provide a set of models (not based on ActiveRecord), one for each site and call the scrape method from your switch list (which would be better as a case statement). If you derive them all from a common base then you can put any common code in the base.

Colin
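A minimal sketch of that suggestion, assuming plain Ruby classes with a shared base and a case statement for dispatch. All class names, the scraper_for helper and the returned hashes are hypothetical placeholders; the real Mechanize code would go inside each extract method. In a Rails app each class could live in its own file (for example somewhere under lib/), which also keeps any single file short.

```ruby
# Common base: code shared by every site lives here.
class BaseScraper
  # Shared entry point used by all scrapers.
  def scrape
    normalize(extract)
  end

  private

  # Each site-specific subclass must provide its own extract.
  def extract
    raise NotImplementedError, "#{self.class} must implement #extract"
  end

  # Example of common code: trim whitespace from every scraped value.
  def normalize(record)
    record.transform_values { |value| value.to_s.strip }
  end
end

class BookShopScraper < BaseScraper
  private

  def extract
    # Mechanize code for website 1 would go here (placeholder data below).
    { name: " Ruby Book ", price: "10" }
  end
end

class ToyShopScraper < BaseScraper
  private

  def extract
    # Mechanize code for website 2 would go here (placeholder data below).
    { name: "Robot", price: " 25 " }
  end
end

# The "switch list" written as a case statement.
def scraper_for(website_id)
  case website_id
  when 1 then BookShopScraper.new
  when 2 then ToyShopScraper.new
  else raise ArgumentError, "no scraper for website #{website_id}"
  end
end

result = scraper_for(1).scrape
puts result.inspect
```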

Hi, you can define a base class which contains all the common information for all your sites. Then you can define a subclass for each site that inherits from the base class. For example,

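class Site
  attr_accessor :name

  # to_s must return a String, so build the text instead of printing it
  def to_s
    "using #{self.class}#to_s"
  end

  # Default strategy; each subclass overrides this
  def crawl
    puts "using #{self.class}#crawl"
  end
end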

class HerSite < Site
  def crawl
    puts "using #{self.class}#crawl version 1"
  end
end

class HisSite < Site
  def crawl
    puts "using #{self.class}#crawl version 2"
  end
end

Next, you can define a SiteFactory class for creating an instance of the given class which represents our site. Thus, this can be represented as follows:

class SiteFactory
  def create( site )
    site.new
  end
end

We can define our Spider class so that it has a single class method that takes an instance of a site and invokes its crawl instance method.

class Spider
  def self.crawl_site( site )
    site.crawl
  end
end

Putting it all together, we can crawl all of our sites by doing the following:

site_factory = SiteFactory.new

[ HerSite, HisSite ].each do | klass |
  site = site_factory.create( klass )
  Spider.crawl_site( site )
end

Finally, anytime you want to add a new site, you just create a class that inherits from class Site and has a single instance method called crawl that describes its strategy for navigating the site. There's an easier way to obtain all the classes that inherit from class Site, and I leave this as an exercise for you.

Good luck,

-Conrad

The SiteFactory class above can be refactored to use a class method:

class SiteFactory
  def self.create( site )
    site.new
  end
end

Now, we can rewrite our calling routine to the following:

[ HerSite, HisSite ].each do | klass |
  site = SiteFactory.create( klass )
  Spider.crawl_site( site )
end

Enjoy,

-Conrad

ps: There’s always something you missed after you click send.

The above class can be refactored to the following:

class SiteFactory
  def self.create( site )
    site.new
  end
end

I'm just curious, what exactly is the point of this class?

Now, we can rewrite our calling routine to the following:

[ HerSite, HisSite ].each do | klass |
  site = SiteFactory.create( klass )
  Spider.crawl_site( site )
end

Seems needlessly verbose, why not just get rid of the factory that isn't doing anything and just do...

     [ HerSite, HisSite ].each do | klass |         Spider.crawl_site(klass.new)      end

In fact, why not just...

     Site.subclasses.each { | klass | Spider.crawl_site(klass.new) }

Forgive me, I'm a Smalltalker, but this whole explicit factory business and explicit arrays of classes just looks too Java'ish in an object system with meta classes and reflection. Is there some reason you wouldn't just reflect the subclasses? Is there some reason for a factory that does nothing? Even if you need a factory, why wouldn't you just use class methods on Site?

Yes, iterating over Site.subclasses is possible, but I can see cases where just getting all the subclasses of a class might not be what you want.

Next, Ruby 1.9.2/1.9.3dev doesn't provide a built-in method called subclasses the way Smalltalk does. Thus, one could implement a subclasses method in Ruby as follows:

class Class
  def subclasses
    # select all the classes that are derived from self (i.e. Site)
    ObjectSpace.each_object(Class).select { |klass| klass < self }
  end
end

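This requires reopening the class called Class and defining a method called subclasses on it. Alternatively, one can use the built-in Ruby hook method called inherited to arrive at the same result. For example,

class Site
  # Class-level instance variable that collects each direct subclass
  @subclasses = []

  class << self
    attr_reader :subclasses
  end

  # Ruby calls this hook every time a class inherits from Site
  def self.inherited( klass )
    @subclasses << klass
  end

  # to_s must return a String, so build the text instead of printing it
  def to_s
    "using #{self.class}#to_s"
  end

  def crawl
    puts "using #{self.class}#crawl version 0"
  end
end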
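As a usage sketch of the inherited hook, the following shows the registry being filled and then crawled without a factory. It makes a few assumptions that are not in the thread: the list is called registry rather than subclasses (newer Ruby releases, 3.1 onward as far as I know, ship their own Class#subclasses), crawl returns a string instead of printing so the result can be inspected, and HerMobileSite is a made-up grandchild class. The hook always appends to Site's own list so that classes derived from a subclass are recorded too.

```ruby
class Site
  # Class-level list of every class that descends from Site.
  @registry = []

  class << self
    attr_reader :registry

    # Called by Ruby whenever a class inherits from Site or from one of
    # its descendants; always record on Site itself.
    def inherited(klass)
      super
      Site.registry << klass
    end
  end

  # Default strategy, returned as a string for easy inspection.
  def crawl
    "#{self.class}#crawl"
  end
end

class HerSite < Site; end

class HisSite < Site
  def crawl
    "#{self.class}#crawl version 2"
  end
end

# Hypothetical grandchild: inherits from HerSite, still ends up in Site.registry.
class HerMobileSite < HerSite; end

# Crawl every registered site.
results = Site.registry.map { |klass| klass.new.crawl }
puts results
```

With this in place, adding a site is just a matter of defining a new subclass. One consequence worth keeping in mind is that every descendant is crawled, including abstract intermediate classes if there are any.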

Ramon, you’re correct in saying that the SiteFactory class could be removed for a much more concise solution.

-Conrad

Thank you all so much. I did it like you said, with a set of models not based on ActiveRecord.

Best regards,

Cristóbal