Hi,
I want to screen scrape information from some websites (I have
permission to do it).
I am using the Mechanize gem. The websites are different from each
other, so I need to write different Rails code to screen scrape each
website. There would be hundreds of different websites.
Ok, the problem is that I don’t know how to implement this in an
elegant and efficient way. My current quick and dirty solution is a
model that I call when I want to screen scrape a website:
I call it like: Spider.crawl(website_id)
It looks like:
require 'mechanize'

class Spider < ActiveRecord::Base
  # Class method, so that it can be called as Spider.crawl(website_id)
  def self.crawl(website_id)
    if website_id == 1
      # Mechanize code for screen scraping website 1
    end
    if website_id == 2
      # Mechanize code for screen scraping website 2
    end
    # .....
  end
end
How can I improve that?
Is there at least a way to put the code for each website in an
external file, so then I can call just the code I need? That way I
would avoid working with a model that has thousands of lines…
Thanks for your help!
Hi, you can define a base class which contains everything that is common to all your sites. Then you can define a subclass for each site that
inherits from the base class. For example,
class Site
  attr_accessor :name

  # to_s should return a String rather than print one
  def to_s
    "using #{self.class}#to_s"
  end

  # Default behaviour; each subclass overrides this with its own strategy
  def crawl
    puts "using #{self.class}#crawl"
  end
end

class HerSite < Site
  def crawl
    puts "using #{self.class}#crawl version 1"
  end
end

class HisSite < Site
  def crawl
    puts "using #{self.class}#crawl version 2"
  end
end
Next, you can define a SiteFactory class that creates an instance of the given class, which represents one of our sites. This can be written
as follows:
class SiteFactory
  # site_class is a class such as HerSite, not an instance
  def create(site_class)
    site_class.new
  end
end
We can then define our Spider class, which has a single class method that takes an instance of a site and invokes its crawl instance method.
class Spider
  def self.crawl_site(site)
    site.crawl
  end
end
Putting it all together, we can crawl all of our sites by doing the following:
site_factory = SiteFactory.new
[HerSite, HisSite].each do |klass|
  site = site_factory.create(klass)
  Spider.crawl_site(site)
end
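To tie this back to the original Spider.crawl(website_id) call, one option is a lookup table from id to class. This is only a sketch: the SITE_CLASSES constant and the ids 1 and 2 are made up for illustration, the classes are cut-down copies of the ones above so that the snippet runs on its own, and crawl returns a string here instead of printing it.

```ruby
# Cut-down copies of the classes above so this sketch is self-contained.
class Site
  def crawl
    "using #{self.class}#crawl"
  end
end

class HerSite < Site
  def crawl
    "using #{self.class}#crawl version 1"
  end
end

class HisSite < Site
  def crawl
    "using #{self.class}#crawl version 2"
  end
end

class Spider
  # Hypothetical mapping from website_id to the class that scrapes it.
  SITE_CLASSES = {
    1 => HerSite,
    2 => HisSite
  }.freeze

  # Replaces the chain of if statements with a single hash lookup.
  def self.crawl(website_id)
    klass = SITE_CLASSES.fetch(website_id) do
      raise ArgumentError, "no scraper registered for website #{website_id}"
    end
    klass.new.crawl
  end
end

puts Spider.crawl(1)
puts Spider.crawl(2)
```

Each site class can then live in its own file, so no single file grows to thousands of lines; adding a site means adding one file and one entry in the table.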
Finally, any time you want to add a new site, you just create a class that inherits from Site and has a single instance method called crawl that describes
its strategy for navigating the site. There’s an easier way to obtain all the classes that inherit from Site, and I leave this as an exercise for you.
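For readers who want a hint on that exercise, one possible approach (not necessarily the one meant above) is Ruby's Class#inherited hook, which is called every time a subclass is defined. The method name Site.registered below is invented for this sketch, and crawl returns a string instead of printing it so the result is easy to inspect.

```ruby
class Site
  # Hypothetical accessor: the list of subclasses recorded so far.
  def self.registered
    @registered ||= []
  end

  # Ruby calls this hook whenever a class inherits from Site
  # (directly or through another subclass).
  def self.inherited(subclass)
    super
    Site.registered << subclass
  end

  def crawl
    "using #{self.class}#crawl"
  end
end

class HerSite < Site
  def crawl
    "using #{self.class}#crawl version 1"
  end
end

class HisSite < Site
  def crawl
    "using #{self.class}#crawl version 2"
  end
end

# No hand-maintained list of classes is needed any more.
Site.registered.each do |klass|
  puts klass.new.crawl
end
```

The hook only fires once a class file has actually been loaded, so every site file has to be required before iterating. On Ruby 3.1 and later, the built-in Class#subclasses returns the direct subclasses without any hook.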
Good luck,
-Conrad