I am trying to spider a site using Hpricot, but I keep getting an "out of buffer space" error. It will only let me process about two pages at a time. Is there a way to clear the buffer after I process each page so I won't blow it?
Can you post the code you are using?
    require 'rubygems'
    require 'hpricot'
    require 'open-uri'
    require 'active_record'

    ActiveRecord::Base.establish_connection(
      # connection info
    )

    class Major < ActiveRecord::Base
      has_many :courses
    end

    class Course < ActiveRecord::Base
      belongs_to :major
    end

    def scrape(url)
      doc = Hpricot(open(url))
      tables = (doc/"table")
      (tables[6]/"tr").each do |major|
        createMajor major
      end
    end

    def createMajor(data)
      newMajor = Major.new
      newMajor.title   = data.search("td").first.inner_html
      newMajor.abbrev  = data.search("acronym").inner_html
      newMajor.link_to = data.search("a").to_s.split('"')[1]
      puts newMajor.save
    end

    def courses(url)
      puts url
      doc = Hpricot(open("http://courses.tamu.edu/" + url.to_s))
      headings = (doc/"//td[@class='sectionheading']")
      headings.each do |course|
        createCourse course
      end
    end

    def createCourse(data)
      parts = data.inner_html.strip.split(' ')
      major     = parts[0]
      course_no = parts[1]
      puts major, course_no
      parts.pop                                    # drop the trailing token
      course_name = parts.slice!(3, parts.length).join(' ')
      puts course_name
    end

    all_majors = Major.find(:all, :limit => 3, :offset => 0)
    all_majors.each do |major|
      courses(major.link_to)   # courses takes a single url argument
    end
    # scrape(url goes here)
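The string-splitting in `createCourse` can be checked in isolation with plain Ruby. The sample heading below is made up; real headings on the site may be shaped differently:

```ruby
# Hypothetical section heading; actual markup from courses.tamu.edu may differ.
heading = "CSCE 121 - Intro Program Design 4"

parts = heading.strip.split(' ')
major     = parts[0]                     # "CSCE"
course_no = parts[1]                     # "121"
parts.pop                                # drop the trailing credit-hours token ("4")
course_name = parts.slice!(3, parts.length).join(' ')

puts major        # → CSCE
puts course_no    # → 121
puts course_name  # → Intro Program Design
```

Note that `slice!(3, ...)` assumes the name starts at the fourth token, so anything extra between the course number and the name ends up in `course_name`.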
This is what I was last testing with. I had it set up so that scrape would call courses, but that blew the buffer before I even got any output; this version outputs the data from two pages and then breaks.
Does anyone know anything about this buffer?
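If the error is Hpricot's "ran out of buffer space" ParseError, the usual remedy is not clearing anything between pages but enlarging the parser's fixed scan buffer, which a long element or attribute can overflow. A minimal sketch, assuming a version of Hpricot that exposes `Hpricot.buffer_size=` (guarded with a rescue so it still runs where the gem is not installed):

```ruby
# Sketch: raise Hpricot's scan buffer once, before any parsing is done.
begin
  require 'hpricot'
  Hpricot.buffer_size = 262_144   # well above the default; tune as needed
  configured = true
rescue LoadError
  configured = false              # hpricot gem missing in this environment
end
puts configured
```

With that set before the first `Hpricot(open(url))` call, larger pages should parse without tripping the buffer; documents from earlier pages are reclaimed by Ruby's garbage collector once their variables go out of scope.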