I am trying to scrape a site and then its child pages, storing the data in related tables. The only problem is that I keep getting an "OUT OF BUFFER SPACE" error. Is there a way to clear the buffer after each iteration, or am I doing something wrong?
Here's the code:

    require 'rubygems'
    require 'mechanize'
    require 'active_record'

    ActiveRecord::Base.establish_connection(
      # connection goes here
    )

    class Major < ActiveRecord::Base
      has_many :courses
    end

    class Course < ActiveRecord::Base
      belongs_to :major
    end

    class Sections
      def scrape(url)
        agent = WWW::Mechanize.new
        page = agent.get(url)
        table = (page/'//table')[6]
        (table/"tr").each do |major|
          @newMajor = Major.new
          @newMajor.title = (major/'//td').first.inner_html
          @newMajor.abbrev = (major/'acronym').inner_html
          @newMajor.link_to = (major/'a').to_s.split('"')[1]
          puts @newMajor.title, @newMajor.abbrev, @newMajor.link_to
        end
      end
    end

    class Classes
      attr_writer :major_id

      def scrape(url)
        agent = WWW::Mechanize.new
        page = agent.get("http://courses.tamu.edu/" + url.to_s)
        (page/"//td[@class='sectionheading']").each do |course|
          course = course.inner_html.strip.split(' ')
          course.pop
          @newCourse = Course.new
          @newCourse.major_id = @major_id
          @newCourse.course_no = course[1]
          @newCourse.name = course.slice!(3,course.length).join(' ')
          @newCourse.save
        end
      end
    end

    AllMajors = Major.find(:all)
    AllMajors.each do |course|
      start = Time.now
      newClass = Classes.new
      newClass.major_id = course.id
      newClass.scrape(course.link_to)
      puts "Added courses for #{course.title}"
      finish = Time.now
      puts "Took #{finish-start} seconds"
    end
    puts "Finished scraping courses"
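
One variation that may be worth trying is to create a single agent once and hand it to every scraper, instead of calling `WWW::Mechanize.new` inside each `scrape`. The following is only a minimal sketch under assumptions: it assumes the error is related to a new agent (and its connections) being created on every iteration, which has not been confirmed. `CourseScraper`, `StubAgent`, and `StubPage` are hypothetical names, and the stubs stand in for a real Mechanize agent so the example runs without network access or a database.

```ruby
# Hypothetical stand-in for a fetched page; a real run would return a Mechanize page.
StubPage = Struct.new(:url)

# Hypothetical stand-in for a Mechanize agent; it only records the URLs requested.
class StubAgent
  attr_reader :requests

  def initialize
    @requests = []
  end

  def get(url)
    @requests << url
    StubPage.new(url)
  end
end

# Scraper that receives its agent instead of building a new one per call.
class CourseScraper
  BASE_URL = "http://courses.tamu.edu/"

  def initialize(agent)
    @agent = agent
  end

  # Fetches one child page through the shared agent and returns it.
  def scrape(path)
    @agent.get(BASE_URL + path.to_s)
  end
end

agent = StubAgent.new                 # created once, outside the loop
scraper = CourseScraper.new(agent)
links = ["a.html", "b.html"]          # placeholder paths, not real pages on the site
pages = links.map { |link| scraper.scrape(link) }
```

With the real library, the stub would be replaced by one `WWW::Mechanize.new` created before the `AllMajors.each` loop, so that every iteration goes through the same agent.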