regexp_crawler -- a crawler which uses regular expressions to capture data from websites

flyerhzm · September 13, 2009, 12:57pm

RegexpCrawler is a crawler that uses regular expressions to capture data from websites. It is easy to use and takes little code if you are familiar with regular expressions. The project site is: http://github.com/flyerhzm/regexp_crawler/tree
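To make the basic idea concrete, here is a minimal sketch in plain Ruby of capturing fields from a page with one regular expression and pairing each capture group with a name. It does not use the gem and is not necessarily how RegexpCrawler works internally. The HTML snippet, the field names, and the variable names are all made up for illustration.

```ruby
# Plain-Ruby sketch of regexp-based capturing; it does not use the regexp_crawler gem.
# The HTML snippet and the field names below are hypothetical.
html = <<~HTML
  <a href="/flyerhzm/demo">demo</a>
  <span class="description">a sample project</span>
HTML

# Two capture groups: the link text and the description text.
# The m flag lets "." match newlines, so the pattern can span lines.
capture_regexp = %r{<a href="[^"]*">(.*?)</a>.*<span class="description">(.*?)</span>}m
names = [:title, :description]

match = capture_regexp.match(html)
# Pair each name with the capture group in the same position.
result = match ? names.zip(match.captures).to_h : {}

puts result[:title]
puts result[:description]
```

The order of the names has to follow the order of the capture groups in the pattern, which is why groups that should not produce a field are best written as non-capturing `(?:...)`.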

Here is an example: a script to synchronize your GitHub projects, excluding forked ones. See example/github_projects.rb.

require 'rubygems'
require 'regexp_crawler'

crawler = RegexpCrawler::Crawler.new(
  :start_page => "http://github.com/flyerhzm",
  :continue_regexp => %r{<div class="title"><b><a href="(/flyerhzm/.*?)">}m,
  :capture_regexp => %r{<a href="http://github.com/flyerhzm/[^/"]*?(?:/tree)?">(.*?)</a>.*<span id="repository_description".*?>(.*?)</span>.*(<div class="(?:wikistyle|plain)">.*?</div>)</div>}m,
  :named_captures => ['title', 'description', 'body'],
  :save_method => Proc.new do |result, page|
    puts '============================='
    puts page
    puts result[:title]
    puts result[:description]
    puts result[:body][0..100] + "..."
  end,
  :need_parse => Proc.new do |page, response_body|
    page =~ %r{http://github.com/flyerhzm/\w+} && !response_body.index(/Fork of.*?<a href=".*?">/)
  end)
crawler.start

The results are as follows: