RegexpCrawler is a crawler that uses regular expressions to extract data from websites. If you are familiar with regular expressions, it is easy to use and needs very little code. The project site is: http://github.com/flyerhzm/regexp_crawler/tree
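To make the idea concrete, here is a minimal plain-Ruby sketch of the underlying technique: match a page body against a capture regexp, then label the captured groups by name. It does not use RegexpCrawler itself. The extract helper and the HTML snippet are hypothetical and chosen only for illustration.

```ruby
# Plain-Ruby sketch of regexp-based extraction.
# NOTE: extract and the sample HTML are assumptions for illustration;
# they are not part of RegexpCrawler's API.
def extract(body, capture_regexp, named_captures)
  match = capture_regexp.match(body)
  return nil unless match

  # Pair each name with the capture group in the same position.
  Hash[named_captures.map(&:to_sym).zip(match.captures)]
end

html = '<h1>regexp_crawler</h1><p class="desc">A crawler driven by regular expressions</p>'
regexp = %r{<h1>(.*?)</h1>.*?<p class="desc">(.*?)</p>}m

result = extract(html, regexp, ['title', 'description'])
puts result[:title]
puts result[:description]
```

The order of the names has to match the order of the capture groups in the regexp, which is why any grouping used only for structure should be non-capturing, written as (?:...).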
Here is an example: a script that synchronizes your GitHub projects, excluding forked projects. Please check example/github_projects.rb
require 'rubygems'
require 'regexp_crawler'

crawler = RegexpCrawler::Crawler.new(
  :start_page => "http://github.com/flyerhzm",
  # Links on a page that match this pattern are crawled next.
  :continue_regexp => %r{<div class="title"><b><a href="(/flyerhzm/.*?)">}m,
  # The three capturing groups are the title, the description and the body.
  :capture_regexp => %r{<a href="http://github.com/flyerhzm/[^/"]*?(?:/tree)?">(.*?)</a>.*<span id="repository_description".*?>(.*?)</span>.*(<div class="(?:wikistyle|plain)">.*?</div>)</div>}m,
  # Names for the captured groups, in order.
  :named_captures => ['title', 'description', 'body'],
  # Called with the captured result and the page URL.
  :save_method => Proc.new do |result, page|
    puts '============================='
    puts page
    puts result[:title]
    puts result[:description]
    puts result[:body][0..100] + "..."
  end,
  # Only parse project pages that are not forks.
  :need_parse => Proc.new do |page, response_body|
    page =~ %r{http://github.com/flyerhzm/\w+} && !response_body.index(/Fork of.*?<a href=".*?">/)
  end)
crawler.start
The results are as follows: