ScrAPI HTTPNoAccessError

11155 · April 2, 2010, 11:31pm

Hi,

I'm having some problems using scrAPI. I'm getting HTTPNoAccessErrors on certain URLs.

The program searches a page (List of films - Wikiquote) for all of the links on it that go to pages with movie quotes on them.

It then loops through the list, pulling out the details from each page using this method:

    def self.scrapemovies
      Scraper::Base.parser :html_parser

      urlarray = Movie.findurls

      moviescraper = Scraper.define do
        process "h1", :name => :text
        process "p:nth-child(4)", :description => :text
        result :description, :name
      end

      urlarray.each do |url|
        fullurl = "http://en.wikiquote.org#{url}"
        movieurl = URI.parse(fullurl)
        data = moviescraper.scrape(movieurl)
        movie = Movie.new
        movie.url = fullurl
        movie.name = data.name
        movie.description = data.description
        movie.save
      end
    end

This worked OK until it got to 20,000 Leagues Under the Sea (1954 film) - Wikiquote, which gave me the HTTP error, apparently because it had a comma in the URL.

As a bodge just to get things working, I wrote a little bit of code in the Movie.findurls method that strips out any URLs containing commas or parentheses. But I'm still getting the error on this URL: 27 Dresses - Wikiquote, which is very odd, because the previous one (25th Hour - Wikiquote) worked fine.

I can't see the difference between them - I've tried manually visiting the page, and it's fine.

I'm assuming that I need to do some sort of cleverer parsing on the URLs (so that I can include the ones with commas and parentheses too).
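To make the "cleverer parsing" idea concrete, here is a minimal sketch that percent-encodes each path segment separately and leaves the scheme, host, and slashes alone. The helper name build_movie_uri and the /wiki/... paths are hypothetical, made up for illustration. It also assumes findurls returns site-relative hrefs, and it is not confirmed that encoding is what scrAPI is actually objecting to.

```ruby
require "cgi"
require "uri"

BASE = "http://en.wikiquote.org"

# Hypothetical helper: encode each path segment on its own so the
# slashes between segments survive. Segments are unescaped first so
# that an href which is already percent-encoded is not encoded twice.
def build_movie_uri(path)
  escaped = path.split("/", -1).map do |segment|
    # CGI.escape turns spaces into "+", which is only right for query
    # strings, so swap those for "%20" in a path.
    CGI.escape(CGI.unescape(segment)).gsub("+", "%20")
  end.join("/")
  URI.parse(BASE + escaped)
end

# Assumed example path; the real href on the page may differ.
uri = build_movie_uri("/wiki/20,000_Leagues_Under_the_Sea_(1954_film)")
puts uri
# prints http://en.wikiquote.org/wiki/20%2C000_Leagues_Under_the_Sea_%281954_film%29
```

One caveat with this approach: a literal "+" in an href would be read as a space by CGI.unescape, so it only suits paths that use underscores or %20 for spaces.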

Has the Scraper::Base.parser :html_parser line got anything to do with it? I couldn't get the Tidy plugin to work properly, but I'm not sure that has anything to do with the URL parsing anyway.

I'm totally stuck - thanks in advance for any help.

Jules.

11155 · April 2, 2010, 11:38pm

I should also add -

Before I got the findurls method to just strip out any URLs with non-standard characters, I tried this line:

fullurl.gsub!(",","%2C")

This replaced the commas with their percent-encoded form, but it didn't work either. Neither did wrapping the whole URL in CGI.escape("").
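For reference, here is a small sketch of what those two attempts produce as strings. The /wiki/... paths are assumed for illustration, not copied from the real pages. CGI.escape applied to a whole URL also encodes the colon and slashes, so the result no longer starts with a usable scheme. The gsub version leaves the URL structure intact, which suggests (without proving it) that the comma may not be the real cause of the error.

```ruby
require "cgi"

# Assumed example URLs; the actual paths on the site may differ.
comma_url = "http://en.wikiquote.org/wiki/20,000_Leagues_Under_the_Sea"
plain_url = "http://en.wikiquote.org/wiki/27_Dresses"

# Attempt 1: replace only the commas. Scheme and slashes are untouched.
comma_fixed = comma_url.gsub(",", "%2C")
puts comma_fixed
# prints http://en.wikiquote.org/wiki/20%2C000_Leagues_Under_the_Sea

# Attempt 2: escape the entire URL. ":" and "/" are encoded as well,
# so what comes out is no longer an absolute http URL.
escaped_whole = CGI.escape(plain_url)
puts escaped_whole
# prints http%3A%2F%2Fen.wikiquote.org%2Fwiki%2F27_Dresses
```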

The scrAPI documentation isn't particularly helpful about what format the URL needs to be in.
