Using nokogiri

Hi,

I want to grab some information about university names, and I found this term called "web scraping". I searched for it on Google, and there are tools in Ruby. One of them is Nokogiri, but I'm a bit confused because it seems that it only gets information that is already in an HTML or XML document.

I found a webpage that has a list of university names inside a

<select> </select> (HTML tag)

and I want to grab that information

The question is... can I do that with Nokogiri or another tool? The list is like a country list, but with the names of the universities of my country.

It seems that it gets that information from a DB using Ajax, and what I'm trying to do may not be legal or possible.

I'd really appreciate it if someone could help me understand what this tool is used for, and whether what I'm trying to do is possible.

Thanks

Javier Q


Take a look at some screencasts:

http://railscasts.com/episodes?utf8=%E2%9C%93&search=mechanize

http://railscasts.com/episodes/190-screen-scraping-with-nokogiri

http://www.engineyard.com/blog/2010/getting-started-with-nokogiri/

With Nokogiri, you can use CSS3 selectors to grab the information you want.
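For example, a minimal sketch -- the HTML snippet, the university names, and the "universities" id are all made up for illustration:

```ruby
require 'nokogiri'

# A small inline document standing in for the real page;
# the id "universities" is invented for this example.
html = <<-HTML
<select id="universities">
  <option value="1">Universidad de Chile</option>
  <option value="2">Universidad Catolica</option>
</select>
HTML

doc = Nokogiri::HTML(html)
# CSS selector: every <option> inside the element with that id
names = doc.css('#universities option').map(&:text)
puts names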

Best Regards,

Everaldo

Hi,

I want to grab some information about university names, and I found this term called "web scraping". I searched for it on Google, and there are tools in Ruby. One of them is Nokogiri, but I'm a bit confused because it seems that it only gets information that is already in an HTML or XML document.

Yes, Nokogiri is a toolkit for (among lots of other things) running XPath or CSS queries against a text file. That text file can be anything -- an IO stream of one sort or another with textual data in it will do.

I found a webpage that has a list of university names inside a

<select> </select> (HTML tag)

and I want to grab that information

The question is... can I do that with Nokogiri or another tool? The list is like a country list, but with the names of the universities of my country.

A select can be traversed like any other DOM object; this should be fairly close:

#given doc is a Nokogiri::XML or Nokogiri::HTML nodeset
doc.css('#yourPickerId option').each do |opt|
  foo = opt['value']
  #whatever else you want to do with foo here
end

It seems that it gets that information from a DB using Ajax, and what I'm trying to do may not be legal or possible.

If it's Ajax, you'll need to run a JavaScript interpreter against it. Rails 3.1 shows the way to do that server-side. Once you have munged the page into a text stream that includes the desired data (flattened it down to the result of the Ajax plus the base code), then Nokogiri or Hpricot or any other XML/HTML parser could rip through that DOM and give you individual nodes to play with.

I'd really appreciate it if someone could help me understand what this tool is used for, and whether what I'm trying to do is possible.

Possible, sure. It's never entirely clear why someone would run an Ajax request to populate a page. They may have done it to keep the scrapers out (like you), or they may have done it to isolate and accelerate a laggy part of the initial page load. If the latter (so they aren't actually discouraging you -- did you ask them if you could do this?) then you might also want to look into loading the endpoint of that Ajax request instead of the surrounding page, as that would eliminate the whole JavaScript abstraction entirely. You'd have one HTTP request, and unless that endpoint was kinked to only accept requests from within its own domain, you would likely have JSON or some other structured data in return, and that could be even easier to interpret in your application.
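If that endpoint does return JSON, the parsing side needs nothing beyond the standard library. A sketch -- the endpoint URL is hypothetical, and a canned payload stands in for the real response:

```ruby
require 'json'
require 'open-uri'

# In practice you would fetch the endpoint directly, e.g.:
#   body = open('http://example.com/universities.json').read
# (watch your browser's network tab to find the real URL).
# Here a canned payload stands in for the real response.
body = '[{"id": 1, "name": "Universidad de Chile"}, {"id": 2, "name": "Universidad Catolica"}]'

universities = JSON.parse(body)
universities.each { |u| puts u['name'] }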

Walter

A select can be traversed like any other DOM object; this should be fairly close:

#given doc is a Nokogiri::XML or Nokogiri::HTML nodeset
doc.css('#yourPickerId option').each do |opt|
  foo = opt['value']
  #whatever else you want to do with foo here
end

Thanks. In the Nokogiri example the result is accessed like "link.content", and that's why I was wondering how I can grab that information from the select group.

Possible, sure. It's never entirely clear why someone would run an Ajax request to populate a page. They may have done it to keep the scrapers out (like you), or they may have done it to isolate and accelerate a laggy part of the initial page load. If the latter (so they aren't actually discouraging you -- did you ask them if you could do this?) then you might also want to look into loading the endpoint of that Ajax request instead of the surrounding page, as that would eliminate the whole JavaScript abstraction entirely. You'd have one HTTP request, and unless that endpoint was kinked to only accept requests from within its own domain, you would likely have JSON or some other structured data in return, and that could be even easier to interpret in your application.

Walter

You mean that in order to make a better application I have to deliver the information as JSON? I'm kind of new with Rails (not a completely newbie but... sort of :smiley: )

Thanks for your help

Javier Q

A select can be traversed like any other DOM object; this should be fairly close:

#given doc is a Nokogiri::XML or Nokogiri::HTML nodeset
doc.css('#yourPickerId option').each do |opt|
  foo = opt['value']
  #whatever else you want to do with foo here
end

Thanks. In the Nokogiri example the result is accessed like "link.content", and that's why I was wondering how I can grab that information from the select group.

There are some basic things one can do with nodes once you find them. content() spills out the textual content of any node (in the case of an option, that might give you the same thing as the Option.text attribute in JavaScript, but I wouldn't count on it specifically). In the case of a div, for example, content would give you the textual content of that div, minus any HTML tags, while inner_html would give you the actual HTML code defining all of the content tags as well as their text content.

For everything else, any other named attribute on the given node you access simply by putting the name of the attribute in as a key:

my_select['label'], my_select['value'], or my_select['selected'], for example.

Behind the scenes, Nokogiri does some elegant metaprogramming with method_missing and gives you what you ask for if it's available.
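Both access styles can be seen side by side on a throwaway snippet (the markup and values are invented for the example):

```ruby
require 'nokogiri'

html = '<select><option value="42" selected="selected">Some University</option></select>'
opt = Nokogiri::HTML(html).at_css('option')

puts opt.content      # text content: "Some University"
puts opt['value']     # attribute access by key: "42"
puts opt['selected']  # "selected"; asking for an attribute that isn't there returns nil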

Possible, sure. It's never entirely clear why someone would run an Ajax request to populate a page. They may have done it to keep the scrapers out (like you), or they may have done it to isolate and accelerate a laggy part of the initial page load. If the latter (so they aren't actually discouraging you -- did you ask them if you could do this?) then you might also want to look into loading the endpoint of that Ajax request instead of the surrounding page, as that would eliminate the whole JavaScript abstraction entirely. You'd have one HTTP request, and unless that endpoint was kinked to only accept requests from within its own domain, you would likely have JSON or some other structured data in return, and that could be even easier to interpret in your application.

Walter

You mean that in order to make a better application I have to deliver the information as JSON?

I have seen this technique used for this reason, by splitting the application load over time on the same server or across servers. But then I would just throw a caching layer at the problem. Much less heartache.

I've also seen this technique used to obfuscate the data source, or simply to integrate third-party data sources into an existing site.

I'm kind of new with Rails (not a completely newbie but... sort of :smiley: )

Me too, but I've done quite a lot of Nokogiri recently, so it's all fairly fresh.

Walter

Hi, it's me again. I was trying some easy example and it worked... but now I've got some trouble. Is there a way to provide Nokogiri data such as a username and password? Because on one site I have to log in first. Scrapy gives a way to simulate user login, and I was wondering if Nokogiri can do the same.

Javier

You wouldn't do it at the Nokogiri level. You need to read up on the open-uri library; there are all sorts of goodies in there to manage authentication, sessions, everything needed to create a Web client. That layer of your application will get the text stream that you will send on to Nokogiri. There's nothing in Noko that is specific to solving that problem; it starts from the assumption that you have a text file locally or a stream from another client like open-uri.

Walter

It seems that :http_basic_authentication => [user, pass]

no longer works; I've tested it with two sites and got nothing.

Is there any other way?

Thanks

Javier

Can you post some code surrounding this, show the open-uri method call you're using?

Walter

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open(url, :http_basic_authentication => [user, pass]))
doc.xpath('//select/option').each do |opt|
  puts opt.content
end

I can grab some info from the main page of the URL (so it works), but when I go to its login page with user/pass and try to get some, it seems to get information from some other place (I'm not even sure from where).

Javier

doc = Nokogiri::HTML(open(url, :http_basic_authentication => [user, pass]))

I've made a mistake; that was another file.

What I'm using is:

open(url, :http_basic_authentication => [user, pass])

doc = Nokogiri::HTML(open(url))

Javier

Try all this out in a terminal with telnet or cURL -- see where you're actually going when you log in. You may be redirected in some subtle way. Also, a browser may throw a "basic authentication" dialog box when you're actually being challenged for digest authentication. :basic_authentication is not the same thing.

I think your real solution here will be to abstract out the open() bit inside the Nokogiri::HTML() call. Look for a gem that accepts a URL and returns a text stream and offers a whole bunch of configuration options for authentication. I am certain there are at least a handful of them out there. By separating your concerns in this way, you'll end up with a more modular solution so that you can swap in different credentials for each site you're scraping.
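One way to make that split, using only the standard library for the HTTP side -- the URL and credentials below are placeholders, and digest authentication would need a different client:

```ruby
require 'net/http'
require 'uri'

# Fetching is its own step, so credentials, SSL, and redirects can be
# handled here without touching the parsing code that comes after it.
def fetch(url, user, pass)
  uri = URI(url)
  req = Net::HTTP::Get.new(uri)
  req.basic_auth(user, pass)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(req).body
  end
end

# The body then goes to Nokogiri exactly as before:
#   doc = Nokogiri::HTML(fetch('http://example.com/login', 'user', 'pass'))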

Walter

Hi,

The question is… can I do that with nokogiri or another tool?

The list is like a country list, but with the names of the

universities of my country.

Like Nokogiri, there is another tool called Hpricot.

It seems that it gets that information from a DB using Ajax, and what I'm trying to do may not be legal or possible.

Yes, it is possible.

See some examples I tried with Nokogiri and Ruby:

Nokogiri

http://sathia27.wordpress.com/2011/09/06/tbus-version-1-search-bus-routes-from-terminal/

http://sathia27.wordpress.com/2011/12/05/english-to-tamil-translator-script/

Hpricot

http://sathia27.wordpress.com/2010/10/29/learned-ruby-and-hpricot/