Wikipedia Parser

I need to parse Wikipedia articles (written in Wikipedia's wiki
markup) and redisplay them as HTML. Has anyone come across a Ruby
library that does this well?

Thanks


Check out
http://shanesbrain.net/articles/2006/10/02/screen-scraping-wikipedia
Makes it dead easy to roll your own.
Chris
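
For a rough idea of what "rolling your own" can look like, here is a minimal sketch that uses only Ruby's standard library. It is not the code from the linked article (that approach is built on Hpricot); the method names `fetch_html` and `paragraphs` and the User-Agent string are made up for illustration, and the regex-based extraction is a crude stand-in for a real HTML parser.

```ruby
require "net/http"
require "uri"
require "cgi"

# Fetch a page and return its body as a string.
# The default User-Agent is a placeholder; replace it with something
# that identifies your own application.
def fetch_html(url, user_agent: "example-scraper/0.1 (you@example.com)")
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  request["User-Agent"] = user_agent
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  raise "HTTP #{response.code} for #{url}" unless response.is_a?(Net::HTTPSuccess)
  response.body
end

# Pull the text of every <p> element out of an HTML string.
# Inner tags are stripped and HTML entities are decoded.
# This is regex-based, so it will break on nested or malformed markup.
def paragraphs(html)
  html.scan(%r{<p\b[^>]*>(.*?)</p>}m).map do |(inner)|
    CGI.unescapeHTML(inner.gsub(/<[^>]+>/, "")).strip
  end.reject(&:empty?)
end
```

Something like `paragraphs(fetch_html(url))` would then give you the article text paragraph by paragraph, which you can wrap in your own markup.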

Usually you shouldn't use bots on Wikipedia; download the free
database dump and work from that instead.
Read about their policy here:
http://en.wikipedia.org/wiki/Wikipedia:Bots

If you have your own MediaWiki install and want to use a bot, you can
check out pywikipediabot:
http://sourceforge.net/projects/pywikipediabot/ It's not in Ruby,
but it works great.

Actually, I’m not entirely sure that you shouldn’t use bots at all on Wikipedia. According to the link you provided:
“Robots or bots are automatic processes that interact with Wikipedia as though they were human editors”

That last bit sounds like they’re talking about a very specific kind of bot and not just a scraper.

RSL

“Robots or bots are automatic processes that interact with Wikipedia as though they were human editors.” There’s nothing against screen-scraping there. That policy is about bots which edit content. Otherwise, Google would be breaking WP policy.

This is taking the discussion a little off topic though.
-Nathan

I wrote that article a while ago. It would be interesting to redo it
with WWW::Mechanize or, better yet, scRUBYt, both of which use Hpricot
in the backend anyway.

Shane

http://shanesbrain.net

If you just need to cache some pages for displaying later, screen scraping Wikipedia is a good choice compared to downloading the database. If you’re going to be parsing and redisplaying the content in real time, though, that is against Wikipedia’s policy.

See http://en.wikipedia.org/wiki/Wikipedia:Database_download#Why_not_just_retrieve_data_from_wikipedia.org_at_runtime.3F
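
To make the caching idea concrete, here is a minimal sketch of an on-disk page cache, so that each page is fetched at most once and later displays are served from the local copy. The class name `PageCache` and its interface are hypothetical; the block passed to the constructor stands in for whatever fetching code you already have.

```ruby
require "digest/sha2"
require "fileutils"

# Stores fetched pages as files under a cache directory, keyed by a
# hash of the URL. The fetcher block is only called on a cache miss.
class PageCache
  def initialize(dir, &fetcher)
    @dir = dir
    @fetcher = fetcher
    FileUtils.mkdir_p(@dir)
  end

  # Return the cached body for url, fetching and storing it first
  # if it is not in the cache yet.
  def fetch(url)
    path = File.join(@dir, Digest::SHA256.hexdigest(url) + ".html")
    return File.read(path) if File.exist?(path)

    body = @fetcher.call(url)
    File.write(path, body)
    body
  end
end
```

Usage would look roughly like `cache = PageCache.new("wp_cache") { |url| my_fetch(url) }` followed by `cache.fetch(url)`, where `my_fetch` is your own fetching method. This sketch never expires entries, so refreshing a page means deleting its file.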