Wikipedia Parser

David2 · April 12, 2007, 6:24pm

I need to parse Wikipedia articles (written in Wikipedia's markup style) and redisplay them as HTML. Has anyone come across a Ruby library for this? Any that are particularly good at it?

Thanks

Chris_T · April 12, 2007, 6:40pm

David wrote:

> I need to parse and redisplay in html wikipedia articles (formatted with the wikipedia style). Has anyone encountered such a library in ruby ? Any libraries that are good at that?
>
> Thanks

Check out http://shanesbrain.net/articles/2006/10/02/screen-scraping-wikipedia
It makes it dead easy to roll your own.

Chris
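The linked article takes the screen-scraping route. If you start from raw wiki markup instead, rolling your own converter might begin with something like the sketch below. It is only a rough illustration: `TinyWikitext` is a made-up name, it uses nothing beyond core Ruby, and it handles just headings, bold, italics, internal links and paragraphs. Templates, tables, lists and references are ignored, and the `/wiki/` link prefix is an assumption.

```ruby
# Hypothetical helper: converts a very small subset of Wikipedia-style
# markup to HTML. Not a real library and nowhere near a full parser.
module TinyWikitext
  module_function

  ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

  # Converts inline markup: '''bold''', ''italic'' and [[Target|label]] links.
  def inline(text)
    html = text.gsub(/[&<>"]/, ESCAPES)                 # escape HTML first
    html = html.gsub(/'''(.+?)'''/, '<b>\1</b>')        # bold before italic
    html = html.gsub(/''(.+?)''/, '<i>\1</i>')
    html.gsub(/\[\[([^\]|]+)(?:\|([^\]]+))?\]\]/) do
      target = $1
      label = $2 || $1
      # Assumed URL scheme: /wiki/Page_title
      %(<a href="/wiki/#{target.strip.tr(' ', '_')}">#{label}</a>)
    end
  end

  # Converts a whole document: == Heading == lines become <hN>,
  # runs of other non-blank lines become one <p> each.
  def to_html(source)
    out = []
    para = []
    flush = lambda do
      out << "<p>#{inline(para.join(' '))}</p>" unless para.empty?
      para = []
    end

    source.each_line do |line|
      line = line.strip
      if line.empty?
        flush.call
      elsif (m = line.match(/\A(={2,6})\s*(.+?)\s*\1\z/))
        flush.call
        level = m[1].length
        out << "<h#{level}>#{inline(m[2])}</h#{level}>"
      else
        para << line
      end
    end
    flush.call
    out.join("\n")
  end
end

puts TinyWikitext.to_html("== History ==\n'''Ruby''' is a ''dynamic'' language.\n")
```

Real articles lean heavily on templates and nested markup, so a sketch like this gets you a preview at best.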

Andy_Triboletti1 · April 12, 2007, 6:47pm

Usually you shouldn't use bots on Wikipedia, but should download the
free database instead and use that. Read about their policy here:

If you have your own MediaWiki install and want to use a bot, you can
check out the pywikipedia bot (pywikibot, downloadable from SourceForge.net). It's not in Ruby,
but it works great.

Russell_Norris · April 12, 2007, 8:52pm

Actually, I’m not entirely sure that you shouldn’t use bots at all on Wikipedia. According to the link you provided: “Robots or bots are automatic processes that interact with Wikipedia as though they were human editors.”

That last bit sounds like they’re talking about a very specific kind of bot and not just a scraper.

RSL

njmacinnes · April 12, 2007, 9:11pm

“Robots or bots are automatic processes that interact with Wikipedia as though they were human editors.” There’s nothing against screen-scraping there. That policy is about bots which edit content. Otherwise, Google would be breaking WP policy.

This is taking the discussion a little off topic though. -Nathan

Shane_Vitarana · April 12, 2007, 9:12pm

I wrote that article a while ago. It would be interesting to redo it with WWW::Mechanize, or better yet, scRUBYt!, both of which use Hpricot in the backend anyway.

Shane

http://shanesbrain.net

Andy_Triboletti1 · April 12, 2007, 9:19pm

If you just need to cache some pages for displaying later, screen-scraping Wikipedia is a reasonable choice compared to downloading the whole database. If you’re going to be parsing and redisplaying the content in real time, though, that is against Wikipedia’s policy.

See http://en.wikipedia.org/wiki/Wikipedia:Database_download#Why_not_just_retrieve_data_from_wikipedia.org_at_runtime.3F
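A minimal sketch of the "cache now, display later" approach, assuming a simple on-disk cache is good enough. `PageCache` and its `max_age` option are hypothetical names, and the block you pass in stands for whatever scraping code you already have:

```ruby
require "digest/sha1"
require "fileutils"

# Hypothetical helper: keeps one copy of each page on disk so that
# displaying a page does not trigger a request to wikipedia.org.
class PageCache
  # dir:     directory where cached pages are stored
  # max_age: seconds before a cached copy is considered stale (assumed default: one day)
  def initialize(dir, max_age: 24 * 60 * 60)
    @dir = dir
    @max_age = max_age
    FileUtils.mkdir_p(@dir)
  end

  # Returns the cached body for +title+. The block is called to fetch
  # the page only when there is no fresh copy on disk.
  def fetch(title)
    path = File.join(@dir, "#{Digest::SHA1.hexdigest(title)}.html")
    if File.exist?(path) && Time.now - File.mtime(path) < @max_age
      return File.read(path)
    end

    body = yield(title)
    File.write(path, body)
    body
  end
end

# Usage (download_page is a placeholder for your own scraper):
#   cache = PageCache.new("tmp/wikipedia_cache")
#   html  = cache.fetch("Ruby (programming language)") { |title| download_page(title) }
```

With something like this, your views only ever read from disk, and you can refresh stale pages from a background job at whatever rate you think is polite.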
