Index Word and PDF documents for full-text search

Hello list,

I’m about to start building a document catalog in Rails, essentially a catalog of .doc and .pdf documents. First, I would like to know whether there is anything like Lucene for Rails (or Ruby), maybe a Rails plugin?

Second, it would be nice if the user could search for a term and the search engine could “look into” the available documents. Is it possible to pre-index Word and PDF documents so that their contents are searchable?

Thanks,

Marcelo.

Hi Marcelo,

Take a look at acts_as_solr:

http://acts_as_solr.railsfreaks.com/

It provides a simple Rails integration with Solr (a search server based on Lucene that includes XML and JSON APIs).

Mike
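To illustrate the JSON API mentioned above, here is a minimal sketch that builds a Solr search URL using only the Ruby standard library. The host, port, and /solr/select path are assumptions based on a default local Solr install, and solr_search_url is a hypothetical helper name, not part of acts_as_solr.

```ruby
require "uri"

# Assumed defaults: a local Solr instance on port 8983 with the classic
# /solr/select handler. Adjust this to match your own deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

# Hypothetical helper: build a search URL that asks Solr for a JSON
# response (wt=json) and limits the number of returned rows.
def solr_search_url(term, rows: 10)
  uri = URI.parse(SOLR_SELECT_URL)
  uri.query = URI.encode_www_form("q" => term, "wt" => "json", "rows" => rows)
  uri.to_s
end

puts solr_search_url("annual report")
```

Fetching that URL (for example with Net::HTTP) should return the matching documents as JSON, assuming Solr is running and the index has been populated.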

Hi Mike, thanks for the tip!

As I'm finding out, this is extremely complicated. The search part is pretty easy; we are using Solr for that right now. The indexing part is another matter. We are looking at indexing terabytes of Word, PDF, HTML, and whatever other formats we can support. We store Microsoft documents as PDF, but PDF is hard to index, so I'm looking at using OpenOffice to convert each document to XML, index it at that point, and then convert it to PDF. So far I haven't found anything open source that works as well as OpenOffice for document conversion.

Chris
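
For anyone following along, the "extract text first, then index" idea can be sketched roughly as below. This is not the OpenOffice pipeline Chris describes; it swaps in two command-line extractors, pdftotext and antiword, and assumes both are installed and on the PATH. The helper names and the Solr field names "id" and "text" are hypothetical and would have to match your own schema.

```ruby
require "cgi"
require "open3"

# Hypothetical mapping from file extension to an external text extractor.
# Assumes pdftotext (xpdf/poppler) and antiword are installed.
EXTRACTORS = {
  ".pdf" => ->(path) { ["pdftotext", path, "-"] }, # "-" sends the text to stdout
  ".doc" => ->(path) { ["antiword", path] }
}

# Pick the extractor command for a file based on its extension.
def extractor_command(path)
  builder = EXTRACTORS[File.extname(path).downcase]
  raise ArgumentError, "no extractor for #{path}" unless builder
  builder.call(path)
end

# Run the extractor without going through a shell and return its output.
def extract_text(path)
  text, status = Open3.capture2(*extractor_command(path))
  raise "extraction failed for #{path}" unless status.success?
  text
end

# Build a Solr XML <add> message for one document.
# The field names "id" and "text" are assumptions, not a fixed Solr schema.
def solr_add_xml(id, text)
  "<add><doc>" \
    "<field name=\"id\">#{CGI.escapeHTML(id)}</field>" \
    "<field name=\"text\">#{CGI.escapeHTML(text)}</field>" \
    "</doc></add>"
end
```

The resulting XML would be POSTed to Solr's update handler (typically /solr/update) and followed by a <commit/> message so the new document becomes searchable. At terabyte scale the extraction step would most likely need to run as a batch job outside the Rails request cycle.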