Index Word and PDF documents for full-text search

Hello list,

I’m about to start building a document catalog in Rails, basically a catalog of .doc and .pdf documents. First, I would like to know if there is anything like Lucene for Rails (or Ruby), maybe a Rails plugin?

Second: it would be nice if the user could search for a term and the search engine could “look into” the available documents. Is there some way to pre-index Word and PDF documents so that they become searchable?



Hi Marcelo,

Take a look at acts_as_solr.

It provides a simple Rails integration with Solr (a search server
based on Lucene that includes XML and JSON APIs).
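
To give a feel for what ends up on the wire, here is a rough sketch of the
kind of XML "add" message Solr accepts on its update handler. This is not
how acts_as_solr is implemented; the helper names (xml_escape, solr_add_xml)
and the field names (id, title, body) are made up for illustration, and the
real field names depend on your Solr schema.

```ruby
# Characters that must be escaped inside XML text and attribute values.
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Hypothetical helper: escape a value for safe inclusion in XML.
def xml_escape(value)
  value.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Hypothetical helper: build a Solr <add> message for one document.
# `fields` is a Hash of field name => value; the names are assumptions
# and must match whatever your Solr schema defines.
def solr_add_xml(fields)
  body = fields.map do |name, value|
    %(<field name="#{xml_escape(name)}">#{xml_escape(value)}</field>)
  end.join
  "<add><doc>#{body}</doc></add>"
end

# Example: the text extracted from a document goes into a searchable field.
puts solr_add_xml({ "id" => 1, "title" => "Q&A notes", "body" => "extracted text" })
```

The resulting string would be POSTed to Solr's update URL and followed by a
commit; a plugin such as acts_as_solr takes care of that part for you.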


Hi Mike, thanks for the tip!

As I'm finding out, this is extremely complicated. The search part is
pretty easy; we are using Solr for that right now. The indexing part
is another matter. We are looking at indexing terabytes of Word, PDF,
HTML, and whatever other formats we can support. We store MS Office
documents as PDF, but PDF is hard to index, so I'm looking at using
OpenOffice to convert each document to XML, index it at that point,
and then convert it to PDF. So far I haven't found anything open
source that works as well as OpenOffice for document conversion.
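
For what it's worth, the convert/index/convert flow can be sketched roughly
like this. It is only an outline under a few assumptions: the command line
is the LibreOffice-style `soffice --headless --convert-to` form (the
OpenOffice install may need a different mechanism, such as a macro or a
conversion service), and the helper names conversion_command and
text_from_xml are made up. The tag stripping is deliberately crude; a real
indexer would use a proper XML parser.

```ruby
# Named entities to decode after the markup has been stripped.
ENTITIES = { "amp" => "&", "lt" => "<", "gt" => ">", "quot" => '"', "apos" => "'" }.freeze

# Hypothetical helper: build the argument list for a headless conversion.
# Assumes a LibreOffice-style `soffice` binary on the PATH. The name of the
# XML export format depends on the installation, so it is left as a parameter.
def conversion_command(input_path, target_format, output_dir)
  ["soffice", "--headless", "--convert-to", target_format,
   "--outdir", output_dir, input_path]
end

# Hypothetical helper: crude text extraction from the intermediate XML.
# Tags become spaces, common entities are decoded in a single pass,
# and runs of whitespace are collapsed.
def text_from_xml(xml)
  text = xml.gsub(/<[^>]+>/, " ")
  text = text.gsub(/&(amp|lt|gt|quot|apos);/) { ENTITIES[$1] }
  text.split.join(" ")
end

# Final step of the flow: produce the PDF that gets stored.
pdf_cmd = conversion_command("report.doc", "pdf", "converted")
# system(*pdf_cmd) would run it; left commented out so the sketch has no side effects.
```

The text returned by text_from_xml is what would be sent to Solr as the
searchable body, while the stored PDF stays untouched by the indexer.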