Text Extraction and Indexing

Long story short, I need to index and search uploaded files. They will be Word, Excel, PDF, and plain-text documents. What is the best way to extract the text in RoR so that I can store it in the database? There are command-line utilities that convert Word to text, but I would prefer an in-code solution if possible. Any suggestions for Excel? The only thing I could find was a Perl module.

I’ve decided to use acts_as_ferret as my indexing agent. Does anyone have any tips on using it other than http://www.railsenvy.com/2007/2/19/acts-as-ferret-tutorial ?

Hi Elliott,

Have a look at the ContentExtractor of RDig: http://rdig.rubyforge.org/ It might get you a good way with PDF and Word, though as far as I know it uses command-line utilities.
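
To give an idea of what the command-line route can look like, here is a rough sketch. It is not RDig's actual code. It assumes pdftotext and antiword are installed on the server, and the names EXTRACTOR_COMMANDS, extractor_command and extract_text are made up for illustration. Excel is not covered.

```ruby
require "open3"

# Hypothetical mapping from file extension to an external converter.
# pdftotext and antiword are assumptions; swap in whatever is installed.
EXTRACTOR_COMMANDS = {
  ".pdf" => lambda { |path| ["pdftotext", path, "-"] }, # "-" writes to stdout
  ".doc" => lambda { |path| ["antiword", path] }
}

# Returns the command (as an argument array) for a file, or nil if the
# extension has no known converter.
def extractor_command(path)
  builder = EXTRACTOR_COMMANDS[File.extname(path).downcase]
  builder && builder.call(path)
end

# Returns the plain text of an uploaded file, ready to be stored in a
# database column that the indexer can pick up.
def extract_text(path)
  # Plain text needs no conversion.
  return File.read(path) if File.extname(path).downcase == ".txt"

  command = extractor_command(path)
  raise ArgumentError, "no extractor for #{path}" unless command

  # Passing the command as separate arguments avoids shell quoting
  # problems with odd upload file names.
  output, status = Open3.capture2(*command)
  raise "#{command.first} failed on #{path}" unless status.success?
  output
end
```

You could call extract_text from a before_save callback and write the result into a text column, then list that column among the fields acts_as_ferret indexes.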

Cheers, Jan