Text Extraction and Indexing

Elliott_Clark · April 20, 2007, 5:39pm

Long story short, I am going to have to index and search uploaded files. They will be in Word, Excel, PDF, and plain-text format. What is the best way to extract the text in RoR so that I can place it in the database? There are command-line utilities that will convert Word to txt, but I would prefer an in-code solution if possible. Any suggestions for Excel? The only thing I could find was a Perl module.

I’ve decided to use acts_as_ferret as my indexing agent. Does anyone have any tips on using it other than http://www.railsenvy.com/2007/2/19/acts-as-ferret-tutorial ?

Jan_Prill · April 20, 2007, 5:54pm

Hi Elliott,

Have a look at the ContentExtractor of RDig (http://rdig.rubyforge.org/). It might get you a good way with PDF and Word, though as far as I know it uses command-line utilities.

Cheers, Jan
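
For readers following the command-line route mentioned above, here is a minimal sketch of dispatching on file extension and capturing a tool's output from Ruby. It is not RDig's ContentExtractor and nothing here comes from the thread: the names EXTRACTORS, extractor_command, and extract_text are hypothetical, and the tools pdftotext, antiword, and xls2csv (from the catdoc package) are assumed to be installed and on the PATH.

```ruby
require 'open3'

# Hypothetical mapping from file extension to a command-line extractor.
# Each tool is assumed to be installed and to write plain text to stdout.
EXTRACTORS = {
  '.pdf' => ->(path) { ['pdftotext', path, '-'] }, # "-" sends output to stdout
  '.doc' => ->(path) { ['antiword', path] },
  '.xls' => ->(path) { ['xls2csv', path] }         # assumed: catdoc's xls2csv
}.freeze

# Returns the argv array for the matching extractor, or nil if none is mapped.
def extractor_command(path)
  builder = EXTRACTORS[File.extname(path).downcase]
  builder && builder.call(path)
end

# Returns the extracted text of the file at path as a String.
def extract_text(path)
  # Plain-text uploads need no external tool.
  return File.read(path) if File.extname(path).downcase == '.txt'

  command = extractor_command(path)
  raise ArgumentError, "no extractor for #{path}" if command.nil?

  # Passing the command as separate arguments avoids shell interpolation
  # of user-supplied file names.
  output, status = Open3.capture2(*command)
  raise "#{command.first} failed for #{path}" unless status.success?
  output
end
```

The string returned by extract_text could then be saved to a text column on the upload's model, which is the kind of field acts_as_ferret can be configured to index.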
