How to read Microsoft document file in ruby on rails?

walterdavis · September 13, 2012, 1:59pm

Hello everyone, I'm looking for a way to parse doc/docx files in Ruby on Rails. I have tried File.open('filename', 'r'), but it shows special characters instead of the content of the file.

If all you want is the text content of the files, you can try the ancient Unix utility catdoc to do that. Just back-tick to that command (and make sure it's installed in your Web server's path). The result will not be pretty, but it will have all of the words in it.

Walter

Paul8 · September 15, 2012, 1:27pm

The docx format is actually pretty simple: it is a zipped set of files. If you upload it to the server and unzip it, you'll see a set of XML files. You can poke around and figure out the format, or you can find a spec online.
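To make that concrete, here is a minimal sketch of the unzip-and-read approach. It assumes the rubyzip gem is installed (the `require 'zip'` line), that the main body lives in word/document.xml, and that the file uses the usual `w:` prefix, with text in w:t elements grouped into w:p paragraphs. The method names docx_xml_to_text and read_docx_text are made up for illustration.

```ruby
require 'rexml/document'

# Collects the text of every w:t element, one output line per w:p paragraph.
# Assumes the document uses the conventional "w" namespace prefix.
def docx_xml_to_text(xml)
  doc = REXML::Document.new(xml)
  paragraphs = []
  doc.root.each_recursive do |node|
    next unless node.expanded_name == 'w:p'
    runs = []
    node.each_recursive do |child|
      runs << child.text.to_s if child.expanded_name == 'w:t'
    end
    paragraphs << runs.join
  end
  paragraphs.join("\n")
end

# Opens the .docx as a zip archive and extracts the text of its main part.
def read_docx_text(path)
  require 'zip' # rubyzip gem (assumed installed); loaded only when needed
  Zip::File.open(path) do |zip|
    docx_xml_to_text(zip.read('word/document.xml'))
  end
end
```

Calling read_docx_text('example.docx') should return the plain text, but formatting, headers, footers and footnotes are ignored, so treat it as a starting point rather than a full parser.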

Scott_Ribe · September 15, 2012, 2:07pm

You are really cruel to toy with him like that

rovin_varshney · September 16, 2012, 10:16am

Hi Walter Lee Davis, Paul,

Could you please share a code snippet or give some more clarification about parsing doc files?

Dheeraj_Kumar · September 16, 2012, 12:28pm

Did you try googling? This was the third link I found.

http://deepakprasanna.blogspot.in/2011/06/parsing-pdfdocdocx-content-with-apache.html

Dheeraj Kumar

G_S_RAO · September 16, 2012, 12:59pm

PDFTron may be useful. Google for "PDFTron Ruby Integration" programs.

walterdavis · September 16, 2012, 4:12pm

For a start, here's the man page for catdoc, which you will need to install.

Then, read up on using system() or the backtick operator in a Ruby script to invoke it. You'll need a path to the file you want to process, which depends heavily on the system you're using to store the files. In Paperclip, I made this processor to extract text from PDF files (pdftotext comes from the Xpdf/Poppler utilities rather than the catdoc package, but the approach of shelling out is the same):

#lib/paperclip_processors/text.rb

module Paperclip
  # Handles extracting plain text from PDF file attachments
  class Text < Processor

    attr_accessor :whiny

    # Creates a Text extract from PDF
    def make
      src = @file
      dst = Tempfile.new([@basename, 'txt'].compact.join("."))
      command = <<-end_command
        "#{ File.expand_path(src.path) }"
        "#{ File.expand_path(dst.path) }"
      end_command

      begin
        success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
        Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
      rescue PaperclipCommandLineError
        raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
      end
      dst
    end
  end
end

Depending on how you are uploading your files, your mileage may vary. At the very simplest, the command would be

# Backticks return the command's output; system() would only return true or false.
text_contents = `/usr/bin/catdoc /root/relative/path/to/file.doc`

But that's hopelessly naive and will blow up on any error.
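A slightly more defensive version might look like the sketch below. It uses Open3.capture3 from Ruby's standard library, which passes the path as a separate argument (so no shell quoting is involved) and hands back stdout, stderr and the exit status. The helper name extract_text and the default /usr/bin/catdoc location are assumptions, so adjust them for your server.

```ruby
require 'open3'

# Runs a text-extraction command on a file and returns what it prints.
# The default command path is an assumption; check where catdoc lives on your system.
def extract_text(path, command: '/usr/bin/catdoc')
  raise ArgumentError, "no such file: #{path}" unless File.file?(path)

  # Separate arguments bypass the shell, so odd characters in the path are safe.
  stdout, stderr, status = Open3.capture3(command, path)
  raise "#{command} failed: #{stderr.strip}" unless status.success?

  stdout
rescue Errno::ENOENT
  # Raised when the command itself cannot be found.
  raise "#{command} is not installed or not on the PATH"
end
```

In a controller or background job you would call extract_text(path_to_upload) and rescue the errors to show a friendly message instead of a stack trace.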

Walter

rovin_varshney · September 18, 2012, 5:40am

Hello Everyone,

Thanks, everyone. I finally found a solution while researching the things you all explained.

There is a docx gem for parsing docx files, and docx-html for converting them to HTML.

require 'docx'

d = Docx::Document.open('example.docx')
# Print the text of each paragraph in the document
d.each_paragraph do |p|
  puts p
end

And for a docx file stored on Amazon S3:

require 'open-uri' # lets open() fetch URLs (use URI.open on Ruby 3.0+)

Docx::Document.open(open('http://S3-URL/original.docx', :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE))

A big thanks to all.
