Hello everyone,
I'm looking for a way to parse doc/docx files in Ruby on Rails.
I have tried File.open('filename', 'r'), but it shows special characters instead of the content of the file.
If all you want is the text content of the files, you can try the ancient Unix utility catdoc (it reads the older binary .doc format, not .docx). Just shell out to that command with backticks, and make sure it's installed somewhere on your Web server's path. The result will not be pretty, but it will have all of the words in it.
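As a rough illustration of shelling out, here is a minimal sketch. The method name extract_text and its command: parameter are made up for this example. It assumes catdoc is installed and on the PATH, and that catdoc prints the extracted text to standard output when given a file name. Open3 is used instead of plain backticks so that a path containing spaces is passed to the program untouched by the shell.

```ruby
require "open3"

# Runs an external text extractor (catdoc by default) on the given file
# and returns whatever it prints to standard output.
# NOTE: extract_text and the command: keyword are hypothetical names.
def extract_text(path, command: "catdoc")
  # Passing the program and its argument separately bypasses the shell,
  # so the path does not need to be escaped.
  output, status = Open3.capture2(command, path)
  raise "#{command} failed for #{path}" unless status.success?
  output
end
```

In a controller or model you would call something like extract_text("/path/to/upload.doc") and store the returned string.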
The docx format is actually pretty simple: it is a zipped set of
files. If you upload it to the server and unzip it, you'll see a set
of XML files. You can poke around and figure out the format, or you
can find a spec online.
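To make the unzip-and-read idea concrete, here is a hedged sketch rather than a complete docx parser. It assumes the main body is stored in word/document.xml inside the archive (the usual location), that the unzip command-line tool is installed, and that REXML is available. The method names docx_xml_to_text and docx_to_text are invented for this example. It only collects the text runs (w:t elements) paragraph by paragraph and ignores headers, footnotes, tables layout and formatting.

```ruby
require "open3"
require "rexml/document"

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
WORD_NS = { "w" => "http://schemas.openxmlformats.org/wordprocessingml/2006/main" }

# Takes the XML of word/document.xml and returns plain text,
# one line per <w:p> paragraph.
def docx_xml_to_text(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//w:p", WORD_NS).map do |para|
    # Each paragraph is split into runs; join the text of all of them.
    REXML::XPath.match(para, ".//w:t", WORD_NS).map(&:text).join
  end.join("\n")
end

# Pulls word/document.xml out of the archive with `unzip -p`
# (extract to stdout) and converts it to plain text.
def docx_to_text(path)
  xml, status = Open3.capture2("unzip", "-p", path, "word/document.xml")
  raise "could not read #{path}" unless status.success?
  docx_xml_to_text(xml)
end
```

Calling docx_to_text("/path/to/upload.docx") should then give you a string you can index or display. If you would rather not depend on the unzip binary, a zip library for Ruby could supply the XML string to docx_xml_to_text instead.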
For a start, here's the man page for catdoc, which you will need to install.
Then, read up on using system() or the backtick operator in a Ruby script to invoke it. You'll need the path to the file you want to process, which depends heavily on how you store your files. For Paperclip, I wrote this processor to extract text from PDF files (pdftotext ships with the Xpdf/Poppler utilities rather than with catdoc, but it is driven the same way):
# lib/paperclip_processors/text.rb
module Paperclip
  # Handles extracting plain text from PDF file attachments
  class Text < Processor
    attr_accessor :whiny

    # Creates a Text extract from PDF
    def make
      src = @file
      dst = Tempfile.new([@basename, 'txt'].compact.join("."))
      # Quoted source and destination paths, passed as arguments to pdftotext
      command = <<-end_command
        "#{ File.expand_path(src.path) }"
        "#{ File.expand_path(dst.path) }"
      end_command

      begin
        # Collapse the heredoc's newlines and indentation into single spaces
        success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
        Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
      rescue PaperclipCommandLineError
        raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
      end

      dst
    end
  end
end
Depending on how you are uploading your files, your mileage may vary. At the very simplest, the command would be