I've got a web-app currently partially working. The user uploads a .txt,
.docx or .doc file to the server.
Currently the model handles those files, saves some metadata (the
extention and orig filename) then saves the file to the hard drive. Next
it converts the doc and docx files to plain text and saves the output to
a txt file.
My problem is I want to copy the plain text contents of those txt files
to the :body field in my database, but by the time those files are
written no more changes can be sent to the data base (because all the
file handling is done in after_save)
Where or how do I sanely get the contents of those TXT files into the
database?
I built this feature in my first commercial Rails app. I used Paperclip for my file storage, which offers its own callback called 'after_post_process' that worked out perfectly for me.
First, I created a Paperclip processor to extract the text version of the uploaded file (mine were all PDF).
# /lib/paperclip_processors/text.rb
module Paperclip
# Handles extracting plain text from PDF file attachments
class Text < Processor
attr_accessor :whiny
# Creates a Text extract from PDF
def make
src = @file
dst = Tempfile.new([@basename, 'txt'].compact.join("."))
command = <<-end_command
"#{ File.expand_path(src.path) }"
"#{ File.expand_path(dst.path) }"
end_command
begin
success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
rescue PaperclipCommandLineError
raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
end
dst
end
end
end
Then in my document.rb (model for the file attachment), I added the following bits:
has_attached_file :pdf,:styles => { :text => { :fake => 'variable' } }, :processors => [:text]
after_post_process :extract_text
private
def extract_text
file = File.open("#{pdf.queued_for_write[:text].path}","r")
plain_text = ""
while (line = file.gets)
plain_text << Iconv.conv('ASCII//IGNORE', 'UTF8', line)
end
self.plain_text = plain_text
end
And that was that.
Walter