Problem processing text file after uploading

I've got a web-app currently partially working. The user uploads a .txt,
.docx or .doc file to the server.

Currently the model handles those files, saves some metadata (the
extension and original filename), then saves the file to the hard drive.
Next it converts the doc and docx files to plain text and saves the
output to a txt file.

My problem is I want to copy the plain text contents of those txt files
to the :body field in my database, but by the time those files are
written no more changes can be sent to the database (because all the
file handling is done in after_save).

Where or how do I sanely get the contents of those TXT files into the
database?

See model attached:

Attachments:
http://www.ruby-forum.com/attachment/7574/doc_file.rb

I built this feature in my first commercial Rails app. I used Paperclip for my file storage, which offers its own callback called 'after_post_process' that worked out perfectly for me.

First, I created a Paperclip processor to extract the text version of the uploaded file (mine were all PDF).

# /lib/paperclip_processors/text.rb

module Paperclip
  # Handles extracting plain text from PDF file attachments
  class Text < Processor

    attr_accessor :whiny

    # Creates a Text extract from PDF
    def make
      src = @file
      dst = Tempfile.new([@basename, 'txt'].compact.join("."))
      command = <<-end_command
        "#{ File.expand_path(src.path) }"
        "#{ File.expand_path(dst.path) }"
      end_command

      begin
        success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
        Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
      rescue PaperclipCommandLineError
        raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
      end
      dst
    end
  end
end

Then in my document.rb (model for the file attachment), I added the following bits:

  has_attached_file :pdf, :styles => { :text => { :fake => 'variable' } }, :processors => [:text]

  after_post_process :extract_text

  private
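  def extract_text
    # Read the text file Paperclip has queued for the :text style and
    # copy its contents into plain_text before the record is saved.
    plain_text = ""
    File.open(pdf.queued_for_write[:text].path, "r") do |file|
      while (line = file.gets)
        plain_text << Iconv.conv('ASCII//IGNORE', 'UTF8', line)
      end
    end
    self.plain_text = plain_text
  end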

And that was that.

Walter

But...paperclip is OLD and unmaintained, and this is also a learning
project.

So is there some (best practices) way to do the following things without
having to make another pass over my doc_file or using paperclip:

1. upload .doc and store metadata
2. convert to plain text and write .txt to hard drive
3. grab contents of .txt file and store in database

Wouldn't the obvious answer be to do the file handling in before_save?

And is there a reason to write the text to a file in the first place if you're
just going to save it in the DB?
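
For what it's worth, here is a minimal sketch of the before_save idea, assuming the model has a body attribute and the upload is reachable as a path on disk. BodyExtraction, fill_body_from and FakeDoc are hypothetical names made up for illustration. In a real model the method would be registered with before_save, so the assignment goes out with the same INSERT/UPDATE instead of arriving too late in after_save. Only the .txt case is shown. The existing doc/docx conversion would slot in where the comment says.

```ruby
# Plain-Ruby sketch; no Rails needed to run it.
module BodyExtraction
  # Reads the plain text at +path+ and assigns it to +body+.
  # In a Rails model this would be called from a before_save callback,
  # so the value is written as part of the normal save.
  def fill_body_from(path)
    case File.extname(path).downcase
    when ".txt"
      self.body = File.read(path)
    else
      # .doc/.docx: convert to text first. The converter is not shown;
      # the model's existing conversion step would go here.
      raise ArgumentError, "no converter wired up for #{File.extname(path)}"
    end
  end
end

# Hypothetical stand-in for the ActiveRecord model, for demonstration only.
FakeDoc = Struct.new(:body) do
  include BodyExtraction
end
```

Because the text is assigned to the attribute directly, the intermediate .txt file is only needed if something else reads it.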

Hassan Schroeder wrote in post #1067807:

Well, since it's a "learning project" maybe that would be a good place
to start :)

Alternatively, you might consider pushing the doc-to-text conversion
into a background job, which adds the text to the db record once it's
finished. Or use an Observer to add the text after after_save.

Multiple possibilities...
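
A rough sketch of the background-job variant, using Ruby's stdlib Queue and a worker thread purely as a stand-in for a real job system (in a Rails app that would more likely be something like delayed_job or Resque). Record, JOBS, WORKER and convert_to_text are hypothetical names, and the conversion is faked by simply reading the file.

```ruby
require "thread"

# Hypothetical stand-in for the database row.
Record = Struct.new(:path, :body)

# Placeholder for the real doc-to-text conversion.
def convert_to_text(path)
  File.read(path)
end

JOBS = Queue.new

# Worker: pulls records off the queue and fills in body once the
# conversion has finished. Pushing :stop shuts the worker down.
WORKER = Thread.new do
  while (job = JOBS.pop) != :stop
    job.body = convert_to_text(job.path)
    # In Rails, this is where the record would be saved a second time.
  end
end
```

The upload request only has to enqueue the record (JOBS << record). The trade-off is that body stays empty until the worker gets to it.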

With files it is often better just to store them in files and not in
the database. Certainly they should not be stored in both file and
database.

Colin

Hassan Schroeder wrote in post #1067812:

Have a look at the Rails Guide on debugging for techniques that can be
used to debug your code. If you still can't work out what is going on
then come back with the details of the section of code that is failing
to do what you expect.

Colin

Start by defining exactly what "doesn't seem to function" means :)

Hassan Schroeder wrote in post #1067817:

You need to do some debugging to see what is going on. Is the save
failing or is it not getting to the save statement for some reason?
Having worked out which of those is happening then do more debugging
to find out why.

Colin

OK, why not?

As Colin suggested, study the debugging guide (or just put logging
statements in the code to see what's happening at each step).
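
To make the logging suggestion concrete, here is a small sketch using Ruby's stdlib Logger. TracedDoc is a hypothetical plain-Ruby class, not the poster's model; in a Rails model the same calls would go through logger.info and, by default, end up in log/development.log.

```ruby
require "logger"
require "stringio"

class TracedDoc
  attr_reader :log_output

  def initialize
    # Log to an in-memory buffer so the output is easy to inspect.
    @log_output = StringIO.new
    @logger = Logger.new(@log_output)
  end

  def save
    @logger.info("save: start")
    store_docfile
    @logger.info("save: done")
    true
  end

  private

  def store_docfile
    # If this line never shows up in the log, the method was never called.
    @logger.info("store_docfile: entered")
  end
end
```

Logging on both sides of the call is what separates "the method ran and failed" from "the method was never invoked".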

I know you guys seem to be sticking to the RTFM hardline, but it seems
as though debugging in the model has very few options without importing
a bunch of gems.

Even on the page recommended there are 35 mentions of controller, and
only 4 mentions of model.

I installed the debugger gem ('gem install debugger'), but it doesn't
integrate at all with webrick ('rails s'), and there apparently is no
ruby-debug for 1.9.3 (ughh..)

I've put a bunch of logger.info in my model, but I now know no more than
I did before.

When store_docfile is called before after_save, it never even gets to
the first line containing the logger.info "we are now in store_docfile"
message.

I have a feeling this might be something deeper than a tiny typo *shrug*

If one of you could PLEASE just look at my model and help me figure out
what's up, it would be appreciated.

I don't see any obvious problems in your original file.

If not with after_save, how are you calling store_docfile now? You
might want to post your new code for the model (and controller).

Hassan Schroeder wrote in post #1067836:

In your new example file, it's no surprise you're not seeing anything --
you're never calling `store_docfile` at all. (No, that random standalone
`:store_docfile` doesn't do what you're hoping it does.)

Either invoke it from a before_save, or make it a non-private method
(at least temporarily) and invoke it explicitly from your controller and
see what happens.
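
A plain-Ruby illustration of that point, with hypothetical class names (Broken and Fixed). A bare :store_docfile in a class body is only a Symbol literal: Ruby evaluates it and throws it away. Something has to actually invoke the method, either a callback registration such as before_save :store_docfile in Rails or an explicit call.

```ruby
class Broken
  attr_reader :ran

  :store_docfile # just a Symbol literal; evaluated and discarded

  def save
    true # nothing here ever calls store_docfile
  end

  private

  def store_docfile
    @ran = true
  end
end

class Fixed
  attr_reader :ran

  def save
    store_docfile # explicit call; in Rails: before_save :store_docfile
    true
  end

  private

  def store_docfile
    @ran = true
  end
end
```

After save, Broken#ran is still nil while Fixed#ran is true, which matches the symptom of the first logger line never appearing.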

That is the clue then, but you are misinterpreting what you are
seeing. If it is not getting to the first line then it is not in fact
calling the method at all. Check out how you are calling it.

Colin

Perhaps you could start by “learning” how to decide whether a gem is unmaintained. For instance:

https://github.com/thoughtbot/paperclip/commits/master/

doesn’t exactly look like “no activity” to me…

–Matt Jones