My Rails application (Rails 4.1, Ruby 2.1.1) lets the user upload a
file. The application then parses this file and, once parsing is done,
deletes it from the upload area.
In my controller, params[:upload] contains an object of class Tempfile,
which is already opened for reading. I am using #readline to read
through this file.
The problem now is that the file is encoded as UTF-8, and as soon as
the data being read contains a character outside 7-bit ASCII, I get
an exception.
What is the best way to read an uploaded UTF-8 file?
I was already thinking along the following line: The Tempfile class also
has a method #path, which returns the path of the uploaded file. I could
create a File object by opening this path, specify utf8 when opening it,
and read from this.
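The workaround described above can be sketched roughly as follows. This
is a minimal, standalone sketch: the Tempfile named `upload` is a
stand-in for the object found in params[:upload], and the sample text is
made up for illustration.

```ruby
require 'tempfile'

# Stand-in for the uploaded file: a binary-mode Tempfile holding UTF-8 bytes.
# (In the controller this would be the object from params[:upload].)
upload = Tempfile.new('upload')
upload.binmode
upload.write("Gr\u00FC\u00DFe\nWelt\n")
upload.flush

# Reopen the same file by path, declaring its external encoding as UTF-8.
lines = File.open(upload.path, 'r:UTF-8') do |file|
  file.each_line.map(&:chomp)
end

upload.close
upload.unlink
```

Every string in `lines` is tagged as UTF-8, so characters outside 7-bit
ASCII are read as whole characters.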
However, since this problem must occur quite frequently, I wonder
whether there is a way (maybe in the file_field_tag) to tell Rails that
the Tempfile object should be opened as utf8 for reading. Is this
possible, or is there another good way to deal with this problem?
In this case, it is pretty certain that every file will contain UTF-8
characters, and in general, I think the cases are few where we can
assume input to be represented by 7-bit ASCII.
What I do not know for sure is whether or not the file will have a BOM,
but I think Ruby can figure this out automatically, when supplying the
"BOM" option on opening.
It would also make sense to allow files using different encodings, such
as UTF-16, but this is something I will have to deal with later.
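For the BOM option mentioned above, here is a small sketch of how it
behaves in plain Ruby. The helper name `first_line_skipping_bom` and the
sample bytes are made up for illustration; the "BOM|UTF-8" mode string
is the standard way to ask Ruby to skip a leading UTF-8 BOM when one is
present.

```ruby
require 'tempfile'

# Write raw bytes to a scratch file, then read its first line back through
# the "BOM|UTF-8" mode, which skips a leading UTF-8 BOM if there is one.
def first_line_skipping_bom(bytes)
  scratch = Tempfile.new('bom-demo')
  begin
    scratch.binmode
    scratch.write(bytes)
    scratch.flush
    File.open(scratch.path, 'r:BOM|UTF-8') { |file| file.readline }
  ensure
    scratch.close
    scratch.unlink
  end
end

bom  = "\xEF\xBB\xBF".b    # the UTF-8 byte order mark
text = "caf\u00E9\n".b     # sample content as raw bytes

with_bom    = first_line_skipping_bom(bom + text)
without_bom = first_line_skipping_bom(text)
```

Both calls should yield the same UTF-8 string, so the caller does not
need to know in advance whether the file starts with a BOM.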
The stackoverflow link you presented doesn't really answer my problem,
though. It just describes how I can *open* a UTF-8 file, and this is
the workaround I'm using in the meantime (as outlined in my posting,
where I say: "I could create a File object by opening...").
What I would like to know is, whether there is a simpler way (since the
file, after all, is already opened when my controller is entered), and
in particular why set_encoding doesn't work for my Tempfile object,
even though this would work well for a File object.
Gotcha. Is the file actually opened when the controller is entered? (That’s an honest question; I’m interested in how that works when the file comes in as an upload from a form.) The way you’ve described, which I failed to understand the first time, seems to me like the best way, but I’d be interested to see what others have to say.
Yes, it is, as I found by trial-and-error. Note that the object is not
just a File, it is of class Tempfile. I think this is quite common when
working with a Tempfile object. To make a Tempfile threadsafe, you have
to combine the creation of the filename and the creation of the file
into one call (otherwise you have a race condition if another process
tries to create a tempfile in the same directory and by accident comes
up with the same name).
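The race-free creation described above is commonly achieved with the
File::EXCL flag, which makes "check that the name is free" and "create
the file" a single step. A minimal sketch follows; the helper name
`create_exclusively` and the file name pattern are hypothetical.

```ruby
require 'tmpdir'
require 'securerandom'

# Create a file only if it does not exist yet. File::EXCL turns the
# existence check and the creation into one atomic operation.
def create_exclusively(path)
  File.open(path, File::RDWR | File::CREAT | File::EXCL, 0o600)
rescue Errno::EEXIST
  nil # the name is already taken; the caller should try another one
end

candidate = File.join(Dir.tmpdir, "demo-#{SecureRandom.hex(8)}")
first  = create_exclusively(candidate) # succeeds and returns an open File
second = create_exclusively(candidate) # same name again: refused

first.close
File.unlink(candidate)
```

Because the file comes back already open, whoever receives it never has
to open it by name again.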
While I didn't dive into the source code to see how Rails is
implemented in this respect, it would be reasonable to assume that, for
the upload, a Tempfile object is created for read+write, the uploaded
file is written to it, and the file pointer is repositioned at the
beginning of the file, before it is handed over to the controller. Since
the uploading process can't know anything about the encoding, the file
must have been opened as a binary file. That's why I had the idea that I
just need to set the encoding to the desired value before starting to
read from the file.
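The assumed sequence (binary Tempfile, write, rewind, then set the
encoding before reading) can be tried out in plain Ruby, outside Rails.
This is only a sketch under that assumption: `upload` is a stand-in
created here, not the real object handed to the controller, so it does
not show how the upload object itself behaves.

```ruby
require 'tempfile'

# Simulate what the upload step is assumed to do: write raw bytes to a
# binary-mode Tempfile, then rewind it for the consumer.
upload = Tempfile.new('upload')
upload.binmode
upload.write("na\u00EFve\n")
upload.rewind

binary_line = upload.readline   # tagged ASCII-8BIT (binary)

upload.rewind
upload.set_encoding('UTF-8')    # delegated to the underlying File
utf8_line = upload.readline     # tagged UTF-8 from here on

upload.close
upload.unlink
```

If the same calls behave differently in the controller, it may be worth
checking the class of the object that actually arrives there.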
As far as I understand this article, it relates to Rails 3 and MySQL,
and how to use UTF-8 encoded data everywhere. I don't know about MySQL,
but Rails 4 and Ruby 2 with SQLite don't suffer from this problem: I
didn't have any trouble processing all kinds of Unicode characters with
my application, and processing the uploaded file also works fine, as
long as I use my (not very elegant) trick of opening it a second time
with the desired encoding.
It now occurs to me, that the question is maybe not Rails-specific, but
a general Ruby question - how to change the encoding of a Tempfile
object.
That shows how to create a Tempfile with a given encoding, but the question is: when a user uploads a file through a form and Rails creates a Tempfile, is there a way to indicate that it should always create those Tempfiles with a default encoding such as UTF-8?
In that case it *is* a Rails specific issue, not a Ruby question as
suggested by the OP.
In that case, is there something like a simple config change to be made in, say, config/application.rb that tells Rails how to encode created Tempfiles? If not, would that be something that could/should be added to the Rails project itself?
Indeed, you are right insofar as it *might* be a Rails question. Still,
I wonder why (in general) it is not possible to change the encoding of
an existing (already open) Tempfile. Assuming that it is OK to "rewind"
the file, I don't see a technical reason why this should not be possible.
I don't think it would be a good idea to configure this on the Rails
side. Imagine the following scenario: We have a website which allows
users to upload textfiles, the content of which will eventually go into
the database. Since we are generous about the encoding, we also provide
the user with a dropdown list to choose a suitable encoding.
When the user clicks the upload button, the controller gets the uploaded
file plus information about the encoding. Clearly, Rails cannot
anticipate the encoding of the file. It can only upload the file
(binary) and provide the controller with an open file handle.
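For the scenario above, the controller side could look roughly like the
following sketch. The helper name `read_lines_as_utf8` and its
parameters are hypothetical: `path` stands for the path of the uploaded
file and `encoding_name` for the value chosen in the dropdown.

```ruby
require 'tempfile'

# Read a file in the encoding the user picked, transcoding each line to
# UTF-8. Encoding.find raises ArgumentError for an unknown encoding name.
def read_lines_as_utf8(path, encoding_name)
  external = Encoding.find(encoding_name)
  File.open(path, "r:#{external.name}:UTF-8") do |file|
    file.each_line.map(&:chomp)
  end
end

# Demo: a Latin-1 file containing "café" (0xE9 is "é" in ISO-8859-1).
sample = Tempfile.new('latin1')
sample.binmode
sample.write("caf\xE9\n".b)
sample.flush

lines = read_lines_as_utf8(sample.path, 'ISO-8859-1')

sample.close
sample.unlink
```

Validating the dropdown value with Encoding.find before opening the file
keeps an unexpected form value from ending up in the mode string.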
Now Ruby *does* have the set_encoding method for File, and Tempfile is-a
file, and set_encoding *can* be called - it just fails. We have nearly
everything in place. Now, if we can find out WHY set_encoding fails (and
this might be a generic Ruby question), we can find out what Rails (or
the programmer) can do to let things go smoothly....
OK, so after some digging: it seems that when you create your new File object and set the encoding, you may not need to read the Tempfile in its entirety. You can create a new File object from the Tempfile with File.new(my_temp_file, encoding: 'utf-8') and then use this file. It should be using the same underlying file and just creating a new file pointer to it with a new encoding. If you wanted to read the lines out individually and just use the original Tempfile, you could call force_encoding('utf-8') on each line to make sure each one is tagged as UTF-8.
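The second suggestion (reading from the original Tempfile and calling force_encoding on each line) might look like the sketch below. It assumes the Tempfile is in binary mode, and `upload` is a stand-in created here rather than the real upload object. Note that force_encoding only relabels the bytes; it does not convert them, so the sketch also checks that the result is valid UTF-8.

```ruby
require 'tempfile'

# Stand-in for the binary-mode Tempfile handed to the controller.
upload = Tempfile.new('upload')
upload.binmode
upload.write("\u00E4pfel\nbirnen\n")
upload.rewind

lines = upload.each_line.map do |raw|
  line = raw.force_encoding('UTF-8') # relabels the bytes, no conversion
  raise ArgumentError, 'upload is not valid UTF-8' unless line.valid_encoding?
  line.chomp
end

upload.close
upload.unlink
```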
I think the idea of first reading the lines and then using
force_encoding on the strings would not work, for two reasons:
1. As I have experienced, I already get the exception on the first byte
which has the high-bit set (i.e. is not 7-bit ASCII)
2. If a UTF8 encoded character contained 0x0d or 0x0a, reading the line
without being aware of the encoding, would "split" the character into
two parts.
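For UTF-8 specifically, the second concern can be checked directly:
every byte of a multi-byte UTF-8 sequence has its high bit set, so none
of them can equal 0x0a or 0x0d. The concern does apply to encodings such
as UTF-16. The sample characters below are arbitrary.

```ruby
# Every byte of a multi-byte UTF-8 sequence is >= 0x80, so it can never be
# mistaken for LF (0x0A) or CR (0x0D) when splitting on line endings.
samples = ["\u00E9", "\u20AC", "\u{1F600}", "\u010A", "\u0A0D"]
utf8_safe = samples.all? do |char|
  char.encode('UTF-8').bytes.all? { |byte| byte >= 0x80 }
end

# UTF-16 is different: U+010A is stored as the bytes 0x0A 0x01 (little endian).
utf16_bytes = "\u010A".encode('UTF-16LE').bytes
```

Here `utf8_safe` is true, while `utf16_bytes` contains a 0x0A byte that
an encoding-unaware readline would treat as a line break.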
In addition, this solution would not account for a BOM (unless I write
special logic to extract an optional BOM on the first line being read).
Although I have only files without BOM at the moment, it is likely that
sooner or later I will also have to support uploading of files which
contain a BOM.
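The special logic for an optional BOM on the first line could be as
small as the following sketch. The helper name `strip_bom` and the
sample line are made up; the line is assumed to be tagged as UTF-8
already (for example via force_encoding).

```ruby
UTF8_BOM = "\uFEFF".freeze

# Remove a leading BOM, if present. Meant to be applied to the first line
# only; `line` must already be tagged as UTF-8.
def strip_bom(line)
  line.start_with?(UTF8_BOM) ? line[1..-1] : line
end

raw_first_line = "\xEF\xBB\xBFid;name\n".b # BOM bytes followed by sample text
first_line = strip_bom(raw_first_line.force_encoding('UTF-8'))
```

A line without a BOM passes through unchanged, so the same call works
for both kinds of files.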