Upload UTF-8 encoded textfile

My Rails application (Rails 4.1, Ruby 2.1.1) lets the user upload a file. The application then parses this file, and once parsing is done, the file is deleted from the upload area.

So far, I have the following:

In my upload form, I have

    <%= file_field_tag :upload, {accept: 'text/plain', class: 'file_upload'} %>

In my controller, params[:upload] contains an object of class Tempfile, which is already opened for reading. I am using #readline to read through this file.

The problem now is that the file is UTF-8 encoded, and as soon as the data being read contains a character outside 7-bit ASCII, I get an exception.

What is the best way to read an uploaded UTF-8 file?

I was already thinking along the following line: The Tempfile class also has a method #path, which returns the path of the uploaded file. I could create a File object by opening this path, specify utf8 when opening it, and read from this.
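The reopen-by-path idea could look roughly like the sketch below. It runs outside Rails, so a hand-made Tempfile stands in for the uploaded file (an assumption for illustration); in the controller, the path would come from the uploaded object instead.

```ruby
require "tempfile"

# Stand-in for the uploaded file: a Tempfile written in binary mode.
# In a controller, the path would come from the object in params[:upload].
upload = Tempfile.new("upload")
upload.binmode
upload.write("Grüße\nZürich\n")
upload.flush

# Reopen the same file by path, this time declaring the external encoding.
lines = File.open(upload.path, "r:UTF-8") do |f|
  f.each_line.map(&:chomp)
end

upload.close
upload.unlink
```

Each string in `lines` is tagged as UTF-8, regardless of how the original handle was opened.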

However, since this problem must occur quite frequently, I wonder whether there is a way (maybe in the file_field_tag) to tell Rails that the Tempfile object should be opened as utf8 for reading. Is this possible, or is there another good way to deal with this problem?

Maybe try this? "ruby - Write and read a file with utf-8 encoding" on Stack Overflow. Does it matter if every file is considered UTF-8 even if it never contains a UTF-8 character?

In this case, it is pretty certain that every file will contain UTF-8 characters, and in general, I think there are few cases where we can assume the input to be 7-bit ASCII.

What I do not know for sure is whether or not the file will have a BOM, but I think Ruby can figure this out automatically, when supplying the "BOM" option on opening.
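As a minimal sketch of the BOM option: Ruby accepts `BOM|UTF-8` as the encoding part of the mode string and then skips a leading UTF-8 byte order mark if one is present. The helper name `first_line` and the sample data are made up for illustration.

```ruby
require "tempfile"

# Hypothetical helper: first line of a file, with a leading UTF-8 BOM skipped if present.
def first_line(path)
  File.open(path, "r:BOM|UTF-8") { |f| f.readline.chomp }
end

# Hypothetical helper: writes raw bytes to a temp file and yields its path.
def with_bytes(bytes)
  tmp = Tempfile.new("bom_demo")
  tmp.binmode
  tmp.write(bytes)
  tmp.flush
  yield tmp.path
ensure
  tmp.close
  tmp.unlink
end

# Same content, once with the three BOM bytes EF BB BF in front, once without.
from_bom_file   = with_bytes("\xEF\xBB\xBFname;city\n".b) { |path| first_line(path) }
from_plain_file = with_bytes("name;city\n".b)             { |path| first_line(path) }
```

Both calls should return the same string, so the calling code does not need to know whether the file was saved with a BOM.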

It would also make sense to allow files in other encodings, such as UTF-16, but this is something I will have to deal with later.

The Stack Overflow link you posted doesn't really answer my question, though. It just describes how I can *open* a UTF-8 file, and this is the workaround I'm using in the meantime (as outlined in my posting where I say: "I could create a File object by opening....").

What I would like to know is whether there is a simpler way (since the file, after all, is already open when my controller is entered), and in particular why set_encoding doesn't work for my Tempfile object, even though it would work well for a File object.

Gotcha. Is the file actually open when the controller is entered? (That’s an honest question; I’m interested in how that works for an upload coming from a form.) The approach you’ve described, which I failed to understand the first time, seems like the best way to me, but I’d be interested to see what others have to say.

Sorry I couldn’t be of more help.

Yes, it is, as I found by trial-and-error. Note that the object is not just a File, it is of class Tempfile. I think this is quite common when working with a Tempfile object. To make a Tempfile threadsafe, you have to combine the creation of the filename and the creation of the file into one call (otherwise you have a race condition if another process tries to create a tempfile in the same directory and by accident comes up with the same name).

While I didn't dive into the source code to see how Rails is implemented in this respect, it would be reasonable to assume that for the upload, a Tempfile object is created for read+write, the uploaded data is written to it, and the file pointer is repositioned at the beginning of the file before it is handed over to the controller. Since the uploading process can't know anything about the encoding, the file must have been opened as a binary file. That's why I had the idea that I just need to set the encoding to the desired value before starting to read from the file.

Does setting

    config.encoding = "utf-8"

in your config/application.rb help? You’d also need to add

    # encoding: UTF-8

to the top of your file.

I was reading this Craic Computing Tech Tips: Rails, UTF-8 and Heroku which seems to discuss this problem.

As far as I understand this article, it relates to Rails 3 and MySQL, and to how to use UTF-8 encoded data everywhere. I don't know about MySQL, but Rails 4 and Ruby 2 with SQLite don't suffer from this problem: I haven't had any trouble processing all kinds of Unicode characters with my application, and processing the uploaded file also works fine, as long as I use my (not very elegant) trick of opening it a second time with the desired encoding.

It now occurs to me that the question is maybe not Rails-specific, but a general Ruby question: how to change the encoding of a Tempfile object.
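To narrow the general Ruby side of this down, a small experiment outside Rails may help. The sketch below assumes a plain Tempfile from Ruby's standard library, written in binary mode and rewound. Whether the object Rails hands to the controller behaves the same way is not established here.

```ruby
require "tempfile"

tmp = Tempfile.new("upload")
tmp.binmode                        # binary, as an upload handler would plausibly write it
tmp.write("Füße\nzweite Zeile\n")
tmp.rewind                         # back to the start before reading

# Change the external encoding of the already-open handle, then read.
tmp.set_encoding("UTF-8")
line = tmp.readline.chomp

tmp.close
tmp.unlink
```

If this works for a plain Tempfile but not in the controller, the difference would point at the object Rails provides rather than at Tempfile itself.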

I have not been following this thread in detail, but [1] discusses how to use the encoding option when creating a Tempfile.

[1] Class: Tempfile (Ruby 1.9.3)

Colin

Colin,

That shows how to create a Tempfile with a given encoding but the question is when a user uploads a file through a form and Rails creates a Tempfile is there a way to indicate that it should always create those Tempfiles with a default encoding such as UTF-8?

In that case it *is* a Rails specific issue, not a Ruby question as suggested by the OP.

Colin

In that case is there something like a simple config change to be made in say config/application.rb that tells Rails how to encode created Tempfiles? If not would that be something that could/should be added to the Rails project itself?

Colin Law wrote in post #1152686:

In that case it *is* a Rails specific issue, not a Ruby question as suggested by the OP.

Indeed, you are right so far that it *might* be a Rails question. Still, I wonder why (in general) it is not possible to change the encoding of an existing (already open) Tempfile. Assuming that it is OK to "rewind" the file, I don't see a technical reason, why this is not possible.

I don't think it would be a good idea to configure this on the Rails side. Imagine the following scenario: we have a website that allows users to upload text files, the content of which will eventually go into the database. Since we are generous about the encoding, we also provide the user with a dropdown list to choose a suitable encoding.

When the user clicks the upload button, the controller gets the uploaded file plus information about the encoding. Clearly, Rails can not anticipate the encoding of the file. It just can upload the file (binary), and provide the controller with an open file handle.
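A rough sketch of that scenario follows. The helper name, the whitelist, and the sample file are all assumptions made for illustration. The idea is to read the raw bytes, tag them with the encoding the user selected, and transcode to UTF-8 for further processing.

```ruby
require "tempfile"

# Hypothetical whitelist of encodings offered in the dropdown.
ALLOWED_ENCODINGS = %w[UTF-8 UTF-16LE ISO-8859-1].freeze

# Hypothetical helper: reads the file as bytes, interprets them in the
# chosen encoding, and returns the content as a UTF-8 string.
def read_upload_as_utf8(path, encoding_name)
  unless ALLOWED_ENCODINGS.include?(encoding_name)
    raise ArgumentError, "unsupported encoding: #{encoding_name}"
  end
  raw = File.binread(path)
  raw.force_encoding(encoding_name).encode("UTF-8")
end

# Stand-in for an uploaded Latin-1 file ("café", with é as the single byte E9).
upload = Tempfile.new("upload")
upload.binmode
upload.write("caf\xE9".b)
upload.flush

text = read_upload_as_utf8(upload.path, "ISO-8859-1")

upload.close
upload.unlink
```

Checking the name against a fixed list keeps arbitrary user input from reaching `force_encoding`.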

Now Ruby *does* have the set_encoding method for File, and Tempfile is-a file, and set_encoding *can* be called - it just fails. We have nearly everything in place. Now, if we can find out WHY set_encoding fails (and this might be a generic Ruby question), we can find out what Rails (or the programmer) can do to let things go smoothly....

Ronald

OK, so after some digging: it seems that when you create your new File object and set the encoding, you may not need to read the Tempfile in its entirety. You can create a new File object from the Tempfile with File.new(my_temp_file, encoding: 'utf-8') and then use this file. It should use the same underlying file and just create a new file pointer with the new encoding. If you wanted to read the lines individually from the original Tempfile, you could call force_encoding('utf-8') on each line so that it is treated as UTF-8.
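A minimal sketch of the per-line force_encoding variant, assuming a plain binary Tempfile as a stand-in for the upload. Note that force_encoding only relabels the bytes and does not transcode them, so a validity check is added. For UTF-8 specifically, the bytes 0x0A and 0x0D never appear inside a multi-byte sequence, so splitting on newlines before tagging should be safe; that would not hold for UTF-16.

```ruby
require "tempfile"

tmp = Tempfile.new("upload")
tmp.binmode
tmp.write("Füße\nBär\n")
tmp.rewind

# Lines come back as binary strings; relabel each one as UTF-8 and verify it.
lines = tmp.each_line.map do |raw|
  line = raw.force_encoding("UTF-8")
  raise ArgumentError, "line is not valid UTF-8" unless line.valid_encoding?
  line.chomp
end

tmp.close
tmp.unlink
```

This does not strip a BOM; a file that starts with one would carry it into the first line.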

I think the idea of first reading the lines and then using force_encoding on the strings would not work, for two reasons:

1. As I have experienced, I already get the exception on the first byte which has the high-bit set (i.e. is not 7-bit ASCII)

2. If a UTF8 encoded character contained 0x0d or 0x0a, reading the line without being aware of the encoding, would "split" the character into two parts.

In addition, this solution would not account for a BOM (unless I write special logic to extract an optional BOM on the first line being read). Although I have only files without BOM at the moment, it is likely that sooner or later I will also have to support uploading of files which contain a BOM.

Ronald