ActiveSupport::Multibyte for better Unicode support

Three months ago Julian Tarkhanov submitted a test implementation of his ActiveSupport::Multibyte string extension patch. Since then we've been steadily improving the extension based on the feedback we received.

The code has been completely refactored to be more transparent and easier to understand. There is now a single optional accelerated backend and all multibyte-safe operations have a pure Ruby implementation. Test structure and coverage have also been greatly improved.

ActiveSupport::Multibyte is available as a plugin and can be converted to a patch using the included 'create_patch' rake task.

We would like to see ActiveSupport::Multibyte included in Rails so that developers can start depending on it for simpler and better Unicode support.

The ticket for the patch is at http://dev.rubyonrails.org/ticket/6242. More information and code can be found at https://fngtps.com/projects/multibyte_for_rails.

Manfred

We would like to see ActiveSupport::Multibyte included in Rails so
that developers can start depending on it for simpler and better
Unicode support.

I concur. Let this start an official request for comments. Any
objections to getting this into core?

I'm definitely keen to see this get added. However I'm a bit
concerned about the lack of discussion in this thread. It's a big
piece of work, and I was hoping more people would have opinions on it.

I think that's the problem: because the codebase is pretty esoteric, not many people want to dive in and give their opinion. I could explain on a global level, without getting into all the details concerning encoding, what it does and what decisions were made during coding, if anyone is interested.

Manfred

I'm interested in a general overview of what problem it fixes and why
it is needed. I don't know much about the whole Unicode problem with
Ruby that people keep bringing up, while others say it isn't really a
problem.

Peter

The ticket description already seems to be a very good general overview.
If my opinion counts and this package has been well tested, I’d say “Add please”.
Although if it only patches Ruby, not Rails, it could be a separate gem or a patch on Ruby core/stdlib.

Mathieu

Matz claims that Ruby currently has enough tools to deal with encoding. The problem is that you have to be an expert to do it right. The earliest Ruby is going to deal with encoding natively is Ruby 2.0, and that's not going to come out any time soon. So this leaves the encoding problem with the application programmers. Even though I have to admit that I would rather see a good solution in Ruby core or in a stdlib, it's not going to happen. ActiveSupport::Multibyte is an attempt to make dealing with encoding simpler for the Rails (core) programmer, right now. It could also work as a deprecation mechanism when/if encoding support in Ruby comes out.

If ActiveSupport::Multibyte would be released as a gem or standalone library, Rails code can't depend on it and we'd have to litter the code with if statements.

Manfred

Makes total sense. Thanks.

Peter,

The problem is correctly supporting multibyte strings. Unicode, the most complete character set, has several encodings (UTF-8 being the most popular one), each of them having some (or all) characters expressed with two or more bytes (unlike ASCII, for instance). In UTF-8, “abc” is a three-character string encoded in 3 bytes, but “čžš” (3 characters from the Croatian alphabet) is encoded in 6 bytes (2 bytes each).
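To make the byte/character distinction concrete, here is a minimal sketch. The method names `byte_count` and `char_count` are made up for illustration. `unpack("U*")` decodes the bytes as UTF-8 code points, so it gives the same answer whether or not the interpreter's own String class is multibyte-aware.

```ruby
# encoding: utf-8

# Number of bytes used to store the string.
def byte_count(str)
  str.unpack("C*").length
end

# Number of Unicode code points, assuming the bytes are valid UTF-8.
# Note: this counts code points, not user-perceived characters, so a
# letter followed by a combining accent counts as two.
def char_count(str)
  str.unpack("U*").length
end

["abc", "čžš"].each do |s|
  puts "#{s}: #{byte_count(s)} bytes, #{char_count(s)} code points"
end
```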

Multibyte-unaware programming languages (like Ruby and PHP < 6) assume 1 character = 1 byte. In Ruby, try string.reverse or string.length on strings containing special characters to see some unexpected results. Reverse will corrupt the string while length will report in bytes, not in characters. These are trivial examples, while the problem goes much deeper.
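The corruption caused by a byte-oriented reverse can be reproduced on any Ruby version by reversing the raw bytes by hand. This is only a sketch: `bytewise_reverse` and `codepoint_reverse` are hypothetical helper names, and the code-point version still mishandles combining sequences, which a full solution would have to treat as units.

```ruby
# encoding: utf-8

# Reverses the raw bytes, which is what a byte-oriented String#reverse does.
def bytewise_reverse(str)
  str.unpack("C*").reverse.pack("C*")
end

# Reverses by UTF-8 code point instead, keeping each multibyte
# sequence intact.
def codepoint_reverse(str)
  str.unpack("U*").reverse.pack("U*")
end

word = "čžš"
# The byte-wise result now starts with a continuation byte, so it is
# no longer valid UTF-8.
puts bytewise_reverse(word).unpack("C*").inspect
puts codepoint_reverse(word)
```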

Rails needs this.

It appears this doesn't have any native/C code, but can you confirm
that in case I'm not looking hard enough? Obviously we JRubyists
wouldn't want anything in Rails to start requiring code we can't run.

Charles O Nutter wrote:

Three months ago Julian Tarkhanov submitted a test implementation of
his ActiveSupport::Multibyte string extension patch. Since then we've
been steadily improving the extension based on the feedback we received.

It appears this doesn't have any native/C code, but can you confirm
that in case I'm not looking hard enough? Obviously we JRubyists
wouldn't want anything in Rails to start requiring code we can't run.

How does JRuby handle strings? If they are mapped to java.lang.String, the JRuby already has more than adequate Unicode support.

It seems to me that .chars should return back the same object, if the underlying VM supports Unicode. I would guess that today that would include JRuby, and in the future, that would include Ruby 2.0.

Some day in the future, when Ruby 1.x is a distant memory, .chars should be deprecated, and ultimately removed.

- Sam Ruby

Some day in the future, when Ruby 1.x is a distant memory, .chars should
be deprecated, and ultimately removed.

That's definitely our intention. If JRuby is using java.lang.String,
then a simple plugin which does the following would be sufficient:

class String
  def chars
    self
  end
end

We'll update ActiveSupport to contain that (with appropriate
deprecation) when ruby 2.x comes to the party.
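For contrast with the no-op above, on a byte-oriented Ruby the proxy has to do real work. The following is a rough sketch of the idea only, not the ActiveSupport::Multibyte code: the class name `Utf8Proxy` is invented here, it assumes valid UTF-8 input, and it only overrides `length` and `reverse`, forwarding everything else to the wrapped string.

```ruby
# encoding: utf-8

# Hypothetical stand-in for a chars-style proxy. It wraps a string and
# answers selected methods in terms of code points instead of bytes.
class Utf8Proxy
  def initialize(string)
    @string = string
  end

  # Length in code points rather than bytes.
  def length
    @string.unpack("U*").length
  end

  # Reverse by code point and wrap the result so calls can be chained.
  def reverse
    Utf8Proxy.new(@string.unpack("U*").reverse.pack("U*"))
  end

  def to_s
    @string
  end

  # Anything not overridden above is forwarded to the wrapped string.
  def method_missing(name, *args, &block)
    if @string.respond_to?(name)
      @string.__send__(name, *args, &block)
    else
      super
    end
  end

  def respond_to_missing?(name, include_private = false)
    @string.respond_to?(name, include_private) || super
  end
end

proxy = Utf8Proxy.new("čžš")
puts proxy.length
puts proxy.reverse.to_s
```

Because callers always go through the proxy, swapping it for a method that returns `self` later on would not require changes at the call sites.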

I'm definitely in favour of seeing something like this in core. Better unicode handling is needed yesterday! The chars proxy is a very nice way of handling this.

A question:

How does this compare to the unicode_hacks plugin? (See http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/) They seem very similar in both intent and interface.

Some comments:

Even with this plugin, supporting unicode in a Rails app is too complicated and fiddly. For those who haven't tried it, here are the steps:

- Make sure your database character set is utf8
- Make sure all your tables have a character set of utf8
- Make sure your database.yml has 'encoding: utf8' set for each database
- Put $KCODE='u' in your environment.rb
- Add an after_filter to application.rb to set the Content-Type header correctly
- Add 'normalize_unicode_params :form => :kc' to your application.rb

Missing one of these steps can produce strange results and corrupted data.
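As a rough illustration of why the last step matters, the sketch below assumes that `:kc` refers to Unicode Normalization Form KC (NFKC). It uses `String#unicode_normalize`, which only exists in Ruby 2.2 and later, so it shows the concept rather than what the plugin itself does. The helper name `normalize_kc` is made up.

```ruby
# encoding: utf-8

# Normalizes text to compatibility-composed form (NFKC).
def normalize_kc(str)
  str.unicode_normalize(:nfkc)
end

decomposed = "e\u0301"  # "e" followed by a combining acute accent
composed   = "\u00E9"   # the precomposed letter

# The two strings look identical on screen but differ byte for byte,
# so comparisons and lookups fail unless input is normalized first.
puts decomposed == composed
puts normalize_kc(decomposed) == normalize_kc(composed)

# Compatibility characters such as the "fi" ligature are also folded.
puts normalize_kc("\uFB01")
```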

If unicode support is being included in core, then this needs to be rationalised. Ideally a single setting in environment.rb should take care of all of this. I also think it should be enabled by default. (Who doesn't want to support unicode nowadays?)

Rumour also has it that ActiveRecord, when recreating timed-out database connections, doesn't honour the 'encoding: utf8' setting. I've never run into this personally, so I assume it was fixed at some point?

Cheers,

Pete Yandell

Confirmed. All operations are implemented as pure Ruby.

Kind regards,
Thijs


- Make sure your database character set is utf8
- Make sure all your tables have a character set of utf8
- Make sure your database.yml has 'encoding: utf8' set for each database

None of these steps are required officially unless you use utf-8
specific features of the database (collation). The last setting seems
to set the connection encoding, which shouldn't be required unless
there is non-utf8 data stored in the database.

- Put $KCODE='u' in your environment.rb

This is only required if you use unicode strings in your Ruby code.

- Add an after_filter to application.rb to set the Content-Type
header correctly

Rails now defaults to utf-8 Content-Type.

Joshua Sierles

ActiveSupport::Multibyte is a component of the Multibyte for Rails project, which is basically the next version of the unicode_hacks plugin.

Kind regards,
Thijs


- Make sure your database character set is utf8
- Make sure all your tables have a character set of utf8
- Make sure your database.yml has 'encoding: utf8' set for each database

None of these steps are required officially unless you use utf-8
specific features of the database (collation). The last setting seems
to set the connection encoding, which shouldn't be required unless
there is non-utf8 data stored in the database.

Not true! Collation and character set are separate things.

There are a couple of obvious reasons you want your database character set to be UTF8 if you're storing UTF8 strings:

1. When you access the database through the mysql (or pgsql, or other) command line, or through tools such as CocoaMySQL, you want strings to display properly.

2. MySQL never treats strings as binary; they always have a character set, which is latin1 (CP1252) by default. Putting UTF8 data into fields marked as latin1 seems like asking for trouble. (There are some byte values that are invalid in CP1252, so technically strings containing those bytes are illegal. It's only through MySQL's laziness in not checking the strings when the connection and table character sets match up that you can get away with this at all.)

There are even worse potential pitfalls here too. On one of our projects, we did everything except set the connection encoding. What happened was that a UTF8 string in Rails would be regarded as CP1252 by MySQL, but MySQL knew that the tables needed UTF8, so it did a CP1252 to UTF8 conversion on the (already UTF8) string before writing it. As you can imagine, we ended up with all sorts of crap in the database, and the occasional string got completely munged as invalid CP1252 bytes were replaced with question marks.
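The double conversion described above can be imitated in a few lines. This is a simulation under assumptions, not MySQL's actual code path: it uses the encoding API of Ruby 1.9 and later, takes Ruby's "Windows-1252" table as a stand-in for MySQL's latin1, and `misread_as_cp1252` is a hypothetical helper name.

```ruby
# encoding: utf-8

# Simulates a server that believes incoming UTF-8 bytes are CP1252 and
# converts them to UTF-8 before storing them. Bytes with no CP1252
# meaning are replaced with a question mark.
def misread_as_cp1252(utf8_string)
  relabelled = utf8_string.dup.force_encoding("Windows-1252")
  relabelled.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
end

# Each byte of the original two-byte sequence becomes its own character.
puts misread_as_cp1252("é")

# The second byte of "č" (0x8D) has no CP1252 mapping, so it is lost.
puts misread_as_cp1252("č")
```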

These three things should at least be reduced to a single setting to avoid mistakes. I can't imagine a situation in which you would want to do one of them without the others.

- Put $KCODE='u' in your environment.rb

This is only required if you use unicode strings in your Ruby code.

If your app handles UTF8, then you're going to want to write tests involving UTF8 strings, so you're going to need this turned on. You do write UTF8 tests for your apps, right? :)

- Add an after_filter to application.rb to set the Content-Type
header correctly

Rails now defaults to utf-8 Content-Type.

Good to know. I'll take this as an endorsement of the idea that UTF8 should be the default for Rails apps. :)

Cheers,

Pete Yandell

I have to put in my two cents here. I can't see any reason why one
_wouldn't_ want to use UTF-8 over plain-ol' ASCII. It's a totally
different ball game than localization; I just want my users to be able
to input data using their own native characters. What app doesn't
have a "full name" field for a user? Shouldn't your users be able to
input their name properly? :)

Besides implementation issues, I can't see any real downside to
supporting UTF-8 out of the box in Rails. It would sure avoid a lot
of potential issues...

Dave

It's a sanctioned evolution thereof. Manfred and Thijs took over the business while I am plowing through my internship (which BTW has nothing to do with Rails and web development). We split the repositories so that they can perform exhaustive code changes without hurting everyone sitting on unicode_hacks.

Tell it to the Japanese and the Chinese railers. I wonder how long you will stand before you get your ass served :)