ActiveSupport::Multibyte for better Unicode support

You mean they would get mad if Rails _did_ support UTF-8 out of the box?

Dave

You mean they would get mad if Rails _did_ support UTF-8 out of the box?

Yeah, UTF-8 and Unicode aren't terribly popular in Japan. For more information than you ever thought you'd want, you can read up on Han unification. It's also much less space-efficient than their 'legacy' encodings.
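To put a rough number on the space argument, here is a minimal sketch. It assumes a Ruby version with built-in transcoding (`String#encode`), which postdates this thread, and the sample text is made up for illustration.

```ruby
# encoding: utf-8

# Sample Japanese text: 8 characters, all kanji or kana (assumed for illustration).
text = "日本語のテキスト"

# Byte size of the same text in each encoding.
sizes = {}
["UTF-8", "Shift_JIS", "EUC-JP"].each do |name|
  sizes[name] = text.encode(name).bytesize
end

sizes.each { |name, bytes| puts "#{name}: #{bytes} bytes" }
```

For text like this, UTF-8 needs three bytes per character where Shift-JIS and EUC-JP need two. ASCII-heavy text (markup, for example) narrows the gap.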

Why? It's not like Rails supports Japanese or Chinese encodings out of the box now. How is going from supporting just ASCII to supporting UTF-8 taking anything away from Japanese or Chinese railers?

Like David said, what exactly is the downside to default UTF-8 support? Who does it hurt, how, and why?

Pete Yandell

Michael Koziarski wrote:

You mean they would get mad if Rails _did_ support UTF-8 out of the box?

Yeah, UTF-8 and Unicode aren't terribly popular in Japan. For more information than you ever thought you'd want, you can read up on Han unification. It's also much less space-efficient than their 'legacy' encodings.

Java and C# seem to do OK in Japan.

I would also imagine that ASCII wouldn't be very popular in Japan. :)

- Sam Ruby

Java and C# seem to do OK in Japan.

I would also imagine that ASCII wouldn't be very popular in Japan. :)

I should clarify: don't take my previous statement as disagreeing with "UTF-8 everywhere". I'm for it, not against it. But it's definitely not as simple an issue as it appears at first glance ;)

ActiveSupport::Multibyte doesn't favor any encoding. It currently implements UTF-8 operations because that's what we, and a lot of other people on the web, use daily. We believe that you shouldn't implement anything you're not going to use yourself. This is also explained on our Trac page, in the FAQ.

https://fngtps.com/projects/multibyte_for_rails/wiki/FAQ

Manfred

And for good reason. I have yet to see an example of something that you can do in Shift-JIS and EUC that you can't do with Unicode 5 encoded as UTF-8. I'm not saying there are no issues some people feel strongly about, but there are certainly no compelling technical or practical reasons why you can't use Unicode in Japan.

Even so, Ruby supports Shift-JIS and EUC and will continue to. Because Rails gets so much out of Ruby, it would be somewhat rude if the next Rails release were to make it impossible to use these encodings.

That's _exactly_ why ActiveSupport::Multibyte is designed to support multiple encodings. The only reason Shift-JIS and EUC are currently not implemented in ActiveSupport::Multibyte is that we don't feel comfortable building stuff we don't use.

So, if you need Shift-JIS or EUC, please add it to ActiveSupport::Multibyte and send us a patch.

For more information see the Multibyte for Rails FAQ:

https://fngtps.com/projects/multibyte_for_rails/wiki/FAQ

Kind regards, Thijs

There's no downside to default UTF-8 support, but it would be nice if switching from the default to Shift-JIS or EUC were as easy as changing $KCODE = 'utf-8' to $KCODE = 'sjis'.

If you want this, please add Shift-JIS and/or EUC support in ActiveSupport::Multibyte and send us a patch.

Kind regards, Thijs

So, if you need Shift-JIS or EUC, please add it to ActiveSupport::Multibyte and send us a patch.

Other encodings can be supported with plugins initially. I'm personally happy with UTF-8 only as a position for 1.2.

> - Make sure your database character set is utf8
> - Make sure all your tables have a character set of utf8
> - Make sure your database.yml has 'encoding: utf8' set for each database

None of these steps are required officially unless you use utf-8 specific features of the database (collation). The last setting seems to set the connection encoding, which shouldn't be required unless there is non-utf8 data stored in the database.

> - Put $KCODE='u' in your environment.rb

This is only required if you use Unicode strings in your Ruby code.

> - Add an after_filter to application.rb to set the Content-Type header correctly

Rails now defaults to utf-8 Content-Type.

So, if we merged in ActiveSupport::Multibyte, and updated helpers like truncate to use the chars proxy, what other changes would be required to make this stuff simple? Normalisation of input parameters? Anything else?

It would be nice if we could make it really easy to have this stuff 'just work' without much in the way of additional user intervention.
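To make the truncate point concrete, here is a rough sketch of why a helper has to count characters rather than bytes. It assumes a later Ruby where strings are character-aware by themselves, so it does not use the chars proxy. `truncate_chars` is a hypothetical name, not the Rails helper.

```ruby
# encoding: utf-8

# Hypothetical helper (not the Rails `truncate` implementation):
# shortens text to at most `length` characters, omission included.
def truncate_chars(text, length, omission = "...")
  return text if text.length <= length
  keep = [length - omission.length, 0].max
  text[0, keep] + omission
end

name = "Bjørk Guðmundsdóttir"

# Character-based truncation keeps every character whole.
by_chars = truncate_chars(name, 10)

# Byte-based slicing cuts the two-byte "ø" in half and leaves invalid UTF-8.
by_bytes = name.byteslice(0, 3)

puts by_chars
puts by_bytes.valid_encoding?
```

On Ruby 1.8, where String methods work on bytes, the same idea is what the chars proxy is for.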

Well, normalization of input parameters depends on the situation. If you want to compare strings you probably want compatibility normalization (like NFKC), but compatibility normalization forms also lose data.

For instance, the ligature ﬃ (U+FB03):

"ﬃ".chars.normalize(:kc) #=> "ffi"

Or the 'vulgar fraction one quarter':

"¼".chars.normalize(:kc) #=> "1/4"

When you're comparing strings, you might want "¼" to be equal to "1/4". When you want your users to use nice glyphs, you can't just discard this data.

But _if_ you normalize, you have to make sure you _always_ normalize. For instance, when you normalize a password before saving it to the database, you also have to normalize every password submitted from a form, otherwise the password might not match when the user enters it. Using NFKC might introduce false positives because "¼".chars.normalize == "1/4".chars.normalize, which isn't a very large problem if the rest of the password is strong enough.
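The "always normalize" rule can be sketched like this. It assumes a Ruby new enough to have `String#unicode_normalize` instead of the chars proxy, and `normalize_password` is a hypothetical helper, not part of the plugin.

```ruby
# encoding: utf-8

# Hypothetical helper: run on BOTH the stored value and every submitted value.
def normalize_password(input)
  input.unicode_normalize(:nfkc)
end

# Stored once, typed with the ligature U+FB03 ("ﬃ").
stored = normalize_password("o\uFB03ce")

# Submitted later with three plain letters.
typed = normalize_password("office")

# Equal after NFKC: this is the "false positive" trade-off described above.
puts stored == typed

# Without normalization the two inputs are different strings.
puts "o\uFB03ce" == "office"

# NFKC of "¼", shown as code points for inspection.
quarter = "\u00BC".unicode_normalize(:nfkc)
puts quarter.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
```

One detail worth checking in your own setup: in the Unicode data as I know it, the compatibility decomposition of ¼ uses U+2044 FRACTION SLASH rather than the ASCII "/", so a fraction typed as 1/4 on a keyboard may not compare equal after NFKC.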

Currently normalization is implemented in a separate plugin called 'utf8_plugin' [1], and can be turned on by the class method `normalize_unicode_params'.

You can find more information in our Unicode Primer [2].

Manfred

[1] https://fngtps.com/svn/multibyte_for_rails/utf8_plugin
[2] https://fngtps.com/projects/multibyte_for_rails/wiki/UnicodePrimer

OK, so ActiveSupport::Multibyte would work with SJIS and EUC-JP, but it seems to need some extra work from someone who understands those encodings.

Well, I think if ActiveSupport::Multibyte gets integrated into Rails with decent docs (docs that include writing plugins for other encodings), you have a much better chance of seeing a Japanese guru send you a patch. If it does not get integrated, they won’t know about it, or won’t care because it isn’t mainstream.

and I am using utf-8 a good 80% of the time anyway, so I’m totally with the motion.

+3

Normalization on input and before saving to the database, but this might scare some people off if used wrong. What Rails might do is adopt the Character Model for the Web and just stick to C normalizations everywhere.

However, I think this should still stay optional, because it might raise exceptions and leave loose ends in situations where people send raw byte strings as input parameters. What I do is define input normalization as a filter in ApplicationController, as the step in the chain responsible for input sanitization.

Implicit normalization at runtime is not the way because it transiently changes the offsets of strings as soon as you slice/truncate/concatenate.
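A small illustration of the offset problem, assuming a Ruby with `String#unicode_normalize` (newer than what this thread targets). The sample string is made up.

```ruby
# encoding: utf-8

# "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "cafe\u0301 au lait"

# NFC composes the pair into the single character U+00E9.
composed = decomposed.unicode_normalize(:nfc)

puts decomposed.length       # character count before normalization
puts composed.length         # one character fewer afterwards

# The same substring sits at a different offset in each form.
puts decomposed.index("au")
puts composed.index("au")

# Slicing the first four characters gives different text in each form.
puts decomposed[0, 4]
puts composed[0, 4]
```

Any offset computed before normalization is off by one after it, which is why normalizing behind the caller's back between a slice and a concatenate is risky.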

KCODE, all response charsets out of the box UTF, maybe processing the params with iconv according to the request-charset. But first and foremost - clear documentation.

KCODE, all response charsets out of the box UTF, maybe processing the params with iconv according to the request-charset.

Is the request charset sent by all browsers for all requests? How risky is automatically translating with iconv (assuming it's available)? Incidentally, this is what I meant by normalization, that'll teach me to use a reserved word ;).

But first and foremost - clear documentation.

What do you feel is currently missing from the ActiveSupport::Multibyte patch?

So, if we merged in ActiveSupport::Multibyte, and updated helpers like truncate to use the chars proxy, what other changes would be required to make this stuff simple? Normalisation of input parameters? Anything else?

KCODE,

I agree. It's the Ruby way to set your encoding using $KCODE so Rails 1.2 should have $KCODE='utf-8' in environment.rb

all response charsets out of the box UTF,

This is already in trunk since changeset 5129.

maybe processing the params with iconv according to the request-charset.

This is only needed for very old and badly broken browsers. I don't think Rails should do this by default.

Kind regards, Thijs

It doesn't hurt _us_. I'm 200% for it anyways, just wanted to bring the point before anyone sneaks up on us about it.

KCODE, all response charsets out of the box UTF, maybe processing the params with iconv according to the request-charset.

Is the request charset sent by all browsers for all requests? How risky is automatically translating with iconv (assuming it's available)? Incidentally, this is what I meant by normalization, that'll teach me to use a reserved word ;).

I see almost no risk. It has to do with a browser (or a REST client, for that matter) using a different charset when making the request. The server receiving the request can then decode the request into its internal encoding. This is how (among others) the Trackback system works in MovableType. But just as Thijs said, we might as well omit that.

It has nothing to do with normalisation.
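For reference, decoding request parameters into the internal encoding could look roughly like this. It is a sketch that uses Ruby's later built-in `String#encode` rather than iconv, and `params_to_utf8` plus the flat params hash are assumptions for illustration, not anything Rails does.

```ruby
# encoding: utf-8

# Hypothetical helper: reinterpret each raw value in the charset the client
# declared, then transcode it to UTF-8. Raises if the bytes are not valid
# in the declared charset.
def params_to_utf8(params, request_charset)
  params.each_with_object({}) do |(key, value), result|
    result[key] = value.dup.force_encoding(request_charset).encode("UTF-8")
  end
end

# Simulate a client that sent Shift-JIS bytes (String#b drops the encoding tag).
raw = "日本語".encode("Shift_JIS").b

decoded = params_to_utf8({ "title" => raw }, "Shift_JIS")

puts decoded["title"]
puts decoded["title"].encoding
```

Whether to raise or to substitute replacement characters on invalid input is a policy choice this sketch leaves open.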

But first and foremost - clear documentation.

What do you feel is currently missing from the ActiveSupport::Multibyte patch?

As one of the authors I feel pretty secure here. Just wanted to make sure the big README we have put there gets a visible spot in the AS docs.

How does JRuby handle strings? If they are mapped to java.lang.String, then JRuby already has more than adequate Unicode support.

JRuby does use java.lang.String, but we have to artificially downgrade everything to a single-byte encoding for Ruby's sake. Because there's no concept of characters versus bytes in Ruby, we can't really support multibyte characters or code points or what have you without creating incompatible interfaces. It's a source of great frustration for us, so much so that we're probably just going to create some incompatibilities to solve the Unicode issue on our end. It's likely that in the future all strings in JRuby will be UTF-16 strings as in Java, and all operations will deal in characters instead of bytes wherever possible. We'll deal with issues as they come up, such as handling IO that wants byte counts when we're providing character counts.

It seems to me that .chars should return back the same object, if the underlying VM supports Unicode. I would guess that today that would include JRuby, and in the future, that would include Ruby 2.0.

chars would be easy to implement today, and really we may look at the ActiveSupport::Multibyte way of handling Unicode as "the one way" we also do it in JRuby. Rails is driving Unicode innovation at this point, so if this sees wider adoption we're not opposed to including it in core JRuby.

To be absolutely clear: we want to support Unicode natively in JRuby, and we're really just looking to the community to decide what form that should take. If there's something that can be done within Ruby 1.8-semantics that works with Ruby 1.8-compatible apps, we'll include it.