mb_chars.upcase and Ruby 1.9.2

I'm testing ruby-head through rvm but can't get 'ação'.mb_chars.upcase == 'AÇÃO'... I get 'AçãO' instead...

This happens both for Rails 2.3.5 and Rails 3 beta 3...

How can I get upcase to work correctly?

Thanks in advance,

Rodrigo.

In 1.9, mb_chars simply returns self. This behaviour comes
straight from Ruby core:

http://github.com/rails/rails/blob/master/activesupport/lib/active_support/core_ext/string/multibyte.rb#L53-65

Is there any approach currently used for making the Ruby 1.8/Rails 2.3.5 behavior the same in Ruby 1.9?

This is important for virtually any non-English application... Are there any plans to integrate a library that achieves the same results Rails currently supports?

Rodrigo.

My understanding is that Ruby 1.9 is meant to support all these
operations internally; our mb_chars functionality was only ever
intended as a stop-gap until Ruby itself could do native multi-byte
aware string operations. So what you're seeing are bugs in Ruby which
should be fixed there. We probably shouldn't be maintaining a second
multi-byte aware library.

Not a solution, and perhaps you're already aware of this, but as a
workaround to these issues you can get an instance of
ActiveSupport::Multibyte::Chars and perform the operations you need:

ActiveSupport::Multibyte::Chars.new("café").upcase

This lets you use the same methods that would be used on Ruby 1.8.

Regards,

Norman

Please, take a look at this documentation for String#upcase:

http://ruby-doc.org/ruby-1.9/classes/String.html#M000593

"Returns a copy of str with all lowercase letters replaced with their uppercase counterparts. The operation is locale insensitive—*only characters ``a’’ to ``z’’ are affected*. Note: case replacement is effective only in ASCII region."
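
To make the documented behaviour concrete, here is a minimal sketch. It uses String#tr with the range a-z, which only touches ASCII letters, as a stand-in for what String#upcase does on 1.9. That equivalence is an assumption made for illustration, and the helper name ascii_upcase is hypothetical.

```ruby
# encoding: utf-8

# Hypothetical helper: upcases only the ASCII letters a-z, mirroring the
# documented 1.9 behaviour of String#upcase quoted above.
def ascii_upcase(string)
  string.tr('a-z', 'A-Z')
end

puts ascii_upcase('ação')  # prints "AçãO": ç and ã are outside a-z
puts ascii_upcase('cafe')  # prints "CAFE"
```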

It doesn't seem Ruby 1.9 will change this behavior, so Rails should keep using its proxy approach as long as Ruby doesn't support this itself.

My guess is that mb_chars should be set on Rails initialization with something like:

def mb_chars
  self
end

String.send :include, StringMultiBytePatch unless 'ação'.upcase == 'AÇÃO'

Of course this is not the real code, just a suggestion of an approach... The StringMultiBytePatch module would override mb_chars to use the ActiveSupport::Multibyte::Chars proxy, as noted by Norman Clarke.
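
The feature-detection idea could be sketched roughly as follows. This is not the Rails implementation: DemoChars is a hypothetical stand-in for ActiveSupport::Multibyte::Chars with a deliberately tiny mapping table, and demo_mb_chars is a hypothetical free-standing helper rather than a patch to String.

```ruby
# encoding: utf-8

# Hypothetical stand-in for a multibyte proxy. The table is illustrative
# and covers only three letters; a real proxy needs full Unicode data.
class DemoChars
  UPCASE_MAP = { 'ç' => 'Ç', 'ã' => 'Ã', 'é' => 'É' }.freeze

  def initialize(string)
    @string = string
  end

  # Map the known non-ASCII letters first, then the ASCII range.
  def upcase
    mapped = @string.gsub(/[çãé]/) { |char| UPCASE_MAP[char] }
    self.class.new(mapped.tr('a-z', 'A-Z'))
  end

  def to_s
    @string
  end
end

# Feature detection: does the running Ruby upcase non-ASCII letters itself?
NATIVE_UNICODE_UPCASE = ('ação'.upcase == 'AÇÃO')

# Return the plain string when Ruby can do the job, the proxy otherwise.
def demo_mb_chars(string)
  NATIVE_UNICODE_UPCASE ? string : DemoChars.new(string)
end

puts demo_mb_chars('ação').upcase.to_s  # prints "AÇÃO" either way
```

Note that a module included into String cannot override a method defined directly on String, because the class's own methods are looked up first. That is one reason this sketch uses a separate helper instead of an include.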

Please, see also this thread from 2008:
http://old.nabble.com/String-upcase-downcase-with-UTF-8-strings-in-Ruby-1.9-td18372062.html

Hi Norman, while this seems to work with Rails 3 beta, it didn't work with Rails 2.3.5 in my tests...

Any idea why this behavior differs between 2.3.5 and 3?

Thanks,

Rodrigo.

I had been considering working on a patch to add a "light" proxy class
for 1.9.x that uses some but not all of the methods in the proxy class
for 1.8.

If it's true that there are no plans to add UTF-8 case-folding to Ruby
1.9 then I think it would be a good idea. I've been working on
multibyte a bit lately and would be happy to work on it some more if
folks think it would be useful.

There are also a couple of pedantic issues with AS's case folding,
such as incomplete support for Greek and Turkic languages, that I'd
like to fix. I'll look into it this week to see if maybe that would be
worthwhile as well.

-Norman

Interesting, I didn’t realize this was going to change in 1.9.2. While I sympathize with Matz for not wanting to step into the minefield that is case folding, I’m a bit disappointed. With no built-in support for that, or normalization, Ruby’s UTF-8 support is so weak that I find myself relying on AS more and more, even outside Rails apps.

I had considered working on a light multibyte proxy class for 1.9 when 1.9.1-p343 broke String#center and a few other methods, but decided against it when I saw it fixed in 1.9.2. AS’s case folding is a little lacking too, because it doesn’t implement case folding for Greek and Turkic as recommended for Unicode 5.1.

I’ve been hacking on multibyte quite a bit lately and would be happy to take a longer look if folks think it’s worthwhile.

-Norman

Sorry for the double post. Looks like I accidentally sent an
earlier draft from my phone.

I'd say that developing this as part of the I18n gem or even standalone would be better than as part of Rails, as it would be very useful outside of Rails, and not everybody who uses Rails would need this functionality.

I agree that writing this in I18n or a standalone library would probably be better because of your first argument, but not because of the last one...

Rails takes a different approach from Merb or Sinatra in that it is a full-stack framework. I believe multibyte support would be more useful for most people than REST support, for instance...

But since AS is also an independent library and could be used outside Rails too, I don't see any problems in patching String in AS... But I think it would be cleaner if it was an independent library that could be used inside I18n or AS gem...

Rodrigo.

These two libraries provide pretty good support for UTF-8 manipulation:

http://github.com/blackwinter/unicode
http://github.com/lang/unicode_utils

Yoshida Masato's is written in C and provides good performance, while
Stefan Lang's is written in Ruby and also appears to provide support
for proper UTF-8 case folding, so there's probably no need to
duplicate the effort of adding that to AS; it should be easy enough to
just implement proxy classes that use them, and make AS use them in
place of its default proxy class:

ActiveSupport::Multibyte.proxy_class = PutativeUnicodeProxyClass
ActiveSupport::Multibyte.proxy_class = PutativeUnicodeUtilsProxyClass
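
The swappable-proxy idea can be illustrated without ActiveSupport. Everything in this sketch is hypothetical (DemoMultibyte, AsciiOnlyProxy, TableProxy, wrap); it only mirrors the shape of assigning a proxy class, not the real ActiveSupport::Multibyte internals.

```ruby
# encoding: utf-8

# Hypothetical registry that holds the class used to wrap strings.
module DemoMultibyte
  class << self
    attr_accessor :proxy_class
  end
end

# Thin proxy: ASCII-only case mapping.
class AsciiOnlyProxy
  def initialize(string)
    @string = string
  end

  def upcase
    self.class.new(@string.tr('a-z', 'A-Z'))
  end

  def to_s
    @string
  end
end

# Richer proxy: also maps two non-ASCII letters (illustrative only).
class TableProxy < AsciiOnlyProxy
  def upcase
    self.class.new(@string.tr('a-zçã', 'A-ZÇÃ'))
  end
end

# Wrap a string in whichever proxy is currently configured.
def wrap(string)
  DemoMultibyte.proxy_class.new(string)
end

DemoMultibyte.proxy_class = AsciiOnlyProxy
puts wrap('ação').upcase.to_s  # prints "AçãO"

DemoMultibyte.proxy_class = TableProxy
puts wrap('ação').upcase.to_s  # prints "AÇÃO"
```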

But I do think that Rails should still provide decent support for case
folding, and the behavior of commonly-used things like #upcase and
#downcase should not change so dramatically when you use Ruby 1.9 vs
1.8. It would be pretty simple to extract some methods from
Multibyte::Chars into a module that can be shared between the current
feature-rich proxy class for 1.8 and a thinner one for 1.9.

-Norman

Agreed. Is it possible in Bundler to add a dependency on either the unicode or the unicode_utils gem? This should work like script/server in Rails 2: if it finds Mongrel, use it; otherwise, use WEBrick... If the faster C implementation is available, use it; else try the pure Ruby alternative... Is it possible?
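
Whether Bundler can express an either-or dependency is not settled in this thread. A common alternative is a runtime fallback, sketched below under the assumption that the two gems expose Unicode.upcase and UnicodeUtils.upcase as their documentation describes. The constant UPCASER is hypothetical, and the last branch simply falls back to the built-in String#upcase.

```ruby
# encoding: utf-8

# Pick the best available upcasing backend at load time.
UPCASER =
  begin
    require 'unicode'         # C extension, used if installed
    lambda { |string| Unicode.upcase(string) }
  rescue LoadError
    begin
      require 'unicode_utils' # pure Ruby, used if installed
      lambda { |string| UnicodeUtils.upcase(string) }
    rescue LoadError
      # Neither gem is available: use whatever Ruby itself provides.
      lambda { |string| string.upcase }
    end
  end

puts UPCASER.call('abc')  # prints "ABC" with any backend
```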

Rodrigo.

AS::Multibyte currently implements two things: encoding aware string
operations and Unicode algorithms. 1.9 only implements encoding aware
string operations. We could activate the proxy with the Unicode
operations for 1.9, that should solve most people's problems.
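
The distinction between the two can be shown with a short sketch. The first group relies only on encoding-aware string operations; the second shows, via code points, why case mapping needs Unicode data on top of that. The variable name is illustrative.

```ruby
# encoding: utf-8

word = 'ação'

# Encoding-aware operations: these count and move whole characters,
# not bytes, for a UTF-8 string.
puts word.length    # prints 4
puts word.bytesize  # prints 6 (ç and ã take two bytes each in UTF-8)
puts word.reverse   # prints "oãça"

# Unicode algorithms need character data. Case mapping, for example, has
# to know that code point 231 (ç) maps to code point 199 (Ç).
p 'ç'.unpack('U*')  # prints [231]
p 'Ç'.unpack('U*')  # prints [199]
```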

I don't really like the idea of depending on external libraries for
this kind of functionality because the most used algorithms are
already defined in Multibyte.

Manfred

I agree. I was thinking more about implementing proxy classes for them
in a separate library that people could use, for example, if they
needed either the high performance of the library written in C, or the
proper case-folding for Greek and Turkic that the other one provides.

-Norman

That's more or less how it is right now. The C implementation is called
Unichars: http://github.com/Manfred/unichars.

Interesting, I didn't realize this was going to change in 1.9.2.

1.9.2's feature set is already frozen, and it doesn't have such Unicode utilities.

We in ruby-core know about the need for Unicode utilities and have had some discussion
about it, but we can't agree on the spec and implementation.
I think it needs more time.

While I
sympathize with Matz for not wanting to step into the minefield that is case
folding, I'm a bit disappointed. With no built-in support for that, or
normalization, Ruby's UTF-8 support is so weak that I find myself relying on
AS more and more, even outside Rails apps.

I had considered working on a light multibyte proxy class for 1.9 when
1.9.1-p343 broke String#center and a few other methods, but decided against
it when I saw it fixed in 1.9.2. AS's case folding is a little lacking too,
because it doesn't implement case folding for Greek and Turkic as
recommended for Unicode 5.1.
I've been hacking on multibyte quite a bit lately and would be happy to take
a longer look if folks think it's worthwhile.

FYI:
If you implement case folding for Greek and Turkic, the string (or something
associated with it) needs language information. Selecting font, calculating width,
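
Why language information matters can be seen with the Turkish dotted and dotless i. The sketch below is only an illustration: upcase_for is a hypothetical helper, the :tr language tag is an assumption, and only ASCII letters plus the two Turkish i variants are handled.

```ruby
# encoding: utf-8

# Hypothetical helper: the language argument selects the casing rules.
# In Turkish, i upcases to İ (dotted) and ı (dotless) upcases to I.
def upcase_for(string, language = nil)
  string = string.tr('iı', 'İI') if language == :tr
  string.tr('a-z', 'A-Z')
end

puts upcase_for('izmir')       # prints "IZMIR"
puts upcase_for('izmir', :tr)  # prints "İZMİR"
```

The same input yields different results, so the string alone does not carry enough information to pick the right mapping.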

Hi all,

I submitted a patch to fix the upcasing issue with 1.9 about a week
ago[1], but haven't gotten any followup yet. I saw today that there's
been some more work on this area, so my patch now conflicts with Rails
master.

If somebody has the time and inclination, could you let me know if
there's any interest in including my changes? In addition to resolving
the issue with upcasing on Ruby 1.9, I added an
ActiveSupport::Multibyte::Unicode module to contain the class methods
from ActiveSupport::Multibyte::Chars, and then moved in some related
functionality to the module for the sake of consistency.

I'm happy to resolve the conflicts to make the patch apply again, but
if people don't like the direction my refactoring went and don't want
to include the changes, then no problem, I'll just kill my branch[2]
and won't bother resolving the conflicts.

Either way, I think it would still be ideal to get a fix for the
upcasing issue before 3.0 is released.

Regards,

Norman

[1] https://rails.lighthouseapp.com/projects/8994/tickets/4595-stringmb_charsupcase-doesnt-upcase-non-ascii-chars-on-with-ruby-19x
[2] http://github.com/norman/rails/commit/f01dd100a7853e9bb5c7eb9097068ddb9ed1909d

Norman, take a look at the above link. It seems Jeremy is willing to accept your patch. Please rebase against master again.

Best regards,

Rodrigo.