mb_chars.upcase and Ruby 1.9.2

Rodrigo_Rosenfeld_R1 · May 8, 2010, 3:00am

I'm testing ruby-head through rvm but can't get 'ação'.mb_chars.upcase == 'AÇÃO'... I get 'AçãO' instead...

This happens both for Rails 2.3.5 and Rails 3 beta 3...

How can I get upcase to work correctly?

Thanks in advance,

Rodrigo.

Michael_Koziarski1 · May 8, 2010, 3:04am

In 1.9, mb_chars simply returns self, so this behaviour comes straight from Ruby core:

http://github.com/rails/rails/blob/master/activesupport/lib/active_support/core_ext/string/multibyte.rb#L53-65

Rodrigo_Rosenfeld_R1 · May 8, 2010, 4:03am

Is there any approach currently used for making the Ruby 1.8/Rails 2.3.5 behavior the same in Ruby 1.9?

This is important for virtually any non-English application... Are there any plans to integrate some library to achieve the same results that Rails currently supports?

Rodrigo.

Michael_Koziarski1 · May 8, 2010, 5:34am

My understanding is that Ruby 1.9 is meant to support all these operations internally. Our mb_chars functionality was only ever intended as a stop-gap until Ruby itself could do native multi-byte aware string operations. So what you're seeing are bugs in Ruby which should be fixed there; we probably shouldn't be maintaining a second multi-byte aware library.

Norman_Clarke1 · May 8, 2010, 12:57pm

Not a solution, and perhaps you're already aware of this, but as a workaround to these issues you can get an instance of ActiveSupport::Multibyte::Chars and perform the operations you need:

ActiveSupport::Multibyte::Chars.new("café").upcase

This lets you use the same methods that would be used on Ruby 1.8.

Regards,

Norman

Rodrigo_Rosenfeld_R1 · May 8, 2010, 3:24pm

Please, take a look at this documentation for String#upcase:

http://ruby-doc.org/ruby-1.9/classes/String.html#M000593

"Returns a copy of str with all lowercase letters replaced with their uppercase counterparts. The operation is locale insensitive—*only characters ``a’’ to ``z’’ are affected*. Note: case replacement is effective only in ASCII region."

It doesn't seem that Ruby 1.9 will change this behavior, so Rails should keep using its proxy approach for as long as Ruby doesn't support this itself.
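The ASCII-only behavior described in the quoted documentation can be reproduced on any Ruby version with String#tr. This is only a sketch: ascii_upcase is a hypothetical helper name, not something from Ruby or Rails.

```ruby
# Emulates the ASCII-only case mapping described in the quoted documentation:
# only "a" to "z" are replaced, every other character is left untouched.
def ascii_upcase(string)
  string.tr('a-z', 'A-Z')
end

word = "a\u00E7\u00E3o" # "ação", written with escapes to avoid source-encoding issues
puts ascii_upcase(word) # prints "AçãO"
```

The result matches what the original post reports for 'ação'.mb_chars.upcase on ruby-head.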

My guess is that mb_chars should be set on Rails initialization with something like:

def mb_chars
  self
end

String.send :include, StringMultiBytePatch unless 'ação'.upcase == 'AÇÃO'

Of course this is not the real code, just a suggestion of an approach... The StringMultiBytePatch module would override mb_chars to use the ActiveSupport::Multibyte::Chars proxy, as noted by Norman Clarke.
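A rough, self-contained sketch of that idea follows. TinyCharsProxy is a hypothetical stand-in for ActiveSupport::Multibyte::Chars that only knows the two non-ASCII letters used in this thread, so the example runs without Rails. One caveat: a module included into String does not override a method that String itself already defines, so this sketch picks the definition up front instead of including a module afterwards.

```ruby
# Hypothetical stand-in for ActiveSupport::Multibyte::Chars. It only handles
# "ç" and "ã"; a real proxy would need complete Unicode case tables.
class TinyCharsProxy
  EXTRA = { "\u00E7" => "\u00C7", "\u00E3" => "\u00C3" }.freeze # ç => Ç, ã => Ã

  def initialize(string)
    @string = string
  end

  # Upcases ASCII letters with tr, then maps the known non-ASCII letters.
  def upcase
    @string.tr('a-z', 'A-Z').gsub(/[\u00E7\u00E3]/, EXTRA)
  end
end

class String
  # Feature test from the post: does the native upcase handle "ação"?
  if "a\u00E7\u00E3o".upcase == "A\u00C7\u00C3O"
    # Native support is good enough, so no proxy is needed.
    def mb_chars
      self
    end
  else
    # Fall back to the proxy when the interpreter only maps ASCII.
    def mb_chars
      TinyCharsProxy.new(self)
    end
  end
end

puts "a\u00E7\u00E3o".mb_chars.upcase # prints "AÇÃO" on either branch
```

Testing for the capability rather than the Ruby version means the proxy is skipped automatically on any interpreter whose String#upcase already handles these letters.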

Please, see also this thread from 2008: http://old.nabble.com/String-upcase-downcase-with-UTF-8-strings-in-Ruby-1.9-td18372062.html

Rodrigo_Rosenfeld_R1 · May 8, 2010, 3:59pm

Hi Norman, while this seems to work with Rails 3 beta, it didn't work with Rails 2.3.5 in my tests...

Any idea why this behavior differs between 2.3.5 and 3?

Thanks,

Rodrigo.

Norman_Clarke1 · May 8, 2010, 5:31pm

I had been considering working on a patch to add a "light" proxy class for 1.9.x that uses some but not all of the methods in the proxy class for 1.8.

If it's true that there are no plans to add UTF-8 case-folding to Ruby 1.9 then I think it would be a good idea. I've been working on multibyte a bit lately and would be happy to work on it some more if folks think it would be useful.

There are also a couple of pedantic issues with AS's case folding, such as incomplete support for Greek and Turkic languages, that I'd like to fix. I'll look into it this week to see if maybe that would be worthwhile as well.

-Norman
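To illustrate why Turkic case mapping needs special handling: in Turkish, lowercase "i" uppercases to a dotted capital "İ", so the correct result depends on the language of the text. The sketch below assumes Ruby 2.4 or later, which postdates this thread and accepts a :turkic option on String#upcase. upcase_for and its language codes are hypothetical.

```ruby
# Requires Ruby 2.4 or later, where String#upcase accepts mapping options.
# The language argument is a hypothetical convention for this sketch.
def upcase_for(string, language)
  language == :tr ? string.upcase(:turkic) : string.upcase
end

puts upcase_for('i', :en) # prints "I"
puts upcase_for('i', :tr) # prints "İ" (U+0130, capital I with dot above)
```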

Norman_Clarke1 · May 8, 2010, 6:39pm

Interesting, I didn’t realize this was going to change in 1.9.2. While I sympathize with Matz for not wanting to step into the minefield that is case folding, I’m a bit disappointed. With no built-in support for that, or normalization, Ruby’s UTF-8 support is so weak that I find myself relying on AS more and more, even outside Rails apps.

I had considered working on a light multibyte proxy class for 1.9 when 1.9.1-p343 broke String#center and a few other methods, but decided against it when I saw it fixed in 1.9.2. AS’s case folding is a little lacking too, because it doesn’t implement case folding for Greek and Turkic as recommended for Unicode 5.1.

I’ve been hacking on multibyte quite a bit lately and would be happy to take a longer look if folks think it’s worthwhile.

-Norman

Norman_Clarke1 · May 8, 2010, 6:45pm

sympathize with Matz for not wanting to step into the minefield that is case

... Sorry for the double post. It looks like I accidentally sent an earlier draft from my phone.

Mateo_Murphy · May 8, 2010, 6:56pm

I'd say that developing this as part of the I18n gem or even standalone would be better than as part of rails, as it would be very useful outside of rails, and not everybody who uses rails would need this functionality.

Rodrigo_Rosenfeld_R1 · May 8, 2010, 7:31pm

I agree that writing this in I18n or a standalone library would probably be better because of your first argument, but not because of the last one...

Rails differs from Merb or Sinatra in that it is a full-stack framework. I believe multibyte support would be more useful for most people than REST support, for instance...

But since AS is also an independent library and could be used outside Rails too, I don't see any problems in patching String in AS... But I think it would be cleaner if it was an independent library that could be used inside I18n or AS gem...

Rodrigo.

Norman_Clarke1 · May 8, 2010, 8:02pm

These two libraries provide pretty good support for UTF-8 manipulation:

http://github.com/blackwinter/unicode
http://github.com/lang/unicode_utils

Yoshida Masato's is written in C and provides good performance, while Stefan Lang's is written in Ruby and also appears to provide support for proper UTF-8 case folding, so there's probably no need to duplicate the effort of adding that to AS; it should be easy enough to just implement proxy classes that use them, and make AS use them in place of its default proxy class:

ActiveSupport::Multibyte.proxy_class = PutativeUnicodeProxyClass
ActiveSupport::Multibyte.proxy_class = PutativeUnicodeUtilsProxyClass
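What such a pluggable proxy could look like, in miniature. Everything here is hypothetical: MiniMultibyte stands in for ActiveSupport::Multibyte, and AsciiBackend stands in for a real library. It assumes the chosen library exposes module-level upcase(string) and downcase(string) methods, which should be checked against each gem before relying on it.

```ruby
# Hypothetical miniature of a registry with a swappable proxy class.
module MiniMultibyte
  class << self
    attr_accessor :proxy_class
  end

  # Wraps a string in whichever proxy class is currently configured.
  def self.chars(string)
    proxy_class.new(string)
  end
end

# Base proxy: forwards case mapping to a backend module that responds to
# upcase(string) and downcase(string).
class BackendProxy
  def self.backend
    raise NotImplementedError, 'subclasses must name a backend'
  end

  def initialize(string)
    @string = string
  end

  def upcase
    self.class.new(self.class.backend.upcase(@string))
  end

  def downcase
    self.class.new(self.class.backend.downcase(@string))
  end

  def to_s
    @string
  end
end

# Stand-in backend so the sketch runs without any gems; it maps ASCII only.
module AsciiBackend
  def self.upcase(string)
    string.tr('a-z', 'A-Z')
  end

  def self.downcase(string)
    string.tr('A-Z', 'a-z')
  end
end

class AsciiProxy < BackendProxy
  def self.backend
    AsciiBackend
  end
end

MiniMultibyte.proxy_class = AsciiProxy
puts MiniMultibyte.chars('cafe').upcase.to_s # prints "CAFE"
```

Swapping libraries then means writing one small subclass per backend and changing a single assignment.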

But I do think that Rails should still provide decent support for case folding, and the behavior of commonly-used things like #upcase and #downcase should not change so dramatically when you use Ruby 1.9 vs 1.8. It would be pretty simple to extract some methods from Multibyte::Chars into a module that can be shared between the current feature-rich proxy class for 1.8 and a thinner one for 1.9.

-Norman

Rodrigo_Rosenfeld_R1 · May 8, 2010, 9:00pm

Agreed. Is it possible in Bundler to add a dependency on either the unicode or the unicode_utils gem? It should work like script/server in Rails 2: if it finds Mongrel, it uses it; otherwise, it uses WEBrick. If the faster C implementation is available, use it; else try the pure Ruby alternative. Is that possible?

Rodrigo.
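Setting Bundler aside, the "use the fast one if present, otherwise fall back" behavior can be sketched at load time with plain require and rescue. first_available_library is a hypothetical helper; the library names are the ones mentioned in this thread, and whether either is installed depends on the environment.

```ruby
# Tries each library name in order and returns the first one that loads,
# or nil when none of them is installed.
def first_available_library(candidates)
  candidates.find do |name|
    begin
      require name
      true
    rescue LoadError
      false
    end
  end
end

# Prefer the C extension, then the pure Ruby implementation.
backend = first_available_library(%w[unicode unicode_utils])
puts backend ? "using #{backend}" : 'no Unicode library installed'
```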

Manfred_Stienstra · May 10, 2010, 7:12am

AS::Multibyte currently implements two things: encoding aware string operations and Unicode algorithms. 1.9 only implements encoding aware string operations. We could activate the proxy with the Unicode operations for 1.9, that should solve most people's problems.
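A small sketch of the difference between the two. Counting and reversing characters only needs encoding awareness, while case mapping needs Unicode tables. Both helper names are hypothetical, and the ASCII-only upcase emulates the 1.9 behavior reported in this thread rather than calling String#upcase.

```ruby
# Encoding-aware string operations: these work on characters, not bytes.
def character_summary(string)
  { length: string.length, bytesize: string.bytesize, reversed: string.reverse }
end

# A Unicode algorithm: case mapping. This stand-in maps ASCII only, which is
# the behavior the thread reports for String#upcase on 1.9.
def ascii_only_upcase(string)
  string.tr('a-z', 'A-Z')
end

word = "a\u00E7\u00E3o" # "ação"
summary = character_summary(word)
puts summary[:length]        # prints 4
puts summary[:bytesize]      # prints 6
puts summary[:reversed]      # prints "oãça"
puts ascii_only_upcase(word) # prints "AçãO"
```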

I don't really like the idea of depending on external libraries for this kind of functionality because the most used algorithms are already defined in Multibyte.

Manfred

Norman_Clarke1 · May 10, 2010, 11:55am

I agree. I was thinking more about implementing proxy classes for them in a separate library that people could use, for example, if they needed either the high performance of the library written in C, or the proper case-folding for Greek and Turkic that the other one provides.

-Norman

Manfred_Stienstra · May 11, 2010, 7:30am

That's more or less how it is right now. The C implementation is called Unichars: http://github.com/Manfred/unichars.

NARUSE_Yui · May 13, 2010, 7:54am

"Interesting, I didn't realize this was going to change in 1.9.2."

1.9.2's features are already frozen, and it doesn't have such Unicode utilities.

We in ruby-core know about the need for Unicode utilities and have had some discussion about it, but we can't agree on the spec and implementation. I think it needs more time.

"While I sympathize with Matz for not wanting to step into the minefield that is case folding, I'm a bit disappointed. With no built-in support for that, or normalization, Ruby's UTF-8 support is so weak that I find myself relying on AS more and more, even outside Rails apps.

I had considered working on a light multibyte proxy class for 1.9 when 1.9.1-p343 broke String#center and a few other methods, but decided against it when I saw it fixed in 1.9.2. AS's case folding is a little lacking too, because it doesn't implement case folding for Greek and Turkic as recommended for Unicode 5.1. I've been hacking on multibyte quite a bit lately and would be happy to take a longer look if folks think it's worthwhile."

FYI: if you implement case folding for Greek and Turkic, the string (or something) needs language information. Selecting font, calculating width,

Norman_Clarke1 · May 21, 2010, 5:30pm

Hi all,

I submitted a patch to fix the upcasing issue with 1.9 about a week ago[1], but haven't gotten any followup yet. I saw today that there's been some more work on this area, so my patch now conflicts with Rails master.

If somebody has the time and inclination, could you let me know if there's any interest in including my changes? In addition to resolving the issue with upcasing on Ruby 1.9, I added an ActiveSupport::Multibyte::Unicode module to contain the class methods from ActiveSupport::Multibyte::Chars, and then moved in some related functionality to the module for the sake of consistency.

I'm happy to resolve the conflicts to make the patch apply again, but if people don't like the direction my refactoring went and don't want to include the changes, then no problem, I'll just kill my branch[2] and won't bother resolving the conflicts.

Either way, I think it would still be ideal to get a fix for the upcasing issue before 3.0 is released.

Regards,

Norman

[1] #4595 String.mb_chars.upcase doesn't upcase non-ASCII chars on with Ruby 1.9.x - Ruby on Rails - rails
[2] http://github.com/norman/rails/commit/f01dd100a7853e9bb5c7eb9097068ddb9ed1909d

Rodrigo_Rosenfeld_R1 · May 21, 2010, 5:40pm

Norman, take a look at the above link. It seems Jeremy is willing to accept your patch. Please rebase against master again.

Best regards,

Rodrigo.
