feedback on a few ActiveSupport::Multibyte patches

Hi all,

In response to Rodrigo Rosas's message about mb_chars.upcase not giving the expected result on 1.9, I've done some work in a fork to make String#mb_chars always return an instance of a proxy class, both with Ruby 1.8 and Ruby 1.9. The end result of the patch is (hopefully) to make Rails' multibyte functionality behave the same way in 1.8.7 and 1.9.x.

http://github.com/norman/rails/tree/multibyte

Basically, the problem is that with current edge Rails and 1.9.x, `"café".mb_chars.upcase` will return "CAFé" rather than the expected "CAFÉ".

In my changes, the proxy class leaves some methods undefined for 1.9 because they have a native equivalent, but redefines a few others because either they are buggy or, like String#upcase, don't have the same behavior as AS::Multibyte::Chars.
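To make the idea concrete, here is a minimal sketch of a proxy that forwards most calls to the wrapped String but overrides `upcase`. It is not the code from the branch: `CharsProxy` and `UPCASE_MAP` are hypothetical names, the three-entry table stands in for real Unicode case data, and `upcase(:ascii)` (available in Ruby 2.4 and later) is used only to imitate the ASCII-only behavior of 1.9's native `String#upcase`.

```ruby
# Hypothetical sketch of the proxy approach; not the code from the branch.
class CharsProxy
  # Tiny illustrative table. Real code would consult the Unicode database.
  UPCASE_MAP = { "é" => "É", "ç" => "Ç", "ã" => "Ã" }.freeze

  def initialize(string)
    @wrapped = string
  end

  # Redefined: upcase the ASCII letters natively (this mimics what 1.9's
  # String#upcase does), then map the remaining non-ASCII characters.
  def upcase
    ascii_upcased = @wrapped.upcase(:ascii)
    self.class.new(ascii_upcased.gsub(/[^[:ascii:]]/) { |c| UPCASE_MAP.fetch(c, c) })
  end

  def to_s
    @wrapped
  end
  alias to_str to_s

  # Left undefined here: everything else falls through to the native String.
  # String results are re-wrapped so calls can be chained on the proxy.
  def method_missing(name, *args, &block)
    result = @wrapped.public_send(name, *args, &block)
    result.is_a?(String) ? self.class.new(result) : result
  end

  def respond_to_missing?(name, include_private = false)
    @wrapped.respond_to?(name, include_private) || super
  end
end

puts "café".upcase(:ascii)                 # the 1.9-style result: CAFé
puts CharsProxy.new("café").upcase.to_s    # CAFÉ
puts CharsProxy.new("café").reverse.upcase.to_s
```

The point of the design is that only the methods with missing or different native behavior need a definition on the proxy; the rest cost nothing beyond the delegation.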

Additionally, I refactored all of the Unicode support in ActiveSupport into a new module, ActiveSupport::Multibyte::Unicode. This makes some useful functionality like UTF-8 normalization/composition/decomposition easier to reuse since it's no longer bound to the ActiveSupport::Multibyte::Chars class.
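For readers unfamiliar with what normalization, composition and decomposition mean in practice, here is a small illustration. It uses Ruby's built-in `String#unicode_normalize` (Ruby 2.2 and later) rather than the new ActiveSupport::Multibyte::Unicode module, whose method names I am not assuming here.

```ruby
# Illustration only: uses Ruby's built-in String#unicode_normalize,
# not the ActiveSupport::Multibyte::Unicode module from the branch.
composed   = "caf\u00E9"                        # "é" as one code point
decomposed = composed.unicode_normalize(:nfd)   # "e" followed by U+0301

puts composed.codepoints.length     # 4
puts decomposed.codepoints.length   # 5

# The two forms look identical on screen but are different strings...
puts composed == decomposed         # false

# ...until both are brought to the same normalization form.
puts composed == decomposed.unicode_normalize(:nfc)   # true
```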

I'd be very grateful for any feedback.

Regards,

Norman

Norman, I checked out your multibyte branch but it is not working for me. Here is what I did:

    $ cd ~/src/rails
    $ git remote add norman http://github.com/norman/rails.git
    $ git remote update
    $ git checkout norman/multibyte -b multibyte
    $ rvm ruby-head
    $ gem install thor bundle
    $ ruby bin/rails ~/temp/multibyte --dev
    $ cd ~/temp/multibyte
    $ script/rails c
    > 'ação'.mb_chars.upcase # yields 'AO' instead of 'AÇÃO'
    > 'ação'.mb_chars.class # yields ActiveSupport::Multibyte::Chars - OK

Any ideas?

Also, from the diffs between master and your branch I could see that there is a lot of multibyte code in ActiveSupport. Maybe this could be put in an external gem on which AS would depend. It would make AS cleaner and it would allow testing other gems as proxies... For instance, when running on JRuby, it would probably be better to have a different approach, since strings in Java are Unicode and String#toUpperCase() would already give the expected results... Any thoughts?

Thank you for your effort on fixing this multibyte issue for Ruby 1.9 on Rails,

Rodrigo.

Norman, I checked out your multibyte branch but it is not working for me. Here is what I did: <...> Any ideas?

No, not off the top of my head. But I'll retrace your steps and see if I get the same problems. Thanks for looking into it and getting back to me with your detailed feedback. :)

Also, from the diffs between master and your branch I could see that there is a lot of multibyte code in ActiveSupport. Maybe this could be put in an external gem on which AS would depend. It would make AS cleaner and it would allow testing other gems as proxies... For instance, when running on JRuby, it would probably be better to have a different approach, since strings in Java are Unicode and String#toUpperCase() would already give the expected results... Any thoughts?

I don't think there's "a lot" of multibyte code in ActiveSupport; it's around 1000 lines, or roughly twice the size of inflector. Maintaining it in a separate gem would mean more project management overhead for something that doesn't usually see a lot of developer activity and is going to be required anyway. Also, it's very easy to write your own proxy classes if you want, for example, to use one that relies on Java's native string handling for JRuby. I wouldn't be opposed if the Rails team wanted to do that, but I just don't see any significant benefit.

-Norman

I just checked this out and it is working correctly for me. I'm not sure where things are going wrong for you, but I'm unable to reproduce your problem. Here's more or less what I just did:

    cd ~/work/rails
    git checkout master
    git pull origin master
    git checkout multibyte
    git rebase master
    cd activesupport
    rvm ruby-head
    rake test # this pukes because of recent changes to String
    rvm 1.9.2
    rake test # segfault
    rvm 1.9.1
    rake test # ok, all tests pass.
    cd ..
    ruby bin/rails /tmp/mb --dev
    cd /tmp/mb

Now create temp.rb with the following contents:

    # encoding: utf-8
    puts 'ação'.mb_chars.upcase

    ruby script/rails runner temp.rb # works
    rvm ruby-head
    bundle install
    ruby script/rails runner temp.rb # also works
    rvm ree
    ruby script/rails runner temp.rb # also works

These are the Rubies I have installed (I'm on 64-bit Snow Leopard):

$ rvm list

rvm Rubies

   jruby-1.4.0 [ [x86_64-java] ]
   ree-1.8.7-2010.01 [ x86_64 ]
   ruby-1.8.6-p399 [ x86_64 ]
   ruby-1.9.1-p243 [ x86_64 ]
   ruby-1.9.1-p378 [ x86_64 ]
   ruby-1.9.2-preview1 [ x86_64 ]
=> ruby-head [ x86_64 ]

System Ruby

   system [ x86_64 i386 ppc ]

-Norman

On 12-05-2010 13:02, Norman Clarke wrote:

Using this approach (a runner with a file specifying the encoding), your branch works for me at work too.

But at home, I can run 'ação'.mb_chars.upcase in the Rails console and it works too. At work, 'ação'.mb_chars yields 'ao'. Any idea why the two environments are not consistent?

Thanks,

Rodrigo.

If you're trying it on the console, it's probably a difference in the way your consoles are set up to handle UTF-8 characters. I think the only really reliable way to test this is by putting the text in a file.
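If you want to check whether the console is the culprit, one rough way is to look at what actually reached Ruby. The sketch below is only a diagnostic aid; `describe_string` is a hypothetical helper name, and the "stripped" string simply reproduces the 'ao' result reported above rather than any particular console's behavior.

```ruby
# Hypothetical diagnostic helper: shows what Ruby actually received.
def describe_string(str)
  {
    encoding: str.encoding.name,
    valid:    str.valid_encoding?,
    chars:    str.chars.length,
    bytes:    str.bytes.map { |b| format("%02x", b) }.join(" ")
  }
end

# Written with escapes so the result does not depend on how this
# text was typed or pasted.
intact   = "a\u00E7\u00E3o"   # 'ação'
stripped = "ao"               # what a console that drops non-ASCII input leaves

p describe_string(intact)     # bytes: "61 c3 a7 c3 a3 6f", 4 characters
p describe_string(stripped)   # bytes: "61 6f", 2 characters

# The encoding Ruby assumes for external input in this environment.
p Encoding.default_external
```

Running the helper on the string typed at each console should show whether the 'ç' and 'ã' bytes ever arrived; if they did not, the problem is in the terminal or console setup rather than in mb_chars.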