feedback on a few ActiveSupport::Multibyte patches

Hi all,

In response to Rodrigo Rosas's message about mb_chars.upcase not
giving the expected result on 1.9, I've done some work in a fork to
make String#mb_chars always return an instance of a proxy class, both
with Ruby 1.8 and Ruby 1.9. The end result of the patch is
(hopefully) to make Rails' multibyte functionality behave the same way
in 1.8.7 and 1.9.x.

http://github.com/norman/rails/tree/multibyte

Basically, the problem is that with current edge Rails and 1.9.x,
`"café".mb_chars.upcase` will return "CAFé" rather than the expected
"CAFÉ".

In my changes, the proxy class leaves some methods undefined for 1.9
because they have a native equivalent, but redefines a few others
because either they are buggy or, like String#upcase, don't have the
same behavior as AS::Multibyte::Chars.
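
To make the idea concrete, here is a minimal sketch of such a proxy. It is not the code from the branch: `CharsSketch` is a hypothetical name, and the sketch assumes Ruby 2.4 or later, where `String#upcase` is Unicode-aware and `upcase(:ascii)` reproduces the ASCII-only behavior seen on 1.9.

```ruby
# encoding: utf-8

# Hypothetical stand-in for the proxy returned by String#mb_chars.
# It is NOT the real ActiveSupport::Multibyte::Chars.
class CharsSketch
  def initialize(string)
    @wrapped = string
  end

  # Redefined explicitly, because on 1.9 the native method only
  # handled ASCII. Here we lean on the Unicode-aware upcase of
  # Ruby 2.4+ (an assumption of this sketch).
  def upcase
    self.class.new(@wrapped.upcase)
  end

  def to_s
    @wrapped
  end

  # Methods left undefined on the proxy fall through to the native
  # String implementation; String results are re-wrapped.
  def method_missing(name, *args, &block)
    return super unless @wrapped.respond_to?(name)
    result = @wrapped.public_send(name, *args, &block)
    result.is_a?(String) ? self.class.new(result) : result
  end

  def respond_to_missing?(name, include_private = false)
    @wrapped.respond_to?(name, include_private) || super
  end
end

ascii_only = "café".upcase(:ascii)                # mimics the 1.9 result
proxied    = CharsSketch.new("café").upcase.to_s  # goes through the proxy

puts ascii_only
puts proxied
```

Only `upcase` is overridden here; `length`, `reverse` and the rest are delegated, which mirrors the "leave undefined where a native equivalent exists" approach described above.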

Additionally, I refactored all of the Unicode support in ActiveSupport
into a new module, ActiveSupport::Multibyte::Unicode. This makes some
useful functionality like UTF-8
normalization/composition/decomposition easier to reuse since it's no
longer bound to the ActiveSupport::Multibyte::Chars class.
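
For readers unfamiliar with composition and decomposition, the following sketch shows what the two forms look like. It uses Ruby's built-in `String#unicode_normalize` (available since Ruby 2.2) rather than the `ActiveSupport::Multibyte::Unicode` module from the branch, whose exact method names are not shown in this thread.

```ruby
# encoding: utf-8

composed   = "caf\u00E9"                        # "é" as one code point
decomposed = composed.unicode_normalize(:nfd)   # "e" plus combining accent
recomposed = decomposed.unicode_normalize(:nfc) # back to the single code point

puts composed.codepoints.inspect
puts decomposed.codepoints.inspect
puts recomposed == composed
```

The two strings render identically but compare as different until they are normalized to the same form, which is why having the normalization code reusable outside the proxy class is handy.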

I'd be very grateful for any feedback.

Regards,

Norman

Norman, I checked out your multibyte branch but it is not working for me. Here is what I did:

$ cd ~/src/rails
$ git remote add norman http://github.com/norman/rails.git
$ git remote update
$ git checkout norman/multibyte -b multibyte
$ rvm ruby-head
$ gem install thor bundle
$ ruby bin/rails ~/temp/multibyte --dev
$ cd ~/temp/multibyte
$ script/rails c
> 'ação'.mb_chars.upcase # yields 'AO' instead of 'AÇÃO'
> 'ação'.mb_chars.class # yields ActiveSupport::Multibyte::Chars - OK

Any ideas?

Also, from the diffs between master and your branch I noticed that there is a lot of multibyte code in ActiveSupport. Maybe this could be moved into an external gem that AS would depend on. It would make AS cleaner and it would allow testing other gems as proxies... For instance, when running on JRuby, it would probably be better to take a different approach, since strings in Java are Unicode and String#toUpperCase() would already give the expected results... Any thoughts?

Thank you for your work on fixing this multibyte issue for Rails on Ruby 1.9,

Rodrigo.

Norman, I checked out your multibyte branch but it is not working for me.
Here is what I did:
<...>
Any ideas?

No, not off the top of my head. But I'll retrace your steps and see if
I get the same problems. Thanks for looking into it and getting back
to me with your detailed feedback. :)

Also, from the diffs between master and your branch I noticed that
there is a lot of multibyte code in ActiveSupport. Maybe this could be moved
into an external gem that AS would depend on. It would make AS cleaner and
it would allow testing other gems as proxies... For instance, when running
on JRuby, it would probably be better to take a different approach, since
strings in Java are Unicode and String#toUpperCase() would already give the
expected results... Any thoughts?

I don't think there's "a lot" of multibyte code in ActiveSupport; it's
around 1000 lines, or roughly twice the size of inflector. Maintaining
it in a separate gem would mean more project management overhead for
something that doesn't usually see a lot of developer activity and is
going to be required anyway. Also, it's very easy to write your own
proxy class if you want, for example, to use one that relies on
Java's native string handling for JRuby. I wouldn't be opposed if the
Rails team wanted to do that, but I just don't see any significant
benefit.
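
As a rough illustration of how little code a swappable proxy needs, here is a sketch under stated assumptions. `MultibyteSketch`, `NativeChars` and `AsciiOnlyChars` are hypothetical names invented for this example; they are not the ActiveSupport API, and a real JRuby proxy would call into Java's string handling instead. It assumes Ruby 2.4+ for `upcase(:ascii)`.

```ruby
# encoding: utf-8

# Hypothetical registry with a swappable proxy class.
module MultibyteSketch
  class << self
    attr_accessor :proxy_class
  end

  # Wraps a string in whichever proxy is currently configured.
  def self.chars(string)
    proxy_class.new(string)
  end

  # Proxy that relies on the platform's own Unicode-aware String
  # methods (the role a Java-backed proxy would play on JRuby).
  class NativeChars
    def initialize(string)
      @wrapped = string
    end

    def upcase
      self.class.new(@wrapped.upcase)
    end

    def to_s
      @wrapped
    end
  end

  # Alternative proxy, deliberately ASCII-only, just to show that
  # swapping the class changes the behavior.
  class AsciiOnlyChars < NativeChars
    def upcase
      self.class.new(@wrapped.upcase(:ascii))
    end
  end

  self.proxy_class = NativeChars
end

with_native = MultibyteSketch.chars("ação").upcase.to_s

MultibyteSketch.proxy_class = MultibyteSketch::AsciiOnlyChars
with_ascii = MultibyteSketch.chars("ação").upcase.to_s

puts with_native
puts with_ascii
```

The point of the design is that callers only ever go through one entry point, so a platform-specific proxy can be dropped in without touching them.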

-Norman

I just checked this out and it is working correctly for me. I'm not
sure where things are going wrong for you, but I'm unable to reproduce
your problem. Here's more or less what I just did:

cd ~/work/rails
git checkout master
git pull origin master
git checkout multibyte
git rebase master
cd activesupport
rvm ruby-head
rake test # this pukes because of recent changes to String
rvm 1.9.2
rake test # segfault
rvm 1.9.1
rake test # ok, all tests pass.
cd ..
ruby bin/rails /tmp/mb --dev
cd /tmp/mb

Now create temp.rb with the following contents:
# encoding: utf-8
puts 'ação'.mb_chars.upcase

ruby script/rails runner temp.rb # works
rvm ruby-head
bundle install
ruby script/rails runner temp.rb # also works
rvm ree
ruby script/rails runner temp.rb # also works

These are the Rubies I have installed (I'm on 64-bit Snow Leopard)

$ rvm list

rvm Rubies

   jruby-1.4.0 [ [x86_64-java] ]
   ree-1.8.7-2010.01 [ x86_64 ]
   ruby-1.8.6-p399 [ x86_64 ]
   ruby-1.9.1-p243 [ x86_64 ]
   ruby-1.9.1-p378 [ x86_64 ]
   ruby-1.9.2-preview1 [ x86_64 ]
=> ruby-head [ x86_64 ]

System Ruby

   system [ x86_64 i386 ppc ]

-Norman

On 12-05-2010 13:02, Norman Clarke wrote:

Using this approach (a runner with a file that specifies the encoding), your branch works at work too.

But at home, I can run 'ação'.mb_chars.upcase in the Rails console and it works too. At work, 'ação'.mb_chars yields 'ao'. Any idea why the two environments behave differently?

Thanks,

Rodrigo.

If you're trying it on the console, it's probably a difference in the
way your consoles are set up to handle UTF-8 characters. I think the
only really reliable way to test this is by putting the text in a
file.
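
If you want to compare the two machines, a small script along these lines might help. It is only a diagnostic sketch: `describe_string` is a helper made up for this example, and the output will depend on each machine's locale settings.

```ruby
# encoding: utf-8

# Hypothetical helper: summarizes how Ruby sees a string.
def describe_string(str)
  {
    encoding:   str.encoding.name,
    valid:      str.valid_encoding?,
    chars:      str.length,
    bytes:      str.bytesize,
    ascii_only: str.ascii_only?
  }
end

sample = "ação"
info = describe_string(sample)

puts info.inspect
puts "default_external: #{Encoding.default_external}"
puts "LANG: #{ENV['LANG'].inspect}"
```

Run from a file, the sample should report four characters and six bytes. If the same literal typed into the console reports two characters, the accented characters were dropped on input, before any multibyte code ever saw them.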