Multi Byte Strings

Hey guys,

We've been talking about the multi-byte patch and I think it's time to get feedback from you guys on a possible way forward.

We can include ActiveSupport::Multibyte with Rails 1.2, and update all of the relevant helpers to use the String#chars proxy. This will mean that none of the Action View helpers will mangle multibyte strings. Similarly, if any Strings are being mangled in ActiveRecord or anywhere else, we'll accept patches to fix them.
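To make "mangling" concrete, here is a minimal sketch of the idea behind a character-aware proxy, assuming UTF-8 input. CharsSketch and its methods are hypothetical names for illustration, not the ActiveSupport::Multibyte API. On a modern Ruby, String is already encoding-aware, so byteslice stands in here for byte-oriented slicing.

```ruby
# Hypothetical stand-in for a String#chars-style proxy (not the real ActiveSupport code).
class CharsSketch
  def initialize(string)
    @string = string
  end

  # Number of UTF-8 codepoints rather than number of bytes.
  def length
    @string.unpack('U*').length
  end

  # First n characters, never cutting a multibyte sequence in half.
  def first(n)
    @string.unpack('U*').first(n).pack('U*')
  end
end

text = "caf\u00e9 \u2620"            # "cafe" with an accented e, plus a skull and crossbones

byte_cut = text.byteslice(0, 4)      # byte-oriented: splits the 2-byte accented e
char_cut = CharsSketch.new(text).first(4)

puts text.bytesize                   # size in bytes
puts CharsSketch.new(text).length    # size in characters
puts byte_cut.valid_encoding?        # whether the byte cut is still valid UTF-8
puts char_cut
```

A helper that truncates by bytes can leave half a character behind, which is the kind of bug the per-helper patches would need to catch.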

However, any iconv conversions in ActionPack and any database encoding changes will be left to plugin authors. Perhaps by the time Rails 2.0 comes around, these plugins will have gained critical mass and best practices will have emerged, letting us add them to the core. Similarly, encodings other than UTF-8 can be provided by plugins.

Comments?

If Rails is going this direction, I think this is the right way to approach it.

Can't David just get the world to conform to using English?

He's king of the internet, not the world.

Sorry, I can never go back to ASCII again. I need my smiley faces, skulls and wheelchair glyphs!

Manfred

Even English is a lot nicer with proper punctuation, math symbols, 44 different star symbols and let's not forget Skull and Crossbones: ☠

Kind regards, Thijs


> Even English is a lot nicer with proper punctuation, math symbols, 44 different star symbols and let's not forget Skull and Crossbones: ☠

Yeah, there's no way I'm building apps without the Skull and Crossbones.

Ok, no one seemed to object to our plan, so it's now 'the plan'. I'll merge ActiveSupport::Multibyte this evening some time. After that we'll take individual patches for all the cases in the other components which mangle multibyte strings.

To make it easier to merge, each of those patches should:

* Address one helper / bug
* Contain unit tests
* Have a keyword of multibytebug

Then I'll make an effort to review and merge them quickly. Also, get the word out that this is the last chance to object or provide feedback on our multi byte plan.

I assume you would not be opposed to a patch that checks for JRuby and defers unicode processing to our built-in support, right? It wouldn't cause any additional overhead for MRI, but could make unicode string processing quite a bit faster under JRuby.

I'm not sure what such a patch would look like, but I wanted to know that it wouldn't be rejected offhand if we came up with something.

Might this be better done as a plugin or similar?

-- tim lucas

> I'm not sure what such a patch would look like, but I wanted to know that it wouldn't be rejected offhand if we came up with something.

Sure, assuming it's elegant and low/no overhead, I can't see anything wrong with that.

Assuming it's possible to do in a plugin, that is. Ideally, though, I don't want JRubyists to be penalized because Ruby can't support unicode natively, which is what the requirement to install an additional plugin basically amounts to.

If we can provide the Chars implementation through native means, all it would require is calling JRuby's built-in code rather than deferring to the pure Ruby version, and everything else should remain the same.

All hypothetical at the moment, but we'll hopefully get something concrete out soon.

The chars accessor is added to the string in a core_ext. You can easily replace the chars method on String for JRuby support.

   class String
     def chars; self; end
   end

The biggest problem with this is that you will not be able to use the normalization routines. Another solution would be to register a new backend handler with ActiveSupport::Multibyte which goes something like this:

   ActiveSupport::Multibyte::Chars.handler = MyHandler

Registering the JRuby string methods as a handler would impose some method-calling overhead, though.
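Roughly, a swappable backend could look like the sketch below. This is an illustration only: HandlerChars, PureRubyHandler and NativeHandler are hypothetical names, and the real ActiveSupport::Multibyte handler interface may differ. It does show where the extra method call per operation comes from.

```ruby
# Hypothetical pure-Ruby backend working on UTF-8 codepoints.
module PureRubyHandler
  def self.reverse(str)
    str.unpack('U*').reverse.pack('U*')
  end

  def self.size(str)
    str.unpack('U*').size
  end
end

# Hypothetical backend standing in for a platform-provided implementation.
module NativeHandler
  def self.reverse(str)
    str.reverse
  end

  def self.size(str)
    str.length
  end
end

# Proxy that forwards every operation to whichever handler is registered.
class HandlerChars
  class << self
    attr_accessor :handler
  end
  self.handler = PureRubyHandler

  def initialize(string)
    @string = string
  end

  def reverse
    self.class.handler.reverse(@string)   # one extra call per operation
  end

  def size
    self.class.handler.size(@string)
  end
end

# Swapping the backend needs no change to the calling code.
HandlerChars.handler = NativeHandler
puts HandlerChars.new("\u65e5\u672c\u8a9e").size
```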

Manfred

I'd be willing to bet a pure-Java implementation of the handler methods would more than make up for the overhead. Thanks for the tips!

You'll be on the safe side if you implement it as a Handler for Multibyte; you can do it later, when the codebases are merged.

> If we can provide the Chars implementation through native means, all it would require is calling JRuby's built-in code rather than deferring to the pure Ruby version, and everything else should remain the same.
>
> All hypothetical at the moment, but we'll hopefully get something concrete out soon.

We need to make this work as an extension that's part of running JRuby. We can't have JRuby-specific conditionals in Rails proper. That's just opening the gates to hell. Let's definitely figure out how to use Ruby's dynamic nature to make this work, even if that means changing the code to make it easier to overwrite, à la the handler suggestion.

I agree, and I never meant that it should "if jruby do something". That would accomplish nothing for other implementations as they begin to build out their own support for Unicode...including MRI. It would be more along the lines of checking for an existing String#chars implementation or an existing Handler (something less unpleasant than a factory pattern, hopefully) and deferring to that implementation instead.
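As a rough sketch of what "check for an existing implementation and defer to it" could mean, with no platform conditionals involved: install_fallback and the *_sketch method names below are hypothetical, and the "native" method is only simulated for the demo.

```ruby
# Hypothetical helper: install a pure-Ruby fallback only when the runtime
# has not already provided a method of that name.
def install_fallback(klass, name, &fallback)
  return :deferred if klass.method_defined?(name)
  klass.send(:define_method, name, &fallback)
  :installed
end

# Simulate a runtime that ships its own implementation (assumption for the demo).
class String
  def native_reverse_sketch
    reverse
  end
end

# Already provided by the "runtime", so the fallback is skipped.
first = install_fallback(String, :native_reverse_sketch) do
  unpack('U*').reverse.pack('U*')
end

# Not provided, so the pure-Ruby version is installed.
second = install_fallback(String, :pure_reverse_sketch) do
  unpack('U*').reverse.pack('U*')
end

puts first
puts second
```

The check is on the method, not on the interpreter, so any implementation that brings its own support would be picked up the same way.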

I'm just looking for a way for JRuby's Unicode capabilities to be fully leveraged by Rails without requiring plugins or hacks. I think the suggestions on this list can be made to work with what we have in JRuby. In fact, I'll try to come up with a Chars-compatible interface today, to see how easily it maps to Java's String.

Let me know if you need any help.

Manfred

> Ok, no one seemed to object to our plan, so it's now 'the plan'. I'll merge ActiveSupport::Multibyte this evening some time. After that we'll take individual patches for all the cases in the other components which mangle multibyte strings.
>
> To make it easier to merge, each of those patches should:
>
> * Address one helper / bug
> * Contain unit tests
> * Have a keyword of multibytebug

This has now been applied:

http://dev.rubyonrails.org/changeset/5223

Please get working on test cases and bug fixes for the parts of Rails which mangle multibyte strings.

Thanks to Julian Tarkhanov, Manfred Stienstra & Jan Behrens for their work.

Michael Koziarski wrote:

> Ok, no one seemed to object to our plan, it's now 'the plan'. I'll
> merge ActiveSupport::Multibyte this evening some time. After that
> we'll take individual patches for all the cases in the other
> components which mangle multibyte strings.
>
> This has now been applied:
>
> http://dev.rubyonrails.org/changeset/5223

I've created a short movie to show some of the features of the chars accessor. Might be a nice introduction to the world of multibyte safeness in Ruby (:

http://www.fngtps.com/2006/10/activesupport-multibyte

Thanks, Manfred

We find this totally awesome, thx Michael

Hi,

I just encountered my first multibyte problem with Rails <= 1.1; I guess I have been lucky. I am just wondering whether ActiveSupport::Multibyte fixes this specific case.

   render_text _("Rename selected %s to `%s' now.") % [params[:item_type], params['name']]

The two params['xx'] values contain Japanese strings and are displayed as:

%u65D7%u9F13

But since I never had this problem before, I doubt the problem is actually the % operator, though it may be. I think it is important to note that this is happening through Ajax, which might be causing the problem as well.

Would this work under Rails 1.2 with ActiveSupport::Multibyte, and is it possible to install ActiveSupport::Multibyte on Rails 1.1.6?

thanks
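For what it's worth, the %u65D7%u9F13 output above has the same %uXXXX shape that JavaScript's legacy escape() function produces for non-Latin characters, which would fit the Ajax theory rather than the % operator. A minimal sketch of decoding it on the Ruby side, assuming the input only contains four-digit %uXXXX escapes; decode_u_escapes is a hypothetical helper, not part of Rails.

```ruby
# Hypothetical helper: turn %uXXXX escapes back into UTF-8 characters.
def decode_u_escapes(str)
  str.gsub(/%u([0-9A-Fa-f]{4})/) { [$1.hex].pack('U') }
end

decoded = decode_u_escapes("%u65D7%u9F13")
puts decoded
puts decoded.unpack('U*').length   # number of characters after decoding
```

This only undoes the escaping after the fact; whether the client should be sending the value in this form in the first place is a separate question.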