Ruby 1.9 + Rails 2.3.5 + UTF8 support a dead end?

Hi,

I was just curious to know what the status of UTF support in 2.3.5 + Ruby 1.9 was - from the core developers point of view.

Here are the impressions I have:

  - everyone is working on Rails 3, so no one really cares about 2.3, except for serious bugs and security issues - and it looks like 1.9 issues don't count.

  - 2.3.6 still won't have decent UTF8 support 'out of the box' for Ruby 1.9, because it probably won't include the UTF-8 patches already in LH

  - 2-3-stable branch doesn't have the patches for 1.9 + UTF to work

  - since no one seriously considers ruby 1.9 ready for production, nobody is going to spend time merging patches for 1.9 encoding support, so sending patches is a waste of time

As many articles and posts suggest, many people have been bitten by all sorts of encoding problems with ruby 1.9. Of course LH has many patches and solutions, but the monkey-patching required to work around the problems is way too much for any rails newbie IMHO.

The trigger to write was an issue I currently have with ERB templates where rails is trying to concat together a UTF8 <label> with an ASCII <textarea> (which has 'escaped' UTF characters) within a form. I can get to the root of the problem, find an existing patch in LH or create a patch with test cases, etc, but that isn't the point.

Encoding problems are not really bugs, but they crash and burn the application. Even worse - some problems occur only on the production environment (e.g. where the default encoding turns out to be ASCII) and the error messages are far from suggesting what the problem is, or even how to fix it.
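For readers who have not hit this yet, here is a minimal sketch of the kind of failure described above. It is not the actual Rails code path: the variable names are made up, and the strings are built from code points so the example does not depend on the source encoding of the file it lives in.

```ruby
# A UTF-8 string containing non-ASCII characters (stands in for the <label>).
label = [0x17c, 0xf3, 0x142, 0x77].pack("U*")

# Non-ASCII bytes tagged as binary (ASCII-8BIT), as if they came from a
# library that is not encoding-aware (stands in for the <textarea>).
textarea = [0x17c].pack("U*").b

error = nil
begin
  label + textarea
rescue Encoding::CompatibilityError => e
  error = e
end
puts "concat failed: #{error.class}"

# Pure-ASCII binary content concatenates without complaint, which is why
# the problem stays hidden until non-ASCII data shows up.
ok = label + "plain".b
puts ok.encoding
```

The second half is the part that bites in production: the same code works for every ASCII-only input and only fails once real-world data arrives.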

I was wondering if there was a way to save people a lot of grief. Especially people who are trying to mix 'a stable rails version' with Ruby 1.9. A few solutions that come into mind:

  1. Warn about using ROR 2.3 with Ruby 1.9 in UTF applications - tell people to use 1.8.7 only or wait for Rails 3.

  2. Force encoding to UTF8 in concat (output_safety.rb) when there is a UTF/ASCII mismatch - as an option in the configuration. Sure, this will probably kill performance, but killing developer time is worse. I realize this is ugly, but this may make 1.9 at least somewhat usable without too much work on the 2-3-stable branch.

  3. Hunt down all the cases where ascii encoding is used and breaks concatenation with UTF strings, release 2.3.6 or 2.3.7 with the fixes - and close the bugs on LH.

  4. The inverse of 3 - get Rails to work properly with RUBYOPT=-KU - currently this means adding 'encoding: ascii' to tmail files, because of escaped characters.

  5. Create an 'unstable' branch for Rails 1.9, where patches can be added sooner without too much worry for breaking things

  6. Create an official page on how to deal (or not deal) with UTF problems in rails 2.3

Personally, I would go for 2 and have an option that tries to 'fix' obvious cases with encoding errors - or at least in a production environment.

I would be more than happy to put effort into solving these issues if the status of the bugs in LH wouldn't remain too discouraging.

Thank you for your time, your attention and any suggestions,

Cezary Baginski

  • since no one seriously considers ruby 1.9 ready for production, nobody is going to spend time merging patches for 1.9 encoding support, so sending patches is a waste of time

All the “points” you listed basically just repeat what you stated here in your last observation.

Sending (quality) patches is never a waste of time. Patience is a virtue.

I completely agree with both statements.

And I also think that expecting people to prepare wonderful patches for rails 2.3.5 (released almost 5 months ago AFAIR) without some encouragement or directions would be a little too much. The reasons:

  1. Debugging which source file or line of code (part of rails or not) emits an ASCII-8BIT string is very time consuming (since the point of failure is very far from the cause). Without this, it is difficult to determine if it already has a LH ticket or not.

  2. There are already many 1.9 tickets present in 2.3.5 with no applicable 'solutions' - to list just some I have been bitten by already, or stumbled upon when searching for existing patches/duplicates:

    #1988 Make utf8 partial rendering from within a content_for work in ruby1.9
    #2188 Encoding error in Ruby1.9 for templates
    #2476 ASCII-8BIT encoding of query results in rails 2.3.2 and ruby 1.9.1
    #3331 [PATCH] block invalid chars to come in rails app.
    #3392 rack.input requires ASCII-8BIT encoded StringIO
    #3941 "invalid byte sequence in US-ASCII" error for UTF-8 localized messages
    #4336 Ruby1.9: submitted string form parameters with non-ASCII characters cause encoding errors

  3. 1.8.7 is recommended for Rails. That is ok. But although the 2.3.5 release notes mention 1.9, they don't state anything about potential UTF-8 problems with Ruby 1.9 (except for people's comments), nor do they suggest what to do with such problems (e.g. 'wait until X', 'we are waiting for patches', 'send test cases', 'use 1.8.7', 'try -KU option', 'you are on your own unless you only use en_us'). And there is also no mention of how to report issues effectively or which commit to use to avoid reporting something already on LH.

  4. When using a combination of software (cucumber, webrat, rspec) it may be *very* time consuming to even determine which gem is the cause of the problem and which ones just send the problem further down the call stack.

  5. It is unreasonable to expect people to not try Rails with Ruby 1.9 (even if by accident) and the worst thing is that it *seems* to work, until UTF8 characters are used somewhere (template, db, etc). No warning is given if Ruby 1.9 is used. So the natural thing to assume when something breaks is that one's setup is wrong. Which is true - it's using Ruby 1.9 in the first place.

  6. Although I don't want absolute morons to use Rails, having no 'fail-safe' or warning will just scare good developers from Rails just wanting to try out the framework, even if the issues are not Rails bugs. There is no 'recommended' set of patches to apply and test before reporting bugs with Ruby 1.9.

  7. Most of the solutions you find for encoding problems with ROR and Ruby 1.9 do not suggest the following: stick with 1.8, because 1.9 with Rails is a can of worms in this regard.

I was wondering if this isn't really something more suitable for ruby-core: it would be nice to know where the string causing the error was created and why a given encoding was selected. This could at least provide bug reports with better details regarding the root cause.

I am really not the brightest developer out there and I apologize for not being able to propose something more useful than just stating obvious problems.

My question is: how can I help in a meaningful way that isn't a complete waste of my time and that isn't a duplication of other people's efforts?

Since patches are never a waste of time, I propose the following:

My first patch would be a warning about using Ruby 1.9 with Rails. To save people grief when they install Ruby 1.9 as their default.

My second patch is to rescue an exception in concat (output_safety), work around it with force_encoding if it is sane and issue a warning. Just to try to help solve other issues that just *seem* related.
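A rough sketch of what such a rescue could look like follows. This is not the Rails `concat` or anything from output_safety.rb: `tolerant_concat` and its `warnings` argument are hypothetical names, and "sane" is interpreted here as "the bytes are valid in the buffer's encoding".

```ruby
# Hypothetical helper: append `fragment` to `buffer`, retagging the fragment
# with the buffer's encoding when Ruby refuses to concatenate the two.
def tolerant_concat(buffer, fragment, warnings = [])
  buffer << fragment
rescue Encoding::CompatibilityError
  retagged = fragment.dup.force_encoding(buffer.encoding)
  # Only accept the workaround if the bytes make sense in the target
  # encoding; otherwise re-raise the original error.
  raise unless retagged.valid_encoding?
  warnings << "retagged #{fragment.encoding} fragment as #{buffer.encoding}"
  buffer << retagged
end

buffer   = [0x17c].pack("U*")    # UTF-8 buffer with a non-ASCII character
fragment = [0xf3].pack("U*").b   # valid UTF-8 bytes, but tagged as binary
log      = []

tolerant_concat(buffer, fragment, log)
puts log
```

Collecting the warnings in an array keeps the sketch testable; a real patch would presumably write them to the Rails log instead. Note that force_encoding only changes the tag on the bytes, so this helps when the data really is UTF-8 and was merely mislabeled.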

Then I would put my efforts into discussing the issue on ruby-core if it would be possible to add location info (and reason for selected encoding: env, locale, magic, param, etc) for string creation on a test version of Rails - this may save many tens of thousands of man hours that would be wasted on debugging and help in the adoption of not only Ruby 1.9, but in good practices regarding supporting non-US languages in other gems.

Then I would build a special version of Ruby that warns whenever a string is not created as UTF-8 and isn't explicitly created as ASCII, fork Rails and start adding test cases.

Would this really be the best approach?

Thanks in advance.

Czarek, thanks for raising this. Working with 1.9 string encodings is too unforgiving.

Most of the patches so far amount to forcing UTF-8 or 8-bit ASCII everywhere. Not acceptable or even desirable.

I love the idea of trying to force the encoding on concat and giving a warning. It'd be wonderful if Ruby itself offered an encoding sniffer so we could attempt to transcode as well.

Then your app works plus you get the information you need to encourage the library author to add 1.9 support.
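To make the "sniff, then transcode" idea concrete, here is a very naive sketch. Ruby offers no such sniffer; `sniff_encoding` and `to_utf8` are hypothetical helpers, and the candidate list is an assumption. Real detection is much harder, because almost any byte sequence is valid in a single-byte encoding such as ISO-8859-1.

```ruby
# Hypothetical sniffer: return the first candidate encoding under which the
# raw bytes are valid. Order matters: ISO-8859-1 accepts nearly anything,
# so it only makes sense as a last resort.
def sniff_encoding(bytes, candidates = [Encoding::UTF_8, Encoding::ISO_8859_1])
  candidates.find { |enc| bytes.dup.force_encoding(enc).valid_encoding? }
end

# Retag the bytes with the sniffed encoding, then transcode to UTF-8.
# Returns nil when no candidate fits.
def to_utf8(bytes)
  enc = sniff_encoding(bytes)
  enc && bytes.dup.force_encoding(enc).encode(Encoding::UTF_8)
end

latin1_bytes = [0xf3].pack("C*")   # a single 0xF3 byte, tagged as binary
puts sniff_encoding(latin1_bytes)
puts to_utf8(latin1_bytes).encoding
```

The difference from plain force_encoding is the `encode` step: the bytes are actually converted, not just relabeled, which is what "attempt to transcode" would require.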

This would make for an excellent Ruby Summer of Code project.

jeremy

Czarek,

Thanks so much for this detailed email. I think it brings the issues people have been having into sharp relief. For some background, Ruby 1.9 has historically had a number of issues that were troublesome to Rails. In at least one case (Ruby 1.9 changed constant lookup in a very confusing and backward incompatible way), we were able to lobby ruby-core to backtrack on their decision.

A very useful thing for us to determine here would be what we could do in ruby-core to simplify this issue. One possible solution we’ve discussed has been to enable (and use) a default_source_encoding, which would (at very least) pick up the default language from the environment.

I have some more comments inline.

Yehuda Katz Developer | Engine Yard (ph) 718.877.1325

  • since no one seriously considers ruby 1.9 ready for production, nobody is going to spend time merging patches for 1.9 encoding support, so sending patches is a waste of time

All the “points” you listed basically just repeat what you stated here in your last observation.

Sending (quality) patches is never a waste of time. Patience is a virtue.

I completely agree with both statements.

And I also think that expecting people to prepare wonderful patches for rails 2.3.5 (released almost 5 months ago AFAIR) without some encouragement or directions would be a little too much. The reasons:

We are still maintaining Rails 2.3.5, and will continue to do so for the near-future. Patches that add features to 2.3.x will probably be met with serious scrutiny after 3.0 is final, but patches which fix bugs in any supported version of Ruby (including 1.9.2, once it’s released), will continue to be considered.

  1. Debugging which source file or line of code (part of rails or not) emits an ASCII-8BIT string is very time consuming (since the point of failure is very far from the cause). Without this, it is difficult to determine if it already has a LH ticket or not.

Yes. This blows. Again, I think this comes down to a poor choice for default source encoding (ASCII-8BIT). In my opinion, ruby-core should make the default source encoding UTF-8. If this causes backward compatibility issues, they should be handled in the Ruby code that introduces the issues, and allowing the user to change the default source encoding would probably be helpful as well.

  2. There are already many 1.9 tickets present in 2.3.5 with no applicable ‘solutions’ - to list just some I have been bitten by already, or stumbled upon when searching for existing patches/duplicates:

[https://rails.lighthouseapp.com/projects/8994/tickets/1988-make-utf8-partial-rendering-from-within-a-content_for-work-in-ruby19](https://rails.lighthouseapp.com/projects/8994/tickets/1988-make-utf8-partial-rendering-from-within-a-content_for-work-in-ruby19)


[https://rails.lighthouseapp.com/projects/8994/tickets/2188-i18n-fails-with-multibyte-strings-in-ruby-19-similar-to-2038](https://rails.lighthouseapp.com/projects/8994/tickets/2188-i18n-fails-with-multibyte-strings-in-ruby-19-similar-to-2038)


[https://rails.lighthouseapp.com/projects/8994/tickets/2476-ascii-8bit-encoding-of-query-results-in-rails-232-and-ruby-191](https://rails.lighthouseapp.com/projects/8994/tickets/2476-ascii-8bit-encoding-of-query-results-in-rails-232-and-ruby-191)


[https://rails.lighthouseapp.com/projects/8994/tickets/3331-patch-block-invalid-chars-to-come-in-rails-app](https://rails.lighthouseapp.com/projects/8994/tickets/3331-patch-block-invalid-chars-to-come-in-rails-app)


[https://rails.lighthouseapp.com/projects/8994/tickets/3392-rackinput-requires-ascii-8bit-encoded-stringio](https://rails.lighthouseapp.com/projects/8994/tickets/3392-rackinput-requires-ascii-8bit-encoded-stringio)


[https://rails.lighthouseapp.com/projects/8994/tickets/3941](https://rails.lighthouseapp.com/projects/8994/tickets/3941)

[https://rails.lighthouseapp.com/projects/8994/tickets/4336-ruby19-submitted-string-form-parameters-with-non-ascii-characters-cause-encoding-errors](https://rails.lighthouseapp.com/projects/8994/tickets/4336-ruby19-submitted-string-form-parameters-with-non-ascii-characters-cause-encoding-errors)

As Jeremy said, this entire process is far too error-prone. We need to work with ruby-core, before they release 1.9.2, to create a solution that doesn’t introduce this sort of problem. In my opinion (and you can quote me on this), 1.9.x is DOA until this problem is addressed in a way that does not lead to the sorts of tickets you showed above.

  3. 1.8.7 is recommended for Rails. That is ok. But although the 2.3.5 release notes mention 1.9, they don’t state anything about potential UTF-8 problems with Ruby 1.9 (except for people’s comments), nor do they suggest what to do with such problems (e.g. ‘wait until X’, ‘we are waiting for patches’, ‘send test cases’, ‘use 1.8.7’, ‘try -KU option’, ‘you are on your own unless you only use en_us’). And there is also no mention of how to report issues effectively or which commit to use to avoid reporting something already on LH.

I agree. I’d also point out that in the past year, attempting to maintain compatibility with 1.9.x has been extremely frustrating for Rails. In addition to feature problems (encodings, constant lookup), we’ve been met with repeated segfaults in both 1.9.1 and 1.9.2-*. Tracking down segfaults is tricky, and while rails-core needs to attempt to keep up with 1.9.2-head, you as a user should not be using a version of Ruby that is known to segfault in pure-ruby code. To be clear, you may have never encountered any segfaults, but we encounter them often when running the Rails test suite. Note that Rails itself is pure Ruby, and the problems we have had are invariably reproducible without any C extensions.

  4. When using a combination of software (cucumber, webrat, rspec) it may be very time consuming to even determine which gem is the cause of the problem and which ones just send the problem further down the call stack.

Indeed. This is why the whack-a-mole solution is unacceptable. At this point, we’ve clearly demonstrated that the basic strategy of making String literals in Ruby source files 8-bit-ASCII and providing no mechanism (except file-for-file magic comments) is too unwieldy.

  5. It is unreasonable to expect people to not try Rails with Ruby 1.9 (even if by accident) and the worst thing is that it seems to work, until UTF8 characters are used somewhere (template, db, etc). No warning is given if Ruby 1.9 is used. So the natural thing to assume when something breaks is that one’s setup is wrong. Which is true - it’s using Ruby 1.9 in the first place.

I agree. That said, I would personally not run a production Rails application on Ruby 1.9.x until 1.9.2 is released and all known issues (especially the segfaults I mentioned above) are resolved. One thing that would make me feel more comfortable would be if ruby-core ran the Rails test suite against 1.9.2-head. I know they’re not obligated to do so, but it would make the process significantly more robust. Rails core (and specifically Carl and I) would happily invest whatever time needed to help the Ruby core team get (and stay) up and running with the Rails suite.

  6. Although I don’t want absolute morons to use Rails, having no ‘fail-safe’ or warning will just scare good developers from Rails just wanting to try out the framework, even if the issues are not Rails bugs. There is no ‘recommended’ set of patches to apply and test before reporting bugs with Ruby 1.9.

Agreed. And to be clear, I don’t see any reason that someone who’s using PHP today shouldn’t be able to use Rails tomorrow.

  7. Most of the solutions you find for encoding problems with ROR and Ruby 1.9 do not suggest the following: stick with 1.8, because 1.9 with Rails is a can of worms in this regard.

That is the recommended solution.

I was wondering if this isn’t really something more suitable for ruby-core: it would be nice to know where the string causing the error was created and why a given encoding was selected. This could at least provide bug reports with better details regarding the root cause.

Tracking the origin of every String might be expensive. Perhaps a debug mode that did this would be helpful. That said, as I said above, I don’t believe that ASCII-8BIT is a good default for source files.
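As an illustration of what such an opt-in debug mode could report, here is a sketch built on allocation tracing from the `objspace` extension, which ships with later Ruby releases and is not something this thread assumes for 1.9. `string_origin` is a hypothetical helper name. Tracing only covers objects allocated while it is switched on, which matches the "expensive, so make it a debug mode" concern.

```ruby
require "objspace"

# Hypothetical helper: "file:line" where `str` was allocated, or nil when
# the allocation was not traced.
def string_origin(str)
  file = ObjectSpace.allocation_sourcefile(str)
  file && "#{file}:#{ObjectSpace.allocation_sourceline(str)}"
end

report = nil
ObjectSpace.trace_object_allocations do
  # Stand-in for a library handing back bytes tagged as ASCII-8BIT.
  suspect = [0x63, 0x61, 0x66, 0xc3, 0xa9].pack("C*")
  report = "#{suspect.encoding} string allocated at #{string_origin(suspect)}"
end
puts report
```

This answers "where was the string created" but not "why was this encoding selected"; the latter would still need support from Ruby itself.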

I am really not the brightest developer out there and I apologize for not being able to propose something more useful than just stating obvious problems.

Your ability to clearly articulate the problems puts you head and shoulders above most developers. Thank you very much for your efforts in clearly outlining the issues.

My question is: how can I help in a meaningful way that isn’t a complete waste of my time and that isn’t a duplication of other people’s efforts?

Since patches are never a waste of time, I propose the following:

My first patch would be a warning about using Ruby 1.9 with Rails. To save people grief when they install Ruby 1.9 as their default.

That seems good. Would it be a warning in the initial Rails boot check (the one that blocks running Rails with 1.8.6 and below)? That seems like the right place to me. We should perhaps have a more expansive explanation of the issues with 1.9 and encodings (possibly a guide) that we could link to.
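A minimal sketch of such a check is below. It is not the existing Rails boot check: the method name and the message wording are placeholders, and no guide URL is included because none exists yet. The red colouring uses plain ANSI escape codes.

```ruby
# Hypothetical boot-time check. Returns a red warning string on Ruby 1.9 or
# newer, nil on older versions.
def ruby19_encoding_warning(ruby_version = RUBY_VERSION)
  return nil if Gem::Version.new(ruby_version) < Gem::Version.new("1.9")

  red   = "\e[31m"
  reset = "\e[0m"
  "#{red}WARNING: Ruby #{ruby_version} strings are encoding-aware. " \
    "Non-ASCII data may raise encoding errors with this Rails version.#{reset}"
end

message = ruby19_encoding_warning
warn message if message
```

Comparing with `Gem::Version` rather than plain strings avoids the trap where "1.10" sorts before "1.9" lexically.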

My second patch is to rescue an exception in concat (output_safety), work around it with force_encoding if it is sane and issue a warning. Just to try to help solve other issues that just seem related.

I’d want to see a log warning, in red, not just a Ruby warning that could be hidden. I’d like to discuss applying this solution to master as well. Would you mind hitting me up on GTalk (wycats@gmail.com)?

Then I would put my efforts into discussing the issue on ruby-core if it would be possible to add location info (and reason for selected encoding: env, locale, magic, param, etc) for string creation on a test version of Rails - this may save many tens of thousands of man hours that would be wasted on debugging and help in the adoption of not only Ruby 1.9, but in good practices regarding supporting non-US languages in other gems.

I agree entirely. I will be happy to help lobby for these (or related) changes. Do you think it makes sense to change the default source encoding?

Then I would build a special version of Ruby that warns whenever a string is not created as UTF-8 and isn’t explicitly created as ASCII, fork Rails and start adding test cases.

I’d love to help you with whichever of these efforts you think my assistance would be valuable in. Again, please ping me.

Would this really be the best approach?

It sounds on the right track :)

Thanks in advance.

Again, thanks for your efforts here. It’s too easy to get angry, post a rant, and just leave entirely (or privately seethe). Your post here is a model example of how I would personally like people to express their concerns about serious problems that seem to remain unaddressed (or underaddressed).

That reminds me of something I was wondering about: what is the official recommended Ruby version for a production app? ruby 1.8.7 p72? ree? What about rails 3.0?

That is actually a very good idea for both projects. But there are a couple of caveats. First of all, it would not be reasonable for a copy of the rails source to be included in the ruby code repository. Thus like the rspec test suite, it would be an external test suite, but a couple of makefile commands could be provided to fetch the suite and run the tests.

Second, it would be unreasonable to expect ruby-core to spend time debugging rails. Thus you should provide a branch for this purpose which is only resynced with trunk periodically, at points where there are no failures under 1.8.7, and no known failures under ruby-HEAD that are not caused by bugs in ruby itself.

For reasons of reproducibility, this branch should have all pure-ruby dependencies vendored in. Yes, even though bundler can provide the reproducibility, this should still be the case, so as to keep the whole thing as clean as possible from ruby-core's perspective.

The branch should have no dependencies requiring C extensions. If that means some tests can't be run, too bad. At least the vast majority will be run.

If you are willing to set up such a branch, I'd tend to suspect ruby-core may be willing to use such a test suite.

The advantage to ruby-core is that they gain an additional test suite that more closely matches real world code than either of the current test-suites (the implementation test suite, and the rspec specification test suite). It would help detect changes that unintentionally break backwards compatibility, and in cases where segfaults occur, help in locating bugs in ruby itself.

Just my thoughts on the subject.

Czarek, thanks for raising this. Working with 1.9 string encodings is too unforgiving.

After working on #2188 I can say I know exactly what you mean :)

Most of the patches so far amount to forcing UTF-8 or 8-bit ASCII everywhere. Not acceptable or even desirable.

I am currently trying to work out a way to help people "screen" their patches, one that shows them exactly in which cases a patch won't work, without getting into too many m17n details. The amount of knowledge about encoding internals needed to provide a useful patch is too much for people just trying to globalize their applications with Ruby 1.9.

I love the idea of trying to force the encoding on concat and giving a warning. It'd be wonderful if Ruby itself offered an encoding sniffer so we could attempt to transcode as well.

Then your app works plus you get the information you need to encourage the library author to add 1.9 support.

Great idea! Do you think it would also make sense to rescue concat errors, treat them as a bug with a link to a general LH ticket about reporting encoding errors? I mean handling concat errors is much too late and probably requires Artificial Intelligence :)

This way we could:

  - quickly classify known and new problems

  - design reproducible test cases for new problems

  - design ways to prevent non-default-encoded string from entering Rails

  - design a general sniffer for plugins/gems that could introduce problems in places we cannot control/check otherwise

  - close Rails 1.9 encoding compatibility tickets and milestones

  - create new LH tickets for new errors beyond a specific Rails version

This would make for an excellent Ruby Summer of Code project.

I wonder if one whole summer is enough ;)