Overview of Ruby 1.9 encoding problem tickets

Cezary_Baginski · April 19, 2010, 1:58pm

SUMMARY:

bitsweat · April 19, 2010, 6:30pm

SUMMARY:
--------

I tried to identify the general and root causes for these problems with 1.9, by taking into account non-utf encoding, current patches, comments and ideas. I used ticket #2188 as base for explanations.

This is a long read. I wanted to include all the relevant information in one place. I also included information about related tickets in LH and their status. I decided that adding parts of this to LH would just add to the confusion.

Two patches are included (one is from Andrew Grim) that should fix one issue (#2188) without breaking anything else. Two small steps for Rails, one giant step for proper encoding support. I hope.

I welcome any feedback that would help get Rails closer to fully supporting Ruby 1.9 and vice-versa.

SOLUTION:
---------

The general idea is: allow only one "internal" encoding in Rails at any given time, based on the default Ruby encoding (or configurable).

And treat any incoming external strings that cannot be converted to this "internal" encoding as errors in the gems in which they occur. And possibly report mismatches before they even "enter" Rails, by attempting to convert them into the "internal" encoding immediately.

As a result of enforcing this, all Rails tests should work with any encoding that is a superset of the encodings used for input (db, Rack, ERB, Haml, ...) in a given environment.

With an optimal setup (db encoding, Ruby encoding, Rack encoding settings, I18n translations, ...), no transcoding will occur during the rendering process, no matter which default Rails encoding is used (including ASCII_8BIT), and no force_encoding would be needed internally in Rails, except as workarounds for gems and libraries where this is difficult otherwise.

The guideline for gem and plugin developers would be: do not create or return strings (other than internal use) that are not compatible with the default encoding both ways.

In some cases, it may be acceptable to drop or escape characters that cannot be transcoded (maybe Rack input, for example).

+1
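To make the "convert at the boundary, fail loudly" idea concrete, here is a minimal sketch. The helper name `to_internal` is hypothetical and not part of Rails or either patch; it also assumes that untagged (ASCII-8BIT) input should be re-labelled and validated rather than transcoded, which is the same distinction the proposed ERB workaround makes.

```ruby
# Hypothetical boundary helper: bring an incoming string into the single
# "internal" encoding, or raise at the point where the string enters.
def to_internal(str, internal = Encoding.default_external)
  if str.encoding == internal
    # Same label already: only check that the bytes really fit it.
    raise ArgumentError, "invalid #{internal} byte sequence" unless str.valid_encoding?
    return str
  end

  if str.encoding == Encoding::ASCII_8BIT
    # Untagged bytes cannot be transcoded; re-tag a copy and validate it.
    tagged = str.dup.force_encoding(internal)
    raise ArgumentError, "bytes are not valid #{internal}" unless tagged.valid_encoding?
    return tagged
  end

  # Real transcoding. Raises Encoding::UndefinedConversionError or
  # Encoding::InvalidByteSequenceError if the conversion is impossible.
  str.encode(internal)
end

latin1 = "caf\u00E9".encode("ISO-8859-1")       # "café" as an ISO-8859-1 string
puts to_internal(latin1, Encoding::UTF_8).encoding
```

With an optimal setup the first branch is the only one taken, so nothing is copied or transcoded.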

The idea is based on:

- Jeremy Kemper's strong attitude toward avoiding solutions requiring UTF-8 as default or forcing it

- Yehuda's opinion about using UTF-8 as default in Ruby instead of ASCII-8BIT

- James Edward Gray's solution for encoding issues in CSV

- the multitude of ways to set the encoding in Ruby

- giving everyone the liberty to use any encoding they want for any task, without the need of porting and modifying existing code if possible

- personal experience with many encoding pitfalls

For those interested in Ruby encoding support, I very much recommend the extremely well written in-depth article by James Edward Gray II on Gray Soft.

Results of "Please do investigate":
----------------------------------

The ticket:

#2188: (March 9th, 2009): Encoding error in Ruby1.9 for templates

Actual cause: ERB uses force_encoding("ASCII-8BIT") ("BINARY" is just an alias for it). This is actually ok, except for the way Ruby 1.9 handles concat with a non-BINARY string, e.g. UTF-8:

>> '日本'.force_encoding('BINARY').concat('語'.force_encoding('UTF-8'))
Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8

Although the following works (equivalent to how Ruby 1.8 works):

>> '日本'.force_encoding('BINARY').concat('語'.force_encoding('BINARY'))
=> "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"

The surprise is that it "sometimes works", namely when the BINARY string contains only 7-bit ASCII characters, giving the impression that a patch fixed the problem:

>> 'abc'.force_encoding('BINARY').concat('語'.force_encoding('UTF-8'))
=> "abc語"

(I used force_encoding here for consistency in different locale settings).
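The three cases above can also be checked without triggering the exception, by asking Ruby which encoding a concatenation would end up with. This is only a sketch: `concat_encoding` is a hypothetical wrapper name, while `Encoding.compatible?` is a core method that returns nil when the two strings cannot be combined.

```ruby
# Returns the encoding a concatenation would have, or nil if String#concat
# would raise Encoding::CompatibilityError. (Hypothetical helper name.)
def concat_encoding(left, right)
  Encoding.compatible?(left, right)
end

nihon = "\u65E5\u672C"   # 日本
go    = "\u8A9E"         # 語

binary_nihon = nihon.dup.force_encoding('BINARY')
binary_go    = go.dup.force_encoding('BINARY')
binary_abc   = 'abc'.dup.force_encoding('BINARY')

p concat_encoding(binary_nihon, go)          # non-ASCII BINARY + UTF-8
p concat_encoding(binary_nihon, binary_go)   # BINARY + BINARY
p concat_encoding(binary_abc, go)            # 7-bit-only BINARY + UTF-8
```

The third case is the "sometimes works" trap: a BINARY string holding only 7-bit characters is compatible with anything ASCII-compatible, so a test with plain ASCII data proves nothing.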

Solutions that come to mind:
-----------------------------

1. force_encoding should not be used, unless really necessary, and this rule should be applied to ERB. Unfortunately, I have no idea why ERB uses force_encoding, but I can come up with a few reasons, the main one being: Rails uses ERB (a general lib) for a specific purpose and requiring a non-ASCII-8BIT encoding is just as specific. I would really like an opinion on this.

I don't know why ERB forces encoding to ASCII-8BIT in the absence of a magic comment. See r21170. The ERB compiler should probably take a default source encoding option that's used if the magic comment is missing.

2. Don't use ERB. AFAIK, this is why Rails 3.0 works.

Using Erubis is a possibility as well.

3. Treat everything as binary, since the resulting file is sent to a browser, which will detect the encoding anyway. This also doesn't affect performance, but it ruins the whole idea of having encoding support, possibly breaking test frameworks instead.

-1

4. Force UTF-8. This is the brute-force idea used in many patches and workarounds, and this prevents commits from happening. People should have a right to use non-utf8 ERB files and render in any encoding e.g. EUC-JP.

-1

5. Try to be intelligent, and guess. This means handling everything, except BINARY. The problem is how do we know what encoding to use for template input? And what encoding do we use for output?

We could set a single default encoding for the app, like we're doing in Rails 3.

Solution 1 would be best, but force_encoding is already in the wild with Ruby 1.9, including ruby-head, so that leaves solution 5. Option 3 is a way to get Ruby 1.9 to behave more like 1.8, but will require all template input strings to be set to BINARY.

Solution 5
----------

force_encoding has to be used at least once somewhere in Rails - to fix what ERB "breaks", but on what basis should the encoding be selected? For performance, there should be no transcoding during rendering, unless absolutely necessary.

When we think about it, the output depends on what we want the browser to receive, and that is why many people are pushing UTF-8: the layout usually has UTF-8 anyway, and it would otherwise have to be parsed to get the encoding from the content-type value.

The input used in rendering a template is a mixture of what web designers provide, the translators use, the databases return and Rack emits, among other things.

The policy in Rails could be: "don't allow multiple encodings during template rendering". I believe the effort required to do otherwise is not justified.

This would force other gem developers to provide a way to set or read the correct encoding they use or stick with the current default. In this case (#2188), either ERB has to provide a way to return the result in an encoding specified by Rails, or the ERB handler should be adapted to provide this functionality.

The problem with this: ERB templates do not have an embedded encoding. Which means we need a way to specify the encoding used in the template.

Andrew Grim fixes this in his patch here:

https://rails.lighthouseapp.com/projects/8994/tickets/2188/a/359640/erb_encoding.diff

I am only worried about the default case, when no encoding is set. "ASCII_8BIT", the result of ERB, is not acceptable, unless the "internal" encoding would also be BINARY. I would propose merging the following with the patch above:
  def compile(template)
    input = "<% __in_erb_template=true %>#{template.source}"
    src = ::ERB.new(input, nil, erb_trim_mode, '@output_buffer').src

    if RUBY_VERSION >= '1.9' and src.encoding != input.encoding
      if src.encoding == Encoding::ASCII_8BIT
        src = src.force_encoding(input.encoding) # ERB workaround
      else
        src = src.encode(input.encoding)
      end
    end

    # Ruby 1.9 prepends an encoding to the source. However this is
    # useless because you can only set an encoding on the first line
    RUBY_VERSION >= '1.9' ? src.sub(/\A#coding:.*\n/, '') : src
  end
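The encoding part of that method can be tried outside Rails. The sketch below pulls the same two branches into a hypothetical helper, `restore_source_encoding`, and feeds it a hand-made BINARY string standing in for what the ERB compiler returns; it does not call ERB itself, so it shows nothing about ERB's actual behaviour.

```ruby
# Hypothetical stand-alone version of the encoding branch in the patch:
# BINARY output is re-labelled (bytes untouched), anything else is transcoded.
def restore_source_encoding(src, input_encoding)
  return src if src.encoding == input_encoding

  if src.encoding == Encoding::ASCII_8BIT
    src.dup.force_encoding(input_encoding)   # same bytes, new label
  else
    src.encode(input_encoding)               # real transcoding
  end
end

# Stand-in for compiled template source that came back tagged as BINARY.
compiled = "@output_buffer << \"\u65E5\u672C\"".dup.force_encoding('BINARY')

restored = restore_source_encoding(compiled, Encoding::UTF_8)
puts restored.encoding
puts restored.valid_encoding?
```

Re-labelling is only safe because the bytes came from the input in the first place; for any other source encoding the helper transcodes instead.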

The ERB compiler is supposed to preserve the input file's source encoding unless it has a magic comment. Puzzled why this is necessary. It should also be fixed in ERB itself, I think.

And here is an example test case, similar to many others already in the tickets, which shows the issue:

<%= "日本" %><%= "語".force_encoding("UTF-8") %>

A few things here to note (for both patches put together):

- the fallback encoding would be assumed to be the same as ruby default, which can be set by the locale, RUBYOPT with -K option, or using Encoding.default_*. I believe this is sufficient flexibility.

- note that there are no assumptions regarding the charset and the ASCII_8BIT case is handled with this in mind

- obviously, test cases would be executed with different Ruby encoding defaults - testing one setup no longer guarantees anything. Rails tests should work with almost any default encoding, which means testing at least on 3 should be recommended before a patch is committed: (BINARY + UTF-8 + EUC ?).

- similar conversion to the "internal" encoding would be required for all strings from other engines, databases and Rack, regardless of whether they are in UTF-8 or not. As for Rack and strings submitted through forms, they should ultimately be also in the "internal" encoding and not BINARY (unless "internal" *is* BINARY), but getting this to work is a can of worms in itself (AFAIK, this is true for native Japanese sites, where assuming UTF-8 is almost never valid).

- there are a few other places where ERB is used, but I prefer to leave that until this single case is solved. Fixing other template issues should be done separately.

I hope this is enough to be committed into 2-3-stable, IMHO. At least as a first step after many months of threads, discussions, issues, tickets, articles, without any fully acceptable patches or progress.

Also, I believe the tickets in LH need some love - just to straighten out the issue and introduce more clarity. The best results would be to start closing the tickets with definite conclusions and guidelines, so that people start using Ruby 1.9 with Rails, so plugin developers in turn get enough time and feedback to get things right.

IMPORTANT: I had no intention of offending anyone by the following digests - I just wanted to provide an overview of the lack of progress, the complexity of the issue and the willingness to help, despite months without progress. I admit I have no idea what prevented the problem from being solved a long time ago.

Ticket #2188: Encoding error in Ruby1.9 for templates

1. Incorrect mention of I18n and #2038 as similar error
2. Correctly identified problem (Hector E. Gomez Morales)
3. Patch forcing UTF8 as workaround, #1988 reported as dup (Hector)
4. Unintentional hijacking with a MySQL problem (crazy_bug)
5. MySQL DB problem redirected to #2476 (Hector)
6. Unintentional hijacking with a HAML problem (Portfonica)
7. Jakub Kuźma identifies a wider set of problems
8. Jakub Kuźma identifies Rack problems
9. Adam S talks about setting default encoding in Rails
10. Jérôme points out the need for a default encoding for erb files
11. Jeremy Kemper notes that the reports are not really helpful
12. Rocco Di Leo provides detailed test case, but formatting problems make it unreadable
13. Adam S suggests solving the problem by converting ASCII -> UTF8
14. hkstar mentions the lack of progress
15. Jeremy Kemper notes that the issue still hasn't been properly investigated
16. Turns into a discussion about UTF-8 support in 1.9
17. Andrew Grim proposes alternative patch that honors ERB template encoding
18. ahaller notes strange behaviour in ERB
19. Marcello Barnaba proposes general monkey patch for ActionView, probably related to Rack issues
20. UVSoft proposes patch for HAML
21. Alberto describes the problem - just as Hector did
22. TICKET STATUS IS STILL OPEN WITH NO ACCEPTABLE PATCH

What I propose is combining the two patches above to close this issue, and give references to non-related tickets which give a similar error.

Ok, good. They'll need to be rebased against master, and I think Andrew's patch breaks some tests since it changes the ERB line numbers.

Ticket #1988: Make utf8 partial rendering from within a content_for work in ruby1.9

1. Patch that works around the issue
2. Jeremy Kemper does not accept the patch because it is UTF-8 only
3. TICKET STATUS IS INCOMPLETE

What I propose is solving #2188 first and then investigate this bug further - it could be a bad assumption about the encoding of strings returned by tag helpers in a specific case.

Ticket #2476: ASCII-8BIT encoding of query results in rails 2.3.2 and ruby 1.9.1

1. Hector describes database adapter problem with 1.9 encodings, provides a mysql-ruby fork and other links
2. Patches and fixes for databases / adapters (James Healy, Jakub Kuźma, Yugui)
3. Talk about assuming UTF-8 for databases
4. Loren Segal proposes hack instead of modifying mysql-ruby
5. Michael Hasenstein asks about issue 5 months later
6. UVSoft accidentally posts HAML workaround
7. TICKET STATUS IS NEW

My proposal - after fixing #2188, a short description of adapters/databases and fixed versions could be presented - and possibly have this issue closed, to prevent it being listed as a pending UTF-8 issue. Work could be started on validation code for the strings returned by database adapters and their compatibility with the "internal" encoding.

+1

Open/new tickets related to Rack:

#3331 [PATCH] block invalid chars to come in rails app.
#3392 rack.input requires ASCII-8BIT encoded StringIO
#4336 Ruby1.9: submitted string form parameters with non-ASCII characters cause encoding errors

My proposal: gather issues and investigate with the help of people working with non-utf and non-ascii input - I believe Japan is such a country, where UTF-8 assumptions about Rack input are wrong.

Rack is woefully lagging on encoding support. It needs an encoding push of its own.

Ruby CGI has updated to include just-enough support, e.g. for giving an encoding for parsed query parameters.

I would like to thank everyone who invested even the slightest bit of time in solving this issue.

I hope the information here will help find a solution that will work without issues for years to come and that creating Rails applications will be an enjoyable experience for users, designers, developers, translators and all contributors, regardless of their environment and language preferences.

Indeed! Thanks for leading the charge, Cezary.

jeremy

JN_Coward · April 19, 2010, 8:28pm

It's great to see someone finally take charge of this! I still don't have the greatest grasp of character encodings, but what you're suggesting sounds good.

Maybe one additional thing: make all generators put the magic comment with the standard encoding at the top of all source files they create. Does that sound like a good idea? Should we open a ticket for it?

Just to clarify how important this issue is: Rails 2.3 claims to be Ruby 1.9 compatible, but until this is fixed, even the most trivial of applications simply don't work on 1.9, especially if the application is in a language that often uses non-ASCII characters (pretty much anything other than English, in other words). This has prevented me from moving to Ruby 1.9.

/Jonas

Cezary_Baginski · April 25, 2010, 2:32am

Here are some updates from the time I started working on LH #2188 until I submitted a patch there. Although the patch specifically fixes ERB using workarounds in the Rails ERB handler, I tried to make the approach as generic as possible.

> The general idea is: allow only one "internal" encoding in Rails at any given time, based on the default Ruby encoding (or configurable).

I chose Encoding::default_external for this.

The short story is that Encoding::default_internal shouldn't really matter for Rails.

> As a result of enforcing this, all Rails tests should work with any encoding

Probably the most convenient way to test this is:

RUBYOPT=-Ke rake tests

See #4466 for an example test script for ActionPack and the trivial fixes that make everything work.

> The guideline for gem and plugin developers would be: do not create or return strings (other than internal use) that are not compatible with the default encoding both ways.
>
> In some cases, it may be acceptable to drop or escape characters that cannot be transcoded (maybe Rack input, for example).

+1

String#{encode,encode!} both have nice options for replacing characters and provide almost all the necessary functionality (force_encoding handles a few other surprise cases). Rack, and converting between incompatible encodings, are places where this seems useful.
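For reference, a short sketch of the String#encode options that cover the "drop or escape" cases. The sample text and variable names are made up for illustration; `:undef`, `:replace` and `:xml` are documented options of String#encode.

```ruby
text = "caf\u00E9 \u65E5\u672C"   # "café 日本"

# Drop characters the target encoding cannot represent.
dropped = text.encode("ISO-8859-1", :undef => :replace, :replace => "")

# Substitute a visible placeholder instead of dropping.
marked = text.encode("ISO-8859-1", :undef => :replace, :replace => "?")

# Escape unrepresentable characters as numeric character references,
# which may be acceptable for HTML output.
escaped = text.encode("US-ASCII", :xml => :text)

p dropped.encode("UTF-8")
p marked.encode("UTF-8")
p escaped
```

Note that `:xml => :text` also escapes `&`, `<` and `>`, so it is only appropriate for text that has not been HTML-escaped yet.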

I don't know why ERB forces encoding to ASCII-8BIT in the absence of a magic comment. See r21170. The ERB compiler should probably take a default source encoding option that's used if the magic comment is missing.

Two issues are worth mentioning: regexes have their own encoding semantics and force_encoding is actually necessary if you want to "encode" a string to or from ascii-8bit specifically.

ERB uses a regex to detect the encoding comment, but the regex has to have the same encoding as the source stream, so ERB uses ASCII-8BIT to be able to run the regex on the stream, regardless of the stream's encoding.

Then ERB continues to use that ASCII-8BIT string for compiling, which seems to be ok, because the strings are passed to eval, with an encoding comment in the beginning...

The problem actually lies elsewhere: ERB didn't detect the encoding, because the encoding magic wasn't in the first tag. The first tag was added by Rails ERB handler:

"<% __in_erb_template=true %><%# encoding ...."

Andrew Grim worked this out and created a patch for this in #2188.

Should ERB search the whole stream for an encoding tag? Or should Rails guarantee the first tag has the encoding information? I believe the second option will save more time. Erubis is also a reason to forget about patching ERB directly.

Using Erubis is a possibility as well.

Patching the ERB problem taught me that although this will solve many encoding issues and headaches, it may unfortunately hide a few general design flaws that should be worked on before Rails 3.0 or Ruby 1.9.2 become production ready.

The workarounds I used for patching ERB seem actually quite generic. They allow one to have partials in different encodings and even have ASCII-8BIT as the Ruby default_external without breaking anything. And any encoding incompatibilities occur during encode! calls in the ERB handler - close to the problem.

Something similar could be done for db adapters, because just like the template handler being ERB instead of Erubis, people can have old/broken libs, gems and plugins. And since Rails is becoming more modular with 3.0, additional issues may surface, slowing down development in the long run.

> 3. Treat everything as binary, since the resulting file is sent to a browser, which will detect the encoding anyway. This also doesn't affect performance, but it ruins the whole idea of having encoding support, possibly breaking test frameworks instead.

-1

Actually, it turns out that supporting everything as binary takes really no more effort than supporting multiple encodings and it is a good way to test Rails, applications and gems. ASCII-8BIT is the most restrictive when it comes to encoding, making it ideal for regression tests. Allowing an application to support ASCII-8BIT through default_external requires more effort, but is worth it.

> 4. Force UTF-8. This is the brute-force idea used in many patches and workarounds, and this prevents commits from happening. People should have a right to use non-utf8 ERB files and render in any encoding e.g. EUC-JP.

-1

Complementary to ASCII-8BIT, UTF-8 is ideal for an 'internal' encoding and for detecting cases where ASCII-8BIT is (mis)used. UTF-8 should actually *be* used when there are multiple - incompatible otherwise - encodings. Ruby 1.8 just glues anything together, but in 1.9 everything should first be encoded to something as general as UTF-8 before being encoded to ASCII-8BIT (if there is such a need). For example, this would allow people to make ISO2022_JP web pages from EUC-JP templates and SJIS databases - by using UTF-8 as the internal encoding.

Although choosing UTF-8 seems wrong, in this case it prevents us from losing encoding information when converting to ASCII-8BIT.
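The EUC-JP / SJIS / ISO-2022-JP example can be walked through in plain Ruby. This is a sketch with made-up variable names; the two input strings are created by transcoding literals, standing in for data that would really come from a template file and a database adapter.

```ruby
template_part = "\u65E5\u672C".encode("EUC-JP")   # 日本, as if read from an EUC-JP template
database_part = "\u8A9E".encode("Shift_JIS")      # 語, as if returned by an SJIS database

# Gluing the two together directly is what Ruby 1.9 refuses to do.
begin
  template_part + database_part
rescue Encoding::CompatibilityError => e
  puts "direct concat fails: #{e.message}"
end

# Pivot through one internal encoding, then transcode once for output.
internal = Encoding::UTF_8
page     = [template_part, database_part].map { |s| s.encode(internal) }.join
output   = page.encode("ISO-2022-JP")

puts page.encoding
puts output.encoding
```

Each input is transcoded exactly once on the way in and the page once on the way out; when input and internal encodings already match, `encode` has nothing to do.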

We could set a single default encoding for the app, like we're doing in Rails 3.

I admit I haven't even tried Rails 3.0. Shame on me.

A single default encoding within rails is a must to gracefully handle the example I gave above (with EUC, SJIS and ISO2022). Of course UTF-8 is reasonable, but there is no reason to assume UTF-8 for all cases.

The ERB compiler is supposed to preserve the input file's source encoding unless it has a magic comment. Puzzled why this is necessary. It should also be fixed in ERB itself, I think.

Rails inserts code that breaks ERB's magic comment detection. How does Erubis handle the issue? Does it regex the stream?

> - obviously, test cases would be executed with different Ruby encoding defaults - testing one setup no longer guarantees anything. Rails tests should work with almost any default encoding, which means testing at least on 3 should be recommended before a patch is committed: (BINARY + UTF-8 + EUC ?).

Actually, all 5 cases could be used in Rails tests and in apps:

- no K option, Ks (sjis), Ke (euc-jp), Ku (utf-8), Kn (binary/ascii-8bit)
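Besides RUBYOPT, a single test process can switch the default in-process. The sketch below is an assumption about how one might do that, not something the tickets propose: `with_default_external` is a hypothetical helper, and Shift_JIS is used as a stand-in for whatever encoding -Ks actually selects.

```ruby
# Hypothetical test helper: run a block under a different
# Encoding.default_external and always restore the previous value.
def with_default_external(encoding)
  previous = Encoding.default_external
  begin
    Encoding.default_external = encoding
    yield
  ensure
    Encoding.default_external = previous
  end
end

candidates = [Encoding::ASCII_8BIT, Encoding::UTF_8, Encoding::EUC_JP, Encoding::Shift_JIS]

candidates.each do |enc|
  with_default_external(enc) do
    # A real suite would run its encoding-sensitive assertions here.
    puts "running with default_external=#{Encoding.default_external}"
  end
end
```

This only affects code that consults `Encoding.default_external` at run time; source file encodings and strings created earlier are not changed, so running the whole suite under each -K option remains the more thorough check.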

ActionPack is trivial to fix. Other Rails gems may require more work.

Ok, good. They'll need to be rebased against master, and I think Andrew's patch breaks some tests since it changes the ERB line numbers.

I haven't noticed this. Could you provide some details? I am wondering how I missed this.

I didn't check his patch too thoroughly, since I was busy getting a patch #2188 out the door.

I only checked my own patch (based on his) on ActionPack and ActiveSupport. Currently, everything seems to work, so let me know if I overlooked something.

Rack is woefully lagging on encoding support. It needs an encoding push of its own.

Ruby CGI has updated to include just-enough support, e.g. for giving an encoding for parsed query parameters.

I would handle Rack last or at least after Rails tests work in all the encodings. The reason is: I learned not to underestimate encoding problems and leaving Rack for last seems like a good choice.

Indeed! Thanks for leading the charge, Cezary.

I'm happy to be helpful in some way.

Cezary_Baginski · April 25, 2010, 11:01am

It's great to see someone finally take charge of this! I still don't have the greatest grasp of character encodings, but what you're suggesting sounds good.

Thanks

Maybe one additional thing: make all generators put the magic comment with the standard encoding at the top of all source files they create. Does that sound like a good idea? Should we open a ticket for it?

This is a great idea, since people new to Rails usually both are new to Ruby and use generators. The question is how do we choose the encoding? Consider the following:

% LC_CTYPE=en_US ruby -e 'p IO.read("_foo.rhtml").encoding'
#<Encoding:US-ASCII>

% LC_CTYPE=en_US.UTF-8 ruby -e 'p IO.read("_foo.rhtml").encoding'
#<Encoding:UTF-8>

This is important for partials. People will eventually create partials without the encoding information, which will be rendered from templates. I would prefer us-ascii to be used by generators instead of Ruby's Encoding::default_external for the following reasons:

- user may have a non-UTF8 environment, and us-ascii will more likely give an error closer in the call stack to the file without the encoding comment

- user shouldn't really use non ascii characters in partials and templates - i18n is the solution and will help localize the application when it goes global

- this would help adopt using '# encoding: us-ascii' as a no-brainer solution instead of '# encoding: utf-8' which usually just makes problems more obscure
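To illustrate the first point, here is a sketch of reading a template with an explicit encoding instead of relying on the locale, and failing next to the offending file. `read_template` is a hypothetical helper, not Rails code; the file is a temporary one created only for the demonstration.

```ruby
require 'tempfile'

# Hypothetical helper: read a template with an explicit encoding and fail
# right away if the file's bytes do not fit that encoding.
def read_template(path, encoding)
  source = File.read(path, :encoding => encoding)
  unless source.valid_encoding?
    raise ArgumentError, "#{path}: bytes are not valid #{encoding}"
  end
  source
end

Tempfile.create('_foo.rhtml') do |file|
  file.binmode
  file.write("<p>\u00FCber</p>")   # contains a non-ASCII character (ü)
  file.flush

  puts read_template(file.path, 'UTF-8').encoding

  begin
    read_template(file.path, 'US-ASCII')
  rescue ArgumentError => e
    puts "rejected: #{e.message}"
  end
end
```

With us-ascii as the declared default, the error names the file that lacks a proper encoding comment, instead of surfacing later as a CompatibilityError somewhere in rendering.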

The only upside to using UTF-8 at all instead is quickly fixing huge sites with many localized pages, but generators are for new projects anyway.

So, by all means, yes, please open a ticket, since this may not be too trivial and encoding issues will more likely need good understanding rather than assuming Rails can and will magically fix everything.

Just to clarify how important this issue is: Rails 2.3 claims to be Ruby 1.9 compatible, but until this is fixed, even the most trivial of applications simply don't work on 1.9, especially if the application is in a language that often uses non-ASCII characters (pretty much anything other than English, in other words). This has prevented me from moving to Ruby 1.9.

The m17n support in Ruby 1.9+ is a great concept. Unfortunately, balancing:

- correctness
- performance
- robustness in a production environment

quickly turns encoding problems into philosophical debates. Without a deep understanding of encoding internals it is too easy to "fix" things by just converting to UTF-8, hiding the real issues.

Thanks for bringing this up!

michael.hasenstein · April 25, 2010, 7:25pm

I disagree. There are lots of apps written for just one specific country without any intention of going global. Besides, one can have locale-specific view files, can't we? Having "to i18n" each and every string is a little bit too much. Of course, the folks in the US won't notice, you guys are well off while the rest of the world suffers from such a policy...

....

- user shouldn't really use non ascii characters in partials and templates - i18n is the solution and will help localize the application when it goes global

...

Cezary_Baginski · April 25, 2010, 9:28pm

I disagree. There are lots of apps written for just one specific country without any intention of going global. Besides, one can have locale-specific view files, can't we? Having "to i18n" each and every string is a little bit too much. Of course, the folks in the US won't notice, you guys are well off while the rest of the world suffers from such a policy...

Forgive me for not making the context clear. There is no 'policy' here, just a suggested generator default behavior for users writing mainly US applications, possibly wishing to easily globalize their applications in the future. In *this* case specifically, my conclusions are:

- using utf-8 instead of us-ascii for encoding comments hides problems for those users

- people with no experience in encodings other than us-ascii will forget the encoding comments more often than not

- Ruby 1.9 chokes when trying to convert two non-us-ascii-compatible strings

- generators could create files with us-ascii by default to prevent the above

If that case does not describe your own, chances are you already know what you are doing and Rails gives you all the freedom you can get to adapt things to your own situation, choosing the right tool for the right job.

The reason for the proposed generator default is *exactly* to help people unaware of encoding problems to deliver applications that spare others the suffering and grief.

Paul9 · April 26, 2010, 8:30am

- user shouldn't really use non ascii characters in partials and templates - i18n is the solution and will help localize the application when it goes global

-1

if you know that a rails app will run only within one country within a controllable group (e.g. intranet apps) it does not make much sense adding the overhead of separate language files.

Just to clarify how important this issue is: Rails 2.3 claims to be Ruby 1.9 compatible, but until this is fixed, even the most trivial of applications simply don't work on 1.9, especially if the application is in a language that often uses non-ASCII characters (pretty much anything other than English, in other words). This has prevented me from moving to Ruby 1.9.

The m17n support in Ruby 1.9+ is a great concept. Unfortunately, balancing:

- correctness
- performance
- robustness in a production environment

quickly turns encoding problems into philosophical debates. Without a deep understanding of encoding internals it is too easy to "fix" things by just converting to UTF-8, hiding the real issues.

well - i "upgraded" our site running in germany to ruby1.9.1, unicorn and rails 2.3.6. even with utf-8 as a default i had to make various patches within rack to get it up and running.

rack: utils

   # Unescapes a URI escaped string. (Stolen from Camping).
   def unescape(s)
     result = s.tr('+', ' ').gsub(/((?:%[0-9a-fA-F]{2})+)/n){
       [$1.delete('%')].pack('H*')
     }
     RUBY_VERSION >= "1.9" ? result.force_encoding(Encoding::UTF_8) : result
   end
   module_function :unescape

found at lighthouse...

the next one is horrible - i know, but it works for now:

def parse_query(qs, d = nil)
  params = {}

  (qs || '').split(d ? /[#{d}] */n : DEFAULT_SEP).each do |p|
    k, v = p.split('=', 2).map { |x| unescape(x) }
    begin
      if v =~ /^("|')(.*)\1$/
        v = $2.gsub('\\'+$1, $1)
      end
    rescue
      v.force_encoding('ISO-8859-1')
      v.encode!('UTF-8',:invalid => :replace, :undef => :replace, :replace => '')
      if v =~ /^("|')(.*)\1$/
        v = $2.gsub('\\'+$1, $1)
      end
    end

(we use analytics at the site - analytics stores the last search query within a cookie. if a user browses google and finds the site with an umlaut query, this query will be stored within the cookie. parse_query is used by rack to parse cookies too. guess what - it will go boom if you use utf-8 as a default and get an incoming cookie with a different encoding...)
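The idea behind these hacks (decode the bytes, keep them if they are valid in the default encoding, otherwise assume a legacy encoding and transcode) can be written without rescuing exceptions. This is a sketch, not a Rack patch: `decode_param` is a hypothetical helper, and the ISO-8859-1 fallback is an assumption that fits a German site but is not a general rule. `URI.decode_www_form_component` is from Ruby's standard library.

```ruby
require 'uri'

# Hypothetical helper: percent-decode a parameter or cookie value, then
# settle on one encoding. Bytes that are valid in +internal+ are kept as
# they are; anything else is assumed to be +fallback+ and transcoded.
def decode_param(raw, internal = Encoding::UTF_8, fallback = Encoding::ISO_8859_1)
  bytes = URI.decode_www_form_component(raw, Encoding::BINARY)

  as_internal = bytes.dup.force_encoding(internal)
  return as_internal if as_internal.valid_encoding?

  bytes.force_encoding(fallback).
    encode(internal, :invalid => :replace, :undef => :replace, :replace => '')
end

p decode_param("%C3%BCber")   # percent-encoded UTF-8 bytes
p decode_param("%FCber")      # percent-encoded ISO-8859-1 byte
```

Checking `valid_encoding?` first avoids the byte-pattern regexes, at the price of misreading legacy input that happens to form valid UTF-8.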

the next ugly thing

def normalize_params(params, name, v = nil)
  if v and v =~ /^("|')(.*)\1$/
    v = $2.gsub('\\'+$1, $1)
  end
  name =~ %r(\A[\[\]]*([^\[\]]+)\]*)
  k = $1 || ''
  after = $' || ''

  return if k.empty?

  if after == ""
    params[k] = (RUBY_VERSION >= "1.9" && v.is_a?(String) ? v.force_encoding(Encoding::UTF_8) : v)
    # params[k] = v
  elsif after == "[]"
    params[k] ||= []
    raise TypeError, "expected Array (got #{params[k].class.name}) for param `#{k}'" unless params[k].is_a?(Array)
    params[k] << (RUBY_VERSION >= "1.9" && v.is_a?(String) ? v.force_encoding(Encoding::UTF_8) : v)
    # params[k] << v
  elsif after =~ %r(^\[\]\[([^\[\]]+)\]$) || after =~ %r(^\[\](.+)$)

all patches i found did not include the multipart solution ... this hack makes sure that multipart variables will be utf-8 forced too ...

Yes / i am glad and thank you that you made this overdue summary! i hope others will have a better start into the ruby1.9 rails 2.3 world than i had. In fact there were times i really wondered why someone dares to state that rails is 1.9 compatible for a real world (not real US) app!

Thanks a lot!


Cezary_Baginski · April 26, 2010, 10:55am

I didn't correctly state what I meant, and thank you for helping me realize that.

What I did mean was that users shouldn't assume non-ascii characters will always work correctly with Ruby 1.9, without specifying encoding comments or assuring specific, correct environment settings. So, let me rephrase myself:

Users should not be able to use non-ascii characters in a us-ascii environment without providing an alternative encoding comment or overriding the environment settings. If neither of these is acceptable, i18n is a suggestion. This behavior would be consistent with the way Ruby loads source files. The reason is that doing otherwise can give obscure, hard to track encoding problems that look like Rails bugs.

By supplying a _default_ "us-ascii" encoding comment in generated template files, we help people oblivious to encoding details to do the right thing or do the necessary research (i18n, change encoding comments, localized versions of pages, etc).

Encoding problems can be so frustrating, it is easy to perceive US developers as being ignorant. The truth is, it is unusual for them to even experience the problems or reproduce without effort, let alone research ways to test the issues effectively. This feature may slightly help with the latter.

Suggestion

Paul9 · April 26, 2010, 11:52am

i am very busy in writing new features for our project so i do not have the time/brainspace/brainpower now to think about clean solutions, but just one more - might help you too:

in my application_controller i added the very very very bad (i know - do not blame me - it's working) charset test for routes with special chars as i wanted to use paths with umlauts too. if a browser/search-bot defaults to ISO, requesting www.domain.com/über will obviously break things when using "über".force_encoding('utf-8') within rails...

REGEXP_ISO = Regexp.new('[^\xc3][\xe4\xf6\xfc\xc4\xd6\xdc\xdf]', nil, 'n')
REGEXP_MACROMAN = Regexp.new('[^\xc3][\x8a\x9a\x9f\x80\x85\x86\xa7]', nil, 'n')

  def check_params_encoding( key )
    unless params[key].blank?
      params[key].force_encoding('ASCII-8BIT')
      if params[key].match(REGEXP_ISO)
        params[key].force_encoding('ISO-8859-1')
        params[key].encode!('UTF-8',:invalid => :replace, :undef => :replace, :replace => '')
      elsif params[key].match(REGEXP_MACROMAN)
        params[key].force_encoding('macRoman')
        params[key].encode!('UTF-8',:invalid => :replace, :undef => :replace, :replace => '')
      end
    end
    params[key].force_encoding(Encoding::UTF_8)
  end
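For comparison, a validity-based variant of the same idea could look like the sketch below. This is not the code used above and not part of Rails or Rack: recode_to_utf8 is a hypothetical helper, and falling back to ISO-8859-1 is an assumption that only suits sites whose non-UTF-8 clients are mostly Latin-1.

```ruby
# Hypothetical helper: returns a UTF-8 copy of `bytes`.
# If the bytes are already valid UTF-8 they are only relabelled;
# otherwise they are treated as `fallback` (an assumption) and transcoded.
def recode_to_utf8(bytes, fallback = Encoding::ISO_8859_1)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  bytes.dup.force_encoding(fallback).encode(
    Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => ''
  )
end

latin1_bytes = [0xFC, 0x62, 0x65, 0x72].pack('C*')        # "über" as ISO-8859-1
utf8_bytes   = [0xC3, 0xBC, 0x62, 0x65, 0x72].pack('C*')  # "über" as UTF-8

from_latin1 = recode_to_utf8(latin1_bytes)
from_utf8   = recode_to_utf8(utf8_bytes)
```

Checking valid_encoding? first avoids hand-written byte-range regexps, but it cannot tell ISO-8859-1 from macRoman, so the regexp test above still covers a case this sketch does not.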

btw. i switched in one step from

ruby 1.8.7 => ruby 1.9
backgroundrb => delayed_job
ferret => sphinx
thin => unicorn
2.3.2 => 2.3.6 (memcached as frontend cache)

and i have to say (after blood, sweat and tears, and exceptions on the production servers leading to those quick hacks):

IT ROCKS !

no more aaf ferret issues - fast searches - slim job workers - painless fast restarts - no more (un)fair balancing !

good luck for your projects with rails and all the best on your 1.9 travel !

Paul

Paul9 · April 27, 2010, 11:41am

yesterday i found another situation where i got hit by the 1.9 encoding problems.

Believe it or not i've seen a case at our site where IE8 sends ISO-encoded uris after receiving the page incl. the link in UTF-8. I thought this is a safe one - but it is not!

Now i decided to add a very simple module force_recoding within lib (find it below / yes, could / should?! be a - rchardet like - native?! kernel method) and patch rack utils and rails - request.

btw. rchardet even in the 1.9 version of http://github.com/speedmax/rchardet did not work.

for now i would say - do not use rails with 1.9 outside the us unless you have fun debugging on production servers - and make sure that exception_notification works! - this last error prevented it from sending mails as erb got crazy while spitting the iso string into an utf-8 context ... i was informed by users ... smells like 1995 ...

and please rails core - write down the encoding problems within "Improved compatibility with Ruby 1.9" at Ruby on Rails — Ruby on Rails 2.3.5 Released and help newcomers get the right trail to rails!

Now that i am working with rails for about 3 years - i can say i have at least a bit of experience - a newcomer will never use rails again when facing this kind of hard to track down errors. (i.m.O. only segfaults could be worse!)

i patched rack/utils.rb: ---------------8< ------------

# -*- encoding: binary -*-

require 'set'
require 'tempfile'
+require 'force_recoding'

module Rack
  # Rack::Utils contains a grab-bag of useful methods for writing web
  # applications adopted from all kinds of Ruby libraries.

  module Utils
