Problem with GET args and UTF-8 encoding (output of Rack::Utils.unescape() ?)

Hi folks,

Here's my basic issue; hopefully this is clear. I'm trying to submit some UTF-8 values in my query string, but they are coming out mangled on the other end. It *seems* like the problem is that whatever Rack::Utils.unescape() returns gets converted to UTF-8 somewhere in the chain (I'm using Rails 3.0.7 and Ruby 1.9.2, by the way), and that the conversion mangles characters which take two bytes in UTF-8 (by contrast, "%20", which is a space and a one-byte character, gets converted fine). I feel like I've almost figured this out, but I'm still stumped. Here's my "evidence":

# Example UTF-8 string:

"Adélaïde de Hongrie"

# GET string (obviously URI encoded):

Started GET "/registers/results?filter[title]=Ad%E9la%EFde%20de%20Hongrie&search=&limit=4" for 127.0.0.1 at 2011-05-16 14:17:33 +0700

# What Rack produces/Rails sees (in Controller):

Parameters: {"filter"=>{"title"=>["Ad\xE9la\xEFde de Hongrie"]}, "search"=>"", "limit"=>"4"}

# Error I'm getting, when I try to "do stuff" with the above string:

ArgumentError (invalid byte sequence in UTF-8):

# What would actually be a valid string with hex UTF code points in the format above:

"Ad\xC3\xA9la\xC3\xAFde de Hongrie"

Or, in the "\u ..." format (see anything interesting here? Something obvious is eluding me...):

"Ad\u{E9}la\u{EF}de de Hongrie"
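
A minimal Ruby sketch of the comparison above, for anyone who wants to poke at the bytes directly. The constant and method names (RECEIVED, EXPECTED, hex_bytes) are made up for illustration and are not part of Rack or Rails. It only shows that the code point U+00E9 is stored as the two bytes C3 A9 in UTF-8, while the string from the log carries the single byte E9.

```ruby
# The bytes from the Parameters log line, labelled as UTF-8 (one byte per accented letter).
RECEIVED = "Ad\xE9la\xEFde de Hongrie".dup.force_encoding(Encoding::UTF_8)

# The same text written with code point escapes; Ruby stores this as UTF-8.
EXPECTED = "Ad\u{E9}la\u{EF}de de Hongrie"

# Hypothetical helper: render a string's raw bytes as space-separated hex.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

puts RECEIVED.valid_encoding?  # false: a lone E9 byte is not a complete UTF-8 sequence
puts EXPECTED.valid_encoding?  # true
puts hex_bytes("\u{E9}")       # C3 A9: code point E9 becomes two bytes in UTF-8
puts hex_bytes(EXPECTED)
```

The \u{E9} escape names a code point, whereas \xE9 names a raw byte. The two happen to share the number E9 because the first 256 Unicode code points match ISO-8859-1.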

To be clear, this is not a form, but an ajax query. I've tried adding the 'utf8' snowman thing manually too, but that doesn't seem to do anything...of course, maybe I'm doing that wrong.

Any thoughts/questions/pointing out of obvious errors or confused ways of thinking? I'd also appreciate any pointers to Rails documentation which describes in more detail how this stuff happens; I've just been digging through the code and it's slow going for me.

Help much appreciated!

Cheers, Dave

Okay, I'm still not there, but I've realized I've been confusing a few things. This Stack Overflow answer helped a lot:

I was conflating Unicode with UTF-8. But, I think that's also essentially what is happening somewhere in the process of ASCII-8BIT (output of Rack::Utils.unescape()) getting converted to UTF-8. I have to figure out how to override unescape() in my own initializer, I suppose, or intercept unescape()'s output and properly encode that.
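
For reference, intercepting the unescaped output and re-encoding it could look roughly like the sketch below. The method name to_utf8 is hypothetical, and the ISO-8859-1 fallback is an assumption about what the client sent, not something Rack or Rails does. It papers over the symptom on the server rather than fixing whatever produced the bytes.

```ruby
# Hypothetical helper: take raw (ASCII-8BIT) bytes and return a valid UTF-8 string.
# Assumption: bytes that are not valid UTF-8 were meant as ISO-8859-1.
def to_utf8(raw, fallback = Encoding::ISO_8859_1)
  candidate = raw.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Relabel the bytes with the fallback charset, then transcode them to UTF-8.
  raw.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

raw = "Ad\xE9la\xEFde de Hongrie".dup.force_encoding(Encoding::BINARY)
fixed = to_utf8(raw)
puts fixed.valid_encoding?  # true
puts fixed.bytesize         # 21: the two accented letters now take two bytes each
```

Note that force_encoding only changes the label on the bytes, while encode actually rewrites them. The mangling described above is what happens when only the label changes.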

I think I'm close to a solution, since I'm starting to understand what all the values should be and what is happening. But any help will still be greatly appreciated, since there is still something eluding my understanding.

Thanks, Dave

Who is producing this query string? They should be generating %C3%A9 if they are UTF-8 friendly, since %E9 is just URL-speak for \xE9, which smells like ISO-Latin-something.
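
To make the difference concrete, here is a small Ruby sketch that percent-encodes the same character from two charsets. percent_encode is a made-up helper name; URI.encode_www_form_component is from Ruby's standard library and is used here only as a stand-in for whatever the client does.

```ruby
require "uri"

# Hypothetical helper: transcode the text to the given charset, then percent-encode its bytes.
def percent_encode(text, charset)
  URI.encode_www_form_component(text.encode(charset))
end

puts percent_encode("\u00E9", Encoding::UTF_8)       # %C3%A9
puts percent_encode("\u00E9", Encoding::ISO_8859_1)  # %E9
puts percent_encode("Ad\u00E9la\u00EFde", Encoding::UTF_8)
```

Percent-encoding works on bytes, so the charset the client picks before escaping decides what shows up in the URL.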

Fred

Thanks for pointing out the obvious, Frederick (seriously, thank you). The problem was completely on the JavaScript/browser side; the function which prepared the query string was using escape() rather than encodeURIComponent(). I replaced all the calls to escape() and things started to magically work, how about that?
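
In case it helps anyone searching later: escape() emits %E9 for é, while encodeURIComponent() emits the UTF-8 bytes %C3%A9. The Ruby sketch below decodes both forms of the title to show why only the second one survives on the server. decode_param is a hypothetical name, and URI.decode_www_form_component (standard library) stands in for Rack::Utils.unescape here, which is an assumption about equivalent behaviour rather than a look at Rack's code.

```ruby
require "uri"

# Hypothetical stand-in for the server-side unescape step; the result is labelled UTF-8.
def decode_param(value)
  URI.decode_www_form_component(value)
end

from_escape = decode_param("Ad%E9la%EFde%20de%20Hongrie")
from_encode_uri_component = decode_param("Ad%C3%A9la%C3%AFde%20de%20Hongrie")

puts from_escape.valid_encoding?                # false: the bytes are not valid UTF-8
puts from_encode_uri_component.valid_encoding?  # true
puts from_encode_uri_component == "Ad\u00E9la\u00EFde de Hongrie"  # true
```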

Thank you, I really appreciate the help!! I can't believe how much time I spent looking in the wrong places...at least I learned a fair amount about Rails internals as well as encoding issues though...haha.

Cheers, Dave