Invalid byte sequence utf-8 OR best option to sanitize content brought in with net::http? single non-utf character causes rails to crash

hi all,

platform: debian lenny, ruby 1.9.1-p0, passenger/apache-multithread, rails 2.3 in vendor, postgres and sql server via odbc. all current gems.

i have legacy asp content on win2k servers that i wrap in rails controllers. this all worked great with ruby1.8, but now that we are dealing with encoded strings in ruby1.9, i am having page crashes randomly as users have cut and pasted high ascii code characters (e.g. ascii 150 - a fancy dash) that are ms only and non-standard.

normally, i just wouldn't have cared or even worried about it that much; however, in testing this a bit further after a few mysterious rails page crashes, i did more experimenting. i found that if i put the following in my asp page, it will cause the rails page to fail with "invalid byte sequence in utf-8" at ror/vendor/rails/activesupport/lib/active_support/core_ext/blank.rb:50

the offending asp code is:

<%= chr(150) %>

this is my own doing to reproduce the issue, but there are many non-standard windows characters that are not utf-8 compliant that probably riddle my sql server database because users like to cut and paste content from word and other places.

it turns out that because the content that i bring in via ruby net::http has non-utf8 characters, the encoding is set to ascii-8bit, and when i do force_encoding("utf-8"), valid_encoding? is false and the page just fails. html::sanitize isn't an option as i don't want to strip the tags. the content is from internal trusted servers that i control. i just need to sanitize, i guess, the bad characters.
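The failure described above can be reproduced outside Rails. The following is a minimal sketch, not the poster's code: it assumes the legacy pages really emit Windows-1252 (the thread never confirms the charset), and `windows1252_to_utf8` is a hypothetical helper name.

```ruby
# Treat raw bytes as Windows-1252 and transcode them to UTF-8.
# Assumption: the ASP pages really emit Windows-1252 (not confirmed in the thread).
def windows1252_to_utf8(bytes)
  bytes.dup.force_encoding("Windows-1252").encode("UTF-8")
end

# Byte 150 (0x96), as ASP's chr(150) would emit it, arriving as raw bytes.
body = "10 \x96 20".dup.force_encoding("ASCII-8BIT")

# Relabelling the bytes as UTF-8 does not make them valid UTF-8.
puts body.dup.force_encoding("UTF-8").valid_encoding?  # false

# Transcoding from the real source encoding does.
fixed = windows1252_to_utf8(body)
puts fixed.valid_encoding?                             # true
```

The difference is that force_encoding only changes the label on the bytes, while encode rewrites the bytes themselves.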

my thoughts/questions:

1) seems like rails should be less brittle about managing encoding such that blank? doesn't just fail when the valid_encoding is false. or you shouldn't be able to create a string if the encoding is bad. or it should make best efforts to transliterate the bad characters. something.

2) is iconv my best option? seems kind of nuts that i have to re-encode the entire html page for one character. this does work: using the //TRANSLIT//IGNORE options i get my pages, but i wonder at the overhead.

3) as usual, trying to make my ms iis5 servers do anything useful is a non-starter. sure it says it can generate utf-8, but trying it the (typically confused and poorly documented) 25 different ways to make it do so, results in nothing but more wasted time. so i need a good rails solution that "just works."

4) it occurs to me that it could also be that ruby is setting the default to ascii for net::http regardless of how iis is sending it. how do i check/set the Encoding.default_external in rails? why does rails remove the Encoding class? it isn't there in console, but is in irb. i dislike rails removing native ruby classes.
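On the overhead raised in question 2, one way to get a rough number is to time the conversion of a page-sized string. This is only a sketch: it uses String#encode rather than iconv, the page is a made-up stand-in, Windows-1252 as the source encoding is an assumption, and `timed_transcode` is a hypothetical helper name.

```ruby
require "benchmark"

# Transcode a whole page and return [converted_page, elapsed_seconds].
# Assumption: the source bytes are Windows-1252.
def timed_transcode(page)
  converted = nil
  seconds = Benchmark.realtime do
    converted = page.encode("UTF-8", "Windows-1252",
                            invalid: :replace, undef: :replace)
  end
  [converted, seconds]
end

# A made-up stand-in page of about 200 KB with one stray byte 150 per line.
page = ("<p>legacy content \x96 more text</p>\n" * 6000).force_encoding("ASCII-8BIT")
converted, seconds = timed_transcode(page)
puts "transcoded #{page.bytesize} bytes in #{(seconds * 1000).round(2)} ms"
```

The figure will vary by machine and Ruby build, so it is worth running against a real page before deciding whether the cost matters.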

please. i am so close to having ruby1.9/rails2.3 working, but this encoding stuff is really a hassle.

1) 1.9 is the wild wild west unfortunately, even more so in all this encoding mess, so right now it is your responsibility as a developer to transcode any external data to UTF-8 (or your encoding of choice). I have sent a GSOC proposal to resolve these problems and let rails handle them for you so that it will, well, "just work".

2) You can use the String#encode method supplied in ruby 1.9. It converts between the encodings supported by ruby, and it has options to ignore invalid characters or to replace them with a placeholder value.

    # encoding: utf-8
    pi = "pi = π "
    puts pi.encode("iso-8859-1", undef: :replace, replace: "??")

returns pi = ??
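The example above converts out of UTF-8 and uses undef:, which covers characters that have no mapping in the target encoding. For bytes arriving from a legacy server the direction is reversed, and invalid: is the companion option for byte sequences that are malformed in the source encoding. A minimal sketch, assuming the source is Windows-1252; `sanitize_to_utf8` is a hypothetical helper name.

```ruby
# Hypothetical helper: transcode raw bytes to UTF-8, replacing anything that
# cannot be converted instead of raising.
# Assumption: source_encoding is what the legacy server really sends.
def sanitize_to_utf8(bytes, source_encoding = "Windows-1252")
  bytes.encode("UTF-8", source_encoding,
               invalid: :replace, undef: :replace, replace: "?")
end

raw = "dash \x96 and stray \x81".dup.force_encoding("ASCII-8BIT")
clean = sanitize_to_utf8(raw)
puts clean.valid_encoding?   # true
```

Passing the source encoding as the second argument means the string's current label (ASCII-8BIT here) is ignored, so no separate force_encoding step is needed.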

4) What you really want is to set the internal_encoding. Once you have set the internal_encoding of your program, every IO is transparently transcoded from its external_encoding to your internal_encoding. I recommend you read this blog:

Rails doesn't remove the Encoding class; it is available in the console. I think your console is, for some reason, using ruby 1.8.
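A quick way to check which interpreter a console session is actually running, and what its encoding defaults are, is to paste something like the following into it. This is a sketch; `ruby_encoding_report` is a hypothetical helper name.

```ruby
# Report which interpreter is running and what its encoding defaults are.
def ruby_encoding_report
  report = { "ruby" => RUBY_VERSION }
  if defined?(Encoding)
    report["default_external"] = Encoding.default_external.to_s
    # default_internal is nil (shown as an empty string) unless it has been set
    report["default_internal"] = Encoding.default_internal.to_s
  else
    report["encoding"] = "Encoding is not defined; this looks like ruby 1.8"
  end
  report
end

ruby_encoding_report.each { |key, value| puts "#{key}: #{value}" }
```

If the Encoding constant is missing, the session is almost certainly running under ruby 1.8 rather than 1.9.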

thanks hector,

i think you are right about the console. i tried the old colon form of case statements, which ruby 1.9 no longer accepts, and console seemed fine with it. so i guess that even though i change the shebang in script/console to #!/usr/local/ruby1.9/bin/ruby, or even comment out the shebang, rename it script/console.rb and run it with my /usr/local ruby, i still get 1.8.

is there a way to set the ruby that console runs? this is one of those things that i think is pretty convoluted in rails. we should just have an external config file and set these things. all the calculated paths and other "convention" stuff works most of the time, but sometimes it just creates confusion. imho

regardless, any ideas about how to set the ruby version for console? ...gg

ok. found this patch to ../railties/lib/commands/console.rb

https://rails.lighthouseapp.com/attachments/93770/script-console-invoke-used-rubys-irb.diff

as script/console is just a wrapper for console.rb, that is the place to intervene. stock rails just ends up calling your default irb without bothering to see what version of ruby you are running.

this fixes my immediate problem so thanks. am going to grep RUBY_PLATFORM to see if that can just be set somewhere in rails as that seems to be referenced before searching for system location of irb.

...gg

hector,

further update:

i was able to set both my internal and external encoding thanks to hongli lai at phusion passenger. he helped me with a wrapper for my local ruby that uses the encoding option. not suggesting that this is his preferred method though, but you don't seem to be able to pass ruby options any other way that i'm aware of in passenger's apache config.

/usr/local/ruby1.9/bin/ruby_wrapper:

    #!/bin/bash
    exec /usr/local/ruby1.9/bin/ruby -E utf-8:utf-8 "$@"

then in apache2.conf: PassengerRuby /usr/local/ruby1.9/bin/ruby_wrapper

restart apache.

in a controller: raise "#{Encoding.default_external} #{Encoding.default_internal}"

results in: utf-8 utf-8

so all is good. for my app anyway. irb and script/console is a pain.

unfortunately, after all this, my asp pages still come in ascii encoded when brought in by net::http (after adding all the asp settings i can to convince it to use utf). also, more unfortunately, your assertion that with the default encodings set right (particularly default_internal, which i have now) ruby would silently and faultlessly convert my ascii page without error did not hold. no joy. got the same utf encoding error that i started with.
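For contrast, transcoding on read does happen for an IO that is opened in text mode with an external:internal encoding pair. The sketch below uses a temp file rather than Net::HTTP, assumes the bytes are Windows-1252, and `read_transcoded` is a hypothetical helper name.

```ruby
require "tempfile"

# Write raw bytes to a temp file, then read them back through an IO that
# declares Windows-1252 as its external and UTF-8 as its internal encoding.
# Assumption: the bytes are Windows-1252.
def read_transcoded(bytes)
  Tempfile.create("legacy") do |file|
    file.binmode
    file.write(bytes)
    file.flush
    File.open(file.path, "r:Windows-1252:UTF-8") { |f| f.read }
  end
end

text = read_transcoded("10 \x96 20".dup.force_encoding("ASCII-8BIT"))
puts text.encoding          # UTF-8
puts text.valid_encoding?   # true
```

A string that a library hands back already tagged ASCII-8BIT never passes through this conversion, which is consistent with the error persisting after the defaults were set.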

so...guess i am back to doing explicit encoding like you suggested or going back to iconv.

all in all i have to say that ruby1.9 and rails2.3 and encoding and irb and compiling your own ruby and... are still very rough.

...gg

Do you have a test case that I can use to reproduce the issue that you're seeing?

Thanks,

-Conrad

so i use lib/asp.rb module to get legacy asp content from internal win2k/iis5/asp (classic not .net) servers as a mixin and require it in my application_controller.rb as i have many asp pages. i do it this way because it gives me a smooth incremental upgrade path to rails from asp by replacing page for page as we write a better rails replacement. this way my routes are all rails and i just call asp_get_content when i have an asp page to wrap.

controller:

    def my_legacy_page
      asp_get_content
    end

lib/asp.rb:

    require 'net/http'

    module Asp
      # host, port, method, path, data and headers are not shown in this post
      def asp_get_content
        @asp_response = Net::HTTP.start(host, port) { |x|
          x.read_timeout = 1200
          x.send_request(method, path, data, headers)
        }

        # return false on redirects so we can use custom renders like so:
        #   render :foo => :bar if asp_get_content
        # while still allowing just asp_get_content without anything else
        # for standard stuff
        case @asp_response
        when Net::HTTPRedirection
          redirect_to "#{@asp_response['location']}"
          false
        else
          true
        end
      end
    end

view: <%= @asp_response.body %>

to reproduce the issue, just add

<%= chr(150) %> to the asp page. rails will choke with invalid byte sequence utf-8 as soon as the response.rb tries to parse @asp_response.body. see the above comments for the stack trace.

this is just my particular situation. i suspect any of the high, non-standard codes that windows likes (128-159) will do it. my test case is 150, which reliably reproduces the issue. my point is not about encoding per se; i just think that rails should be a bit more fault tolerant around encodings, as interop makes it almost a certainty that we will pull in content with bad encodings just as we pull in malformed html. we cope with the latter well but now need to do so with the former. imho.

thanks...gg

also, my particular case is with asp content; but i am sure that the problem can be reproduced with any web stack or even a static text file with these characters.

Sorry for the late response. I took a dive into the Net::HTTP code and I have some bad news.

It uses a BufferedIO over the socket of the connection, and when it reads from the socket it uses IO#sysread, which is the lowest-level read you can use in ruby. This method always returns an ASCII-8BIT string. So you have to transcode or force_encoding the responses from Net::HTTP explicitly.
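The sysread behaviour can be seen without a network connection. The sketch below uses a pipe in place of the socket; `read_raw` is a hypothetical helper name.

```ruby
# Push some text through a pipe and read it back with IO#sysread,
# standing in for the socket read inside Net::HTTP.
def read_raw(text)
  reader, writer = IO.pipe
  writer.write(text)
  writer.close
  reader.sysread(1024)
ensure
  reader.close if reader
end

chunk = read_raw("caf\u00e9")                  # written as UTF-8 bytes
puts chunk.encoding == Encoding::BINARY        # true: tagged ASCII-8BIT

# When the bytes are known to be UTF-8, relabelling is enough.
labelled = chunk.dup.force_encoding("UTF-8")
puts labelled == "caf\u00e9"                   # true
```

Relabelling with force_encoding is only correct when the bytes already are in the named encoding; anything else needs a real conversion with String#encode.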

Hector