h() doesn't have any parameter for encoding being used?

It seems that h() (html_escape()) has no parameter to indicate the character encoding being used?

PHP's htmlspecialchars() function, by comparison, accepts about a dozen possible encodings, such as UTF-8, Chinese Big5, Chinese GB, Russian, and Japanese.

I think, though, that h() will work for UTF-8, since h() only touches the 4 special characters

  < > & "

and replaces them with &lt; etc. Those 4 characters are all in the 0x00 to 0x7F range, and h() leaves every other byte unchanged. A character in UTF-8 takes 1 to 4 bytes. Any ASCII character is represented as a single byte, which is its own value in 0x00 to 0x7F. Every other Unicode character (U+0080 and above) takes 2 to 4 bytes, and every one of those bytes is in the 0x80 to 0xFF range (see UTF-8 - Wikipedia). So when h() sees one of those 4 characters as a 1-byte character, it replaces it, and it passes over every byte of a multi-byte character, because those bytes are in the 0x80 to 0xFF range and can never match one of the 4 special characters. The replacement is therefore done with no side effect on the non-ASCII characters.
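To make that argument concrete, here is a minimal sketch that escapes a string one byte at a time, with no knowledge of the encoding. The names ESCAPES and escape_bytes are made up for this illustration (they are not part of ERb or Rails), and the sketch assumes a reasonably recent Ruby (1.9.3 or later).

```ruby
# Hypothetical table for this illustration: byte values of the 4 special
# characters, all below 0x80.
ESCAPES = { 0x3C => "&lt;", 0x3E => "&gt;", 0x26 => "&amp;", 0x22 => "&quot;" }

# Hypothetical helper: walks the raw bytes and replaces only the 4 special
# ones. Every other byte is copied through untouched.
def escape_bytes(s)
  out = []
  s.each_byte do |b|
    if ESCAPES.key?(b)
      out.concat(ESCAPES[b].bytes)
    else
      out << b
    end
  end
  out.pack("C*").force_encoding(s.encoding)
end

text = "<b>\u65E5\u672C\u8A9E & \"caf\u00E9\"</b>"
puts escape_bytes(text)

# Every byte of a non-ASCII character is 0x80 or above, so none of them
# can collide with the table above.
puts "\u65E5\u672C\u8A9E".bytes.all? { |b| b >= 0x80 }
```

Because the multi-byte sequences are copied byte for byte, the result is still valid UTF-8.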

I don't think Rails supports UTF8 yet... but I could be wrong.

The default charset for action renderings is UTF-8 since Rails 1.2.


Ruby 1.8 has a global idea of character encoding, which is configured in the $KCODE global variable.

Rails 1.2 and above by default set $KCODE to a value that means everything is UTF-8: source code, strings, regexps, etc. Rails also sets an HTTP header that tells the client the (X)HTML is UTF-8. Thus, the client sends form data back in UTF-8 as well, and everything works transparently.

When you do I/O you are responsible for knowing the encoding of incoming data, and the expected encoding of outgoing data. You use iconv if needed to guarantee them. Any I/O operation has to be in control of the involved character encodings.
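As a rough sketch of such a conversion at an I/O boundary: the snippet below assumes a newer Ruby (1.9 or later), where String#encode does the job that the iconv library does on Ruby 1.8. The Latin-1 input is a made-up example.

```ruby
# Hypothetical input: the bytes of "café" as they would arrive from an
# ISO-8859-1 source. We have to know that encoding; Ruby cannot guess it.
latin1 = "caf\xE9".dup.force_encoding("ISO-8859-1")

# Convert to UTF-8 before handing the string to the rest of the application.
# (On Ruby 1.8 the equivalent would be Iconv.conv("UTF-8", "ISO-8859-1", s).)
utf8 = latin1.encode("UTF-8")

puts utf8.bytes.inspect   # the single byte 0xE9 becomes the pair 0xC3 0xA9
```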

Some stuff in Ruby 1.8 does not play well with UTF-8. For example, you cannot compute the length of a string with String#length, because that method counts bytes. But some other stuff does work, like pattern matching. For example, "." really matches a character, which may be more than one byte in UTF-8, as you point out.
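The difference between counting bytes and matching characters can be seen in a short sketch. It is written for a newer Ruby (1.9 or later), where String#bytesize returns the byte count that String#length returns on 1.8; the /u flag marks the regexp as UTF-8.

```ruby
s = "caf\u00E9"   # "café": 4 characters, but the last one takes 2 bytes

puts s.bytesize          # 5 (what String#length reports on Ruby 1.8)
puts s.scan(/./u).size   # 4 ("." matches whole characters, not bytes)
```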

So, if you are using regexps you are safe in that regard. The helper #h is really an alias of the ERb method #html_escape (it is not a Rails helper), and that method is implemented using regexps:

   def html_escape(s)
     s.to_s.gsub(/&/, "&amp;").gsub(/\"/, "&quot;").gsub(/>/, "&gt;").gsub(/</, "&lt;")
   end

Hence, it works correctly in UTF-8.
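A quick way to check this is to run the method quoted above on a UTF-8 string. The sketch below defines a local copy of it, so the result does not depend on whichever ERb version is installed (newer versions may escape additional characters); the sample input is made up.

```ruby
# Local copy of the method as quoted above.
def html_escape(s)
  s.to_s.gsub(/&/, "&amp;").gsub(/\"/, "&quot;").gsub(/>/, "&gt;").gsub(/</, "&lt;")
end

# Sample input mixing the 4 special characters with 2-byte and 3-byte
# UTF-8 characters ("Grüße" and "你好").
input = "<p>Gr\u00FC\u00DFe & \"\u4F60\u597D\"</p>"
escaped = html_escape(input)

puts escaped
# prints: &lt;p&gt;Grüße &amp; &quot;你好&quot;&lt;/p&gt;
```

Note that "&" is replaced first, so the "&" in the entities added by the later gsub calls is not escaped a second time.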