JavaScript's encodeURIComponent works differently from CGI.escape or
ERB::Util.u.
Well, the difference is that the JavaScript side is producing UTF-16 and
the Ruby side UTF-8 (although the documentation I can find suggests that
the JavaScript should also be producing UTF-8).
For example:
encodeURIComponent('中文') = '%D6%D0%CE%C4'
>> CGI.escape("中文")
=> "%E4%B8%AD%E6%96%87"
>> ERB::Util.u("中文")
=> "%E4%B8%AD%E6%96%87"
Is there any way to get the same encoded result with Ruby code?
There are various libraries for working with string encodings,
including iconv. pack/unpack also have some specifiers that are
relevant for Unicode, and Rails itself has various Unicode
utilities in it.
Frederick Cheung wrote:
> Well, the difference is that the JavaScript side is producing UTF-16 and
> the Ruby side UTF-8 (although the documentation I can find suggests that
> the JavaScript should also be producing UTF-8).
Thank you for your reply. Maybe that is true. But how can the UTF-16
encodeURIComponent result be the shorter one?
Because for characters like these, UTF-16 needs two bytes each while
UTF-8 needs three, so the UTF-16 form is shorter.
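The size difference can be checked directly. The following is a minimal sketch, assuming Ruby 1.9 or later (for String#encode); the helper name encoded_sizes is made up for illustration and is not part of any library.

```ruby
# Hypothetical helper: byte counts of the same text in UTF-8 and UTF-16.
# UTF-16BE is used so that no byte order mark is counted.
def encoded_sizes(text)
  {
    utf8:  text.encode("UTF-8").bytesize,
    utf16: text.encode("UTF-16BE").bytesize
  }
end

# "\u4E2D\u6587" is "中文", written with escapes so the source file
# encoding does not matter.
sizes = encoded_sizes("\u4E2D\u6587")
puts "UTF-8:  #{sizes[:utf8]} bytes"
puts "UTF-16: #{sizes[:utf16]} bytes"
```

Each byte becomes three characters once percent-encoded, so the escaped UTF-16 form is correspondingly shorter than the escaped UTF-8 form.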
> There are various libraries for working with string encodings,
> including iconv. pack/unpack also have some specifiers that are
> relevant for Unicode, and Rails itself has various Unicode
> utilities in it.
I tried converting the string to UTF-16 before passing it to
CGI.escape(), but I had no luck producing the same result as
encodeURIComponent. (I got "%FE%FFN-e%87" from "中文".)
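The "%FE%FFN-e%87" result looks odd but can be explained: FE FF is the UTF-16 byte order mark, and the bytes 4E, 2D and 65 happen to be the ASCII characters "N", "-" and "e", which an escaper leaves alone. The sketch below reproduces this, assuming Ruby 1.9 or later. The helpers escape_bytes and escape_all_bytes are hypothetical stand-ins that only approximate what CGI.escape does to raw bytes; they are not the library's implementation.

```ruby
# Hypothetical helper: percent-encode each byte unless it is one of the
# ASCII characters a-z, A-Z, 0-9, "_", "." or "-".
def escape_bytes(bytes)
  unreserved = [*"a".."z", *"A".."Z", *"0".."9", "_", ".", "-"].map(&:ord)
  bytes.map { |b| unreserved.include?(b) ? b.chr : format("%%%02X", b) }.join
end

# Hypothetical helper: percent-encode every byte, with no exceptions.
def escape_all_bytes(bytes)
  bytes.map { |b| format("%%%02X", b) }.join
end

text = "\u4E2D\u6587"  # "中文"

# "UTF-16" in Ruby prepends a big-endian byte order mark (FE FF).
puts escape_bytes(text.encode("UTF-16").bytes)

# "UTF-16BE" has no byte order mark; encoding every byte shows the raw
# code units 4E2D and 6587.
puts escape_all_bytes(text.encode("UTF-16BE").bytes)
```

So the UTF-16 bytes were there all along; they were just partly hidden because some of them are printable ASCII.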
Frederick Cheung wrote:
> Those aren't playing with encodings which is apparently the issue
> here. Why does it matter anyway?
ok.
Here is the source code of ERB::Util.url_encode(s) method.
# File erb.rb, line 801
def url_encode(s)
  s.to_s.gsub(/[^a-zA-Z0-9_\-.]/n) { sprintf("%%%02X", $&.unpack("C")[0]) }
end
Now it works like this:
> ERB::Util.url_encode("中文")
> => "%E4%B8%AD%E6%96%87"
Can you help me change the url_encode code a bit so that it returns
the UTF-16 result? ('%D6%D0%CE%C4' is the one I want.)
Well, s.unpack("U*") will turn a string into an array of integers
(Unicode code points), which should then be easy to split into bytes.
I'd start from scratch rather than using url_encode, though.
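The unpack("U*") approach could look roughly like the sketch below. The method name utf16_percent_encode is made up for illustration. It assumes every character is in the Basic Multilingual Plane (code point at most 0xFFFF), so each one is a single 16-bit unit, and it encodes every character, including plain ASCII.

```ruby
# Hypothetical helper: percent-encode a UTF-8 string as big-endian
# UTF-16 code units, one "%HH%LL" pair per character.
def utf16_percent_encode(str)
  str.unpack("U*").map { |code_point|
    # Characters above 0xFFFF need surrogate pairs, which this sketch
    # deliberately does not handle.
    raise ArgumentError, "code point above 0xFFFF" if code_point > 0xFFFF
    high_byte = code_point >> 8
    low_byte  = code_point & 0xFF
    format("%%%02X%%%02X", high_byte, low_byte)
  }.join
end

puts utf16_percent_encode("\u4E2D\u6587")  # "中文"
```

For "中文" this yields the code units 4E2D and 6587, which is genuine UTF-16 but is not the '%D6%D0%CE%C4' string asked for.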
So it is a way of turning [20013, 25991] into '%D6%D0%CE%C4', right?
Well, 20013 is 0x4E2D, which is the UTF-16 for the first of your
characters. Looking back at what you wrote, I've no idea where D6D0 is
coming from: that's a completely different character according to the
Unicode character palette I have. Not sure what your JavaScript has
been doing.
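One possible explanation, not confirmed anywhere in this thread, is that D6 D0 CE C4 are the bytes of "中文" in the Chinese GBK (GB2312-compatible) encoding rather than in any Unicode encoding. The sketch below checks that, assuming Ruby 1.9 or later. The helper percent_encode_as is hypothetical, and how the JavaScript side ended up with GBK bytes remains an open question.

```ruby
# Hypothetical helper: transcode the text to the named encoding, then
# percent-encode every resulting byte.
def percent_encode_as(text, encoding_name)
  text.encode(encoding_name).bytes.map { |b| format("%%%02X", b) }.join
end

text = "\u4E2D\u6587"  # "中文"

puts percent_encode_as(text, "UTF-8")  # matches the CGI.escape output
puts percent_encode_as(text, "GBK")    # matches the string being asked for
```

If that is what is happening, the fix on the Ruby side would be to transcode to GBK before escaping, rather than to UTF-16.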