fix tidy_bytes for 1.9.x, improve performance

Norman_Clarke1 · April 9, 2010, 1:05pm

Hi all,

I was wondering if I could get some feedback on a patch I created for ActiveSupport's `tidy_bytes` method.

Right now `tidy_bytes` doesn't work with 1.9.x, since it relies on a Unicode regexp that always fails for strings with invalid UTF-8 characters. You can see the essence of the problem easily by firing up any 1.9.x irb and doing this:

    ruby-1.9.2-preview1 > "\x93".split(//u)
    ArgumentError: invalid byte sequence in UTF-8
            from (irb):2:in `split'
            from (irb):2
            from /Users/norman/.rvm/rubies/ruby-1.9.2-preview1/bin/irb:17:in `<main>'

This patch resolves the issue by traversing the string as bytes rather than codepoints, and is about twice as fast as the current implementation. Rather than using the current implementation's regular expression, it looks at the position of each byte's first 0 bit to tell what kind of byte it is (ASCII, leading, or continuation) and so whether it is valid where it appears. A Wikipedia article was a useful reference while working on the patch.
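For readers who want to see the byte-level idea in miniature, here is a rough sketch. It is not the code from the patch: the names `byte_class`, `recode_byte` and `tidy_sketch` are made up for illustration, the check is purely structural (it does not reject overlong forms or surrogates), and it assumes stray bytes should be read as CP-1252.

```ruby
# Illustrative sketch only; all method names here are hypothetical.

# Classify a byte by the position of its first 0 bit.
def byte_class(byte)
  if    byte < 0x80 then :ascii         # 0xxxxxxx
  elsif byte < 0xC0 then :continuation  # 10xxxxxx
  elsif byte < 0xE0 then :lead2         # 110xxxxx
  elsif byte < 0xF0 then :lead3         # 1110xxxx
  elsif byte < 0xF8 then :lead4         # 11110xxx
  else                   :invalid       # 11111xxx never starts a UTF-8 sequence
  end
end

# Number of continuation bytes each leading byte expects.
TRAILING = { lead2: 1, lead3: 2, lead4: 3 }.freeze

# Assumption: a stray byte is CP-1252. Returns its UTF-8 bytes.
def recode_byte(byte)
  [byte].pack("C").force_encoding("Windows-1252")
        .encode("UTF-8", invalid: :replace, undef: :replace)
        .bytes.to_a
end

def tidy_sketch(string)
  bytes = string.bytes.to_a
  out = []
  i = 0
  while i < bytes.length
    byte = bytes[i]
    n    = TRAILING.fetch(byte_class(byte), 0)
    tail = bytes[i + 1, n]
    if byte < 0x80
      out << byte                        # plain ASCII, copy through
      i += 1
    elsif n > 0 && tail.length == n &&
          tail.all? { |b| byte_class(b) == :continuation }
      out.concat(bytes[i, n + 1])        # structurally valid sequence, keep it
      i += n + 1
    else
      out.concat(recode_byte(byte))      # stray byte, reinterpret it
      i += 1
    end
  end
  out.pack("C*").force_encoding("UTF-8")
end

tidy_sketch("\x93hello\x94")  # => "\u201Chello\u201D" (curly quotes)
tidy_sketch("caf\u00E9")      # already valid UTF-8, returned unchanged
```

Because nothing here runs a Unicode regexp over the input, invalid bytes never raise; they are only ever handled as integers.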

It also adds a `force` option to allow cleanup of byte sequences that are both valid CP-1252 / ISO-8859-1 and UTF-8. This can be used when the developer knows that their input is encoded in CP-1252 or ISO-8859-1 and wants to recode it to UTF-8. (Again, the presence of invalid characters will prevent doing this by simply using #encode or #force_encoding on 1.9.)
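To illustrate what forcing means in practice, here is a small sketch with a hypothetical `force_recode_sketch` helper (again, not the patch's code). It assumes the input is known to be CP-1252, so every byte is recoded on its own, even where a run of bytes happens to be valid UTF-8.

```ruby
# Illustrative sketch only; the helper names are hypothetical.

# Assumption: the byte is CP-1252. Returns the matching Unicode code point.
def cp1252_codepoint(byte)
  [byte].pack("C").force_encoding("Windows-1252")
        .encode("UTF-8", invalid: :replace, undef: :replace)
        .ord
end

# Recode every byte individually, ignoring any UTF-8 structure in the input.
def force_recode_sketch(string)
  codepoints = string.bytes.map { |b| b < 0x80 ? b : cp1252_codepoint(b) }
  codepoints.pack("U*")  # pack("U*") builds a UTF-8 string
end

# "\xC3\xA9" is valid UTF-8 for "é", but read as CP-1252 it is "Ã©".
force_recode_sketch("\xC3\xA9")  # => "\u00C3\u00A9"
```

Without forcing, a tidy pass would leave `"\xC3\xA9"` alone because it is already well-formed UTF-8, which is why this has to be an explicit opt-in.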

* The patch: http://gist.github.com/361115
* LH Ticket: #4350 tidy_bytes fails on 1.9.x

Here is also a library where you can see this code in isolation:

http://github.com/norman/utf8_utils

Regards,

Norman

bitsweat · April 9, 2010, 4:47pm

This is great. Thanks, Norman!

jeremy
