fix tidy_bytes for 1.9.x, improve performance

Hi all,

I was wondering if I could get some feedback on a patch I created for ActiveSupport's `tidy_bytes` method.

Right now `tidy_bytes` doesn't work on Ruby 1.9.x, because it relies on a Unicode regexp that raises an error for any string containing invalid UTF-8 bytes. You can see the essence of the problem by firing up any 1.9.x irb and running this:

    ruby-1.9.2-preview1 > "\x93".split(//u)
    ArgumentError: invalid byte sequence in UTF-8
            from (irb):2:in `split'
            from (irb):2
            from /Users/norman/.rvm/rubies/ruby-1.9.2-preview1/bin/irb:17:in `<main>'

This patch resolves the issue by traversing the string as bytes rather than codepoints, and is about twice as fast as the current implementation. Instead of the current implementation's regular expression, it looks at the position of the first 0 bit in each byte to classify the byte and determine its validity. This Wikipedia article was a useful reference while working on the patch:
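To illustrate the byte-level idea, here is a minimal sketch, not the code from the patch. The names `utf8_byte_kind`, `FOLLOWERS`, and `invalid_byte_offsets` are made up for this example, and the check is deliberately simplified: it only looks at leading bits and does not reject overlong encodings or surrogates.

```ruby
# Classify one byte by where its first 0 bit falls
# (that is, by how many leading 1 bits it has).
def utf8_byte_kind(byte)
  if    byte < 0x80 then :ascii         # 0xxxxxxx
  elsif byte < 0xC0 then :continuation  # 10xxxxxx
  elsif byte < 0xE0 then :lead2         # 110xxxxx
  elsif byte < 0xF0 then :lead3         # 1110xxxx
  elsif byte < 0xF8 then :lead4         # 11110xxx
  else                   :invalid       # 11111xxx is never used by UTF-8
  end
end

# Number of continuation bytes each kind of lead byte expects.
FOLLOWERS = { :lead2 => 1, :lead3 => 2, :lead4 => 3 }

# Return the offsets of bytes that cannot be part of a well-formed
# sequence, walking the string as bytes so no regexp is involved.
def invalid_byte_offsets(string)
  bytes = string.bytes.to_a
  bad = []
  i = 0
  while i < bytes.length
    kind = utf8_byte_kind(bytes[i])
    expected = FOLLOWERS[kind]
    if kind == :ascii
      i += 1
    elsif expected
      tail = bytes[i + 1, expected]
      if tail.length == expected && tail.all? { |b| utf8_byte_kind(b) == :continuation }
        i += expected + 1   # whole multi-byte sequence is well formed
      else
        bad << i            # lead byte without enough continuation bytes
        i += 1
      end
    else
      bad << i              # stray continuation byte or invalid byte
      i += 1
    end
  end
  bad
end
```

Calling `invalid_byte_offsets("a\x93b")` flags offset 1, since `0x93` is a continuation byte with no lead byte in front of it. A tidy-up routine could then replace only the flagged bytes and leave the rest of the string alone.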

It also adds a `force` option to allow cleanup of byte sequences that are valid as both CP-1252 / ISO-8859-1 and UTF-8. This is useful when the developer knows that the input is encoded in CP-1252 or ISO-8859-1 and wants to recode it to UTF-8. (Again, the presence of invalid characters prevents doing this simply by calling `#encode` or `#force_encoding` on 1.9.)
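As a rough illustration of what forcing means, the sketch below treats every byte as CP-1252 and transcodes it to UTF-8 one byte at a time. `recode_cp1252_bytes` is a hypothetical helper, not the patch's API, and replacing bytes that CP-1252 leaves unassigned with U+FFFD is an assumption made for this example.

```ruby
# Hypothetical helper: reinterpret each byte as CP-1252 and convert it
# to UTF-8, even when the bytes would also form valid UTF-8.
def recode_cp1252_bytes(string)
  string.bytes.map do |byte|
    char = [byte].pack("C")              # one-byte binary string
    char.force_encoding("Windows-1252")  # relabel only; the byte is unchanged
    # Bytes that CP-1252 leaves unassigned (such as 0x81) become U+FFFD.
    char.encode("UTF-8", :undef => :replace)
  end.join
end
```

With this, `recode_cp1252_bytes("\x93quoted\x94")` yields the text wrapped in curly quotes. Note that `"\xC3\xA9"`, which is already a valid UTF-8 "é", comes back as the two characters "Ã©", which is why forcing only makes sense when the input is known to be CP-1252 or ISO-8859-1.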

* The patch: http://gist.github.com/361115
* LH Ticket: #4350 tidy_bytes fails on 1.9.x

There is also a library where you can see this code in isolation:

http://github.com/norman/utf8_utils

Regards,

Norman

This is great. Thanks, Norman!

jeremy