Microsoft "stupid quotes" in params

I'm looking for feedback on this, read on, please...

I'm just fed up with Microsoft's "stupid quotes" feature (and for the sake of later searchers, I'll add that it's usually known as "smart quotes", although as with anything Microsoft you're safe to substitute the word "stupid" anywhere they use the word "smart").

I just completed a nice application, and suddenly an external piece failed. It first uses xmlrpc to grab some data from the database and stick it in a yaml file. A couple of other programs read the yaml file and create various other files. Those programs were choking because they couldn't read the entire yaml file.

It turns out that the problem was with people using stupid quotes. Here's the sledgehammer that I applied in app/controllers/application.rb:

  before_filter :fix_stupid_quotes_in_params

  def fix_stupid_quotes_in_params
    dig_deep(@params) { |s| fix_stupid_quotes!(s) }
  end

  def dig_deep(hash, &block)
    if hash.instance_of? String
      yield(hash)
    elsif hash.kind_of? Hash
      # Pass the caller's block down so nested strings actually get fixed.
      hash.each_key { |h| dig_deep(hash[h], &block) }
    else
      nil
    end
  end

  def fix_stupid_quotes!(s)
    s.gsub!(/\x82/,',')
    s.gsub!(/\x84/,',')
    s.gsub!(/\x85/,'...')
    s.gsub!(/\x88/,'^')
    s.gsub!(/\x89/,'o/oo')
    s.gsub!(/\x8b/,'<')
    s.gsub!(/\x8c/,'OE')
    s.gsub!(/\x91|\x92/,"'")
    s.gsub!(/\x93|\x94/,'"')
    s.gsub!(/\x95/,'*')
    s.gsub!(/\x96/,'-')
    s.gsub!(/\x97/,'--')
    s.gsub!(/\x98/,'~')
    s.gsub!(/\x99/,'TM')
    s.gsub!(/\x9b/,'>')
    s.gsub!(/\x9c/,'oe')
  end
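For comparison, here is a rough sketch of the same substitutions done in one pass from a lookup table. It is not the code from the post: it assumes Ruby 2.0 or later and strings that hold raw Windows-1252 bytes, it returns a cleaned copy instead of mutating params, and the names CP1252_TO_ASCII, asciify_cp1252 and asciify_params are made up for illustration.

```ruby
# Sketch only: a one-pass, table-driven take on the filter above.
# Assumes Ruby 2.0+ and input strings holding raw Windows-1252 bytes.
# CP1252_TO_ASCII, asciify_cp1252 and asciify_params are hypothetical names.
CP1252_TO_ASCII = {
  0x82 => ",",    0x84 => ",",  0x85 => "...", 0x88 => "^",
  0x89 => "o/oo", 0x8b => "<",  0x8c => "OE",  0x91 => "'",
  0x92 => "'",    0x93 => '"',  0x94 => '"',   0x95 => "*",
  0x96 => "-",    0x97 => "--", 0x98 => "~",   0x99 => "TM",
  0x9b => ">",    0x9c => "oe"
}.freeze

# Returns a binary-encoded copy with each mapped byte replaced.
# Bytes in 0x80-0x9f that have no table entry pass through unchanged.
def asciify_cp1252(str)
  str.b.gsub(/[\x80-\x9f]/n) { |byte| CP1252_TO_ASCII.fetch(byte.ord, byte) }
end

# Walks nested hashes and returns a cleaned copy; other values are left alone.
def asciify_params(value)
  case value
  when String then asciify_cp1252(value)
  when Hash   then value.each_with_object({}) { |(k, v), out| out[k] = asciify_params(v) }
  else value
  end
end

params = { "post" => { "title" => "\x93Hello\x94 \x96 it\x92s here\x85".b } }
clean = asciify_params(params)
puts clean["post"]["title"]
```

Working on bytes rather than characters sidesteps the question of what encoding the string claims to have, which seems to be the spirit of the original filter.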

If this is a bad idea, I'll have to implement it on one particular page. The fact is, though, that these bytes are never valid printable characters (in Latin-1 they fall in the control range, and on their own they aren't legal UTF-8), so I see no reason to allow them through ever. I hate modifying the params, but again, these are just not valid characters. I don't want to have to think about it in each model or controller.
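To make the "not valid" point concrete, here is a small check, assuming Ruby 2.0 or later and a made-up sample string. It also shows a different technique from the filter above: if the input can be assumed to be Windows-1252, the bytes can be transcoded to UTF-8 rather than flattened to ASCII.

```ruby
# Sketch only; assumes Ruby 2.0+. The sample bytes are invented for illustration.
raw = "\x93quoted\x94".b   # bytes as a Windows client might submit them

# Read as UTF-8, the lone 0x93 and 0x94 bytes are malformed.
valid_as_utf8 = raw.dup.force_encoding("UTF-8").valid_encoding?
puts valid_as_utf8

# Read as Windows-1252, the same bytes are ordinary curly quotes,
# so they can be transcoded to UTF-8 without being thrown away.
utf8 = raw.dup.force_encoding("Windows-1252").encode("UTF-8")
puts utf8.valid_encoding?
puts utf8 == "\u201Cquoted\u201D"
```

Whether transcoding or replacing is the better choice depends on what the downstream yaml readers expect, which the thread does not say.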

This is a sledgehammer approach, as it will always walk through params on every page and fix the stupid quotes characters. I'm looking for any thoughts, suggestions, comments, etc. on the above code.

Thanks, Michael

IIRC, these are double-byte characters. The problem is not so much in
using them, but in interpreting them. For example, a user types stuff into
Word, copies, then pastes into a textarea. Hits send. The application
obediently stores it in the database. The application displays the data and
the s(mart|tupid) quotes are in place correctly. Then the programmer gets a
wild idea -- like restoring the database from a backup. Because the backup
is a text file, the DBCs are misinterpreted as they are imported into the
database. Result: improper display of these characters.
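One way such a misreading can happen is sketched below, assuming Ruby 1.9 or later, a dump written as UTF-8, and an import step that treats it as Windows-1252. Those encodings are assumptions for the sake of the example; the reply does not say which ones were involved.

```ruby
# Sketch only: reproduces one kind of import-time misinterpretation.
stored = "it\u2019s"   # curly apostrophe; UTF-8 bytes 69 74 E2 80 99 73

# The import reads the UTF-8 bytes as if each one were a Windows-1252 character.
misread = stored.dup.force_encoding("Windows-1252").encode("UTF-8")
puts misread           # the apostrophe now shows up as three junk characters

# Undoing exactly that wrong step recovers the original text.
repaired = misread.encode("Windows-1252").force_encoding("UTF-8")
puts repaired == stored
```

The repair only works when the wrong step is known and no bytes were dropped along the way, so it is a diagnostic aid more than a general fix.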

If you come up with a solution that works for these cute characters
that (whatever you call them) everyone has in their word processing
documents, let us all know. Here are some references that sort of work:

demoronizer (Perl script)

I can't attribute this second one, but it's a shell script. I tried
it on a database dump and it left me with less cleanup work -- maybe
it will provide some clues for you:

#!/bin/sh
this_directory=`pwd`
for x
do
  echo -n "converting $x: "
  if test "$x" =; then
    echo "not editing script itself!"
  elif [ -d $x ]; then
    (cp $x; cd $x; sh *; rm -f cd .. )
  elif test -s $x; then
    iconv --from-code=euc-kr --to-code=UTF-8 < $x > $this_directory/$x$$ ;
    if [ $? == 0 ]
    then
      cp $this_directory/$x$$ $x
      rm -f $this_directory/$x$$
    else
      echo -n "ICONVE ERROr "
      rm -f $this_directory/$x$$
    fi
    echo "done";
  else
    echo "original file is empty"
  fi
done
echo "all done"
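The core of that script (convert a file, and only overwrite it if the conversion succeeded) could be sketched in Ruby roughly as follows. This assumes Ruby 1.9.3 or later; convert_file_to_utf8 is a hypothetical helper, and the source encoding is passed in rather than hard-coded to euc-kr as in the script.

```ruby
require "tempfile"

# Sketch only. Converts one file to UTF-8 in place.
# Returns true on success. Returns false, leaving the file untouched,
# when the file is empty or its bytes cannot be converted.
def convert_file_to_utf8(path, from_encoding)
  text = File.binread(path)
  return false if text.empty?
  converted = text.force_encoding(from_encoding).encode("UTF-8")
  File.binwrite(path, converted)   # reached only if the conversion succeeded
  true
rescue EncodingError
  false
end

# Usage with a throwaway file holding Windows-1252 curly quotes.
sample = Tempfile.new("dump")
sample.binmode
sample.write("\x93hi\x94".b)
sample.flush
ok = convert_file_to_utf8(sample.path, "Windows-1252")
contents = File.read(sample.path, encoding: "UTF-8")
sample.close!
puts ok
puts contents
```

As with the shell version, the result is only as good as the guess about the source encoding.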