What's a good solution for fixing character encoding problems for
compatibility between ASCII and UTF-8? The database is PostgreSQL and
is encoded in UTF-8.
Once in a while there will be a compatibility error from strings
submitted through a web form.
Is there a command to fix this besides using
a_string.force_encoding('utf-8')? Even this doesn't always
work.
I ran into a similar situation a while ago for a webservice app I was
working on where I had to handle a lot of bad / untrusted non-UTF-8
data, and found a fix that met the needs of the app using Iconv
(http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/index.html)
following a strategy outlined by Paul Battley
(http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/):
...
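def AppUtil.force_utf8(str)
  # //IGNORE tells Iconv to drop byte sequences that are not valid UTF-8
  ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
  # Pad with a space and strip it again (the trick from the linked post),
  # so that an invalid byte at the very end of the string is also handled
  return ic.iconv("#{str} ")[0..-2]
end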
...
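For anyone reading this on a newer Ruby: Iconv was removed from the standard library in Ruby 2.0 (it lives on as a gem), and String#scrub was added in Ruby 2.1. A rough equivalent of the helper above might look like the sketch below. The method name is just an illustration, and it assumes that silently dropping bad bytes is acceptable for your data.

```ruby
# Hypothetical helper, not from the original post: remove bytes that are
# not valid UTF-8. Assumes Ruby 2.1+ (String#scrub).
def force_utf8(str)
  # dup so the caller's string (possibly frozen) is left untouched,
  # then relabel the raw bytes as UTF-8 without converting them
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  # scrub("") deletes invalid byte sequences instead of substituting U+FFFD
  utf8.scrub("")
end

force_utf8("abc\xFFdef".b)  # => "abcdef"
```

Like the Iconv version, this only discards what cannot be decoded. It will not turn Windows-1252 bytes into the characters the user meant.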
Thanks for your response. I tried this on a string that was causing
the error and it didn't work. The problem is with Microsoft Word
special characters. I can't find a way to replace these characters.
Here is one website I found that describes the special characters,
"The Fruits of my Labour", although it's not about Rails.
I'm using Rails on a Microsoft platform, so I can't rely on iconv. I had a lot of problems with encoding, and I finally solved them with the attached script.
You probably need to figure out the actual encoding and explicitly convert from that to UTF-8. This is a snippet of code that I have in a real project:
open(DATAFEED_URI) do |file|
  local_filename = local_path
  local_filename.open('w') do |outf|
    file.each do |line|
      begin
        outf.write Iconv.conv('UTF-8//TRANSLIT//IGNORE', 'WINDOWS-1252', line)
      rescue Iconv::IllegalSequence => e
        shlogger.error { "#{DATAFEED_URI} line #{file.lineno} could not be translated:\n#{line}" }
      end
    end
  end
  local_filename.open('r') {|opened| yield opened }
end
The part that you're going to be interested in is the line that calls Iconv and, in particular, the second argument of 'WINDOWS-1252', which is likely the encoding of your data. There are also a couple of aliases for that code page:
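
On Ruby 1.9 and later, the same explicit conversion can be sketched with String#encode instead of Iconv. This is only an illustration: the helper name is made up, and it assumes the incoming bytes really are Windows-1252. If they are not, the output will be valid UTF-8 but the wrong characters.

```ruby
# Hypothetical helper: treat the raw bytes as Windows-1252 and convert
# them to UTF-8. Assumes Ruby 1.9+ (String#force_encoding / String#encode).
def cp1252_to_utf8(bytes)
  bytes.dup
       .force_encoding('Windows-1252')  # relabel only, no bytes change here
       .encode('UTF-8',
               invalid: :replace,       # malformed input becomes "?"
               undef:   :replace,       # bytes with no mapping become "?"
               replace: '?')
end

# 0x93 and 0x94 are the curly double quotes Word likes to insert
cp1252_to_utf8("\x93Hi\x94".b)  # => "\u201CHi\u201D"
```

The difference from force_encoding('utf-8') alone is that the bytes are actually transcoded, so Word's curly quotes arrive as real UTF-8 characters rather than as invalid byte sequences.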
I personally haven't had to deal with encoding issues yet, but I remember reading a couple of posts from Yehuda Katz (of Merb fame and a core contributor to Rails) on that.
Maybe these can help you identify and fix your problem:
The articles are a little long, but if you already know a good deal about encodings, you can skip towards the end of the posts, where he writes about how to deal with conversions.
Thank you everyone for your responses. They helped me figure out
a solution. This seems to work for my problem:
s = s.gsub("\xe2\x80\x9c", '"')
s = s.gsub("\xe2\x80\x9d", '"')
s = s.gsub("\xe2\x80\x98", "'")
s = s.gsub("\xe2\x80\x99", "'")
s = s.gsub("\xe2\x80\x93", "-")
s = s.gsub("\xe2\x80\x94", "--")
s = s.gsub("\xe2\x80\xa6", "...")
s = Iconv.conv('UTF-8//IGNORE', 'UTF-8', s)
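The chain of gsub calls above could also be written as one table-driven pass. This is a sketch rather than a drop-in replacement: the constant and method names are invented, it covers only the same seven characters, and it assumes the string is already valid UTF-8 (gsub with a regexp raises on invalid bytes) and Ruby 1.9+ for gsub with a Hash.

```ruby
# Hypothetical lookup table: the same seven "smart" characters as above,
# written as code points instead of raw bytes.
SMART_PUNCTUATION = {
  "\u201C" => '"',    # left double quote   (\xe2\x80\x9c)
  "\u201D" => '"',    # right double quote  (\xe2\x80\x9d)
  "\u2018" => "'",    # left single quote   (\xe2\x80\x98)
  "\u2019" => "'",    # right single quote  (\xe2\x80\x99)
  "\u2013" => "-",    # en dash             (\xe2\x80\x93)
  "\u2014" => "--",   # em dash             (\xe2\x80\x94)
  "\u2026" => "..."   # ellipsis            (\xe2\x80\xa6)
}.freeze

# One regexp that matches any key of the table
SMART_PUNCTUATION_RE = Regexp.union(SMART_PUNCTUATION.keys)

def asciify_punctuation(str)
  # gsub with a Hash looks each match up in the table
  str.gsub(SMART_PUNCTUATION_RE, SMART_PUNCTUATION)
end

asciify_punctuation("\u201CHi\u201D \u2014 it\u2019s\u2026")  # => "\"Hi\" -- it's..."
```

Keeping the mapping in one hash makes it easier to add further characters later without another gsub line.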