The dreaded Unicode issue

Hi all,

I’m trying to store the following in a mysql DB but am having no luck:

  1. Pär
  2. Björn
  3. gösta (hopefully the encodings came out correct in this message)

I’ve looked into Ruby’s Unicode library and have tried things such as:

Unicode.normalize_D(nickname) Unicode.normalize_KD(nickname) Unicode.normalize_C(nickname) Unicode.normalize_KC(nickname)

All those methods either seem to strip out the European characters or replace them with their English equivalents. All I want to do is store the text in its original form.

If I try to store the string as it currently is, I get question marks showing up in all my views where the European characters are.

Does anyone have any ideas on how to accomplish such a thing?

Thanks!

-Dave Hoefler

I'm trying to store the following in a mysql DB but am having no luck:

1. Pär 2. Björn 3. gösta

<...>

If I try to store the string as it currently is, I get question marks showing up in all my views where the European characters are. Does anyone have any ideas on how to accomplish such a thing?

It may well be, that your characters are ok in DB, but there is different charset specified in your HTTP headers. Check what headers server is sending (you can do it with the help of LiveHTTPHeaders or Firebug extensions for Firefox). Or the simplest test - check out, what menu item is checked on View-Character Encoding menu in your browser. If UTF is not checked, try to check it and see if characters are OK. If yes, that means you have to fix HTTP headers. If no, then problems may indeed be in the database.

Regards, Rimantas

Thanks Rimantas,

I checked out the headers and they are sending: Accept-Charset: ISO-8859-1,utf-8;q=0.7,*,q=0.7

My database is using ISO Latin 1.

I failed to mention before that I’m loading a ruby script of users in a text file and executing it from the terminal.

If I were to enter Unicode characters through a form in my Rails view then everything works fine. “Bjorn” is properly saved as “Bj\303\266rn”. It’s just loading the text file and looping through the records that seems to be the problem.

The important parts of the script:

$KCODE = ‘u’ require ‘jcode’ require ‘unicode’

arr = IO.readlines(member_file) # member_file is actually a text file member = Member.new (nickname) member.save

That obviously happens within a loop. I just wanted to uber-simplify it.

Still kind of confused…

-Dave

You might want to change that to UTF-8. Your accented unicode characters (western european) use a single byte under latin-1 and 2 under utf-8.

Every link in the unicode chain (the round-trip) has to be fixed.

I checked out the headers and they are sending: Accept-Charset: ISO-8859-1,utf-8;q=0.7,*,q=0.7

These are sent by browser, how does server response look like? Look for the 'Content-type' header.

My database is using ISO Latin 1.

<...>

This may or may not be the problem. It is possible (but not advisable) to keep utf data in database tables with Latin 1 encoding. However it would be better to convert your tables (and data) to UTF, then make sure that you have line 'encoding: utf8' in your config/database.yml

This part is confusing, so I'll try to describe some scenarios: a) Database's encoding is set as Latin 1 (default) and there is no encoding line in config/database.yml. This will work, at least data can be saved/retrieved and displayed correctly in the browser - given the HTTP headers are right.

b) Database's encoding is Latin 1, but database.yml sets encoding to UTF. This will not work, because that line makes database client to issue "SET NAMES UTF8" when connecting to the database. In this case MySQL will try to convert data which it thinks is in Latin 1 to UTF. Alas, data is already UTF, but wrong encoding was specified–what you get is useless conversion and character junk.

c) Like b), but in this case database is set to UTF, and data is in UTF, but there is no encoding line in database.yml. In this case connection's encoding will be Latin 1 and MySQL will try to convert data from UTF to Latin 1. This will mess things, assuming HTTP headers/meta tag says that encoding is UTF.

So what would I do in your situation: 1) Try to change view-encoding to UTF-8 in the browser. If that fixes how characters are displayed, then, most likely, HTTP headers are messed up. 2) Check 'Content-type' header of the server response. If it says anything else than UTF - that must be fixed. 3) When headers are ok and page is displayed the way it should - convert tables and data in DB to UTF and then add encoding: utf8 to config/database.yml 4) Enjoy.

Of course other scenarios are likely too, which would require additional exploration.

Regards, Rimantas

Thank you very much Rimantas for that explanation. That was just what I needed. I will try your suggestions.

-Dave

This response is someone late, but if you're using migrations, here's one to put your MySql tables and database to UTF8:

class UseUtf8AsDefaultCharset < ActiveRecord::Migration    def self.my_number_is      say File.basename(__FILE__).to_i    end

   def self.database_charset(charset='utf8')      execute "ALTER DATABASE `#{current_database}` DEFAULT CHARACTER SET #{charset}"    end

   def self.table_charset(table, charset='utf8')      execute "ALTER IGNORE TABLE `#{table}` CONVERT TO CHARACTER SET #{charset}, DEFAULT CHARACTER SET #{charset}"    end

   def self.up      my_number_is      database_charset('utf8')      select_values("SHOW TABLES").each do |table|        table_charset(table, 'utf8')      end    end

   def self.down      my_number_is      database_charset('latin1')      select_values("SHOW TABLES").each do |table|        table_charset(table, 'latin1')      end    end end

Note that the down migration puts the tables back to their "original" character set which, in my case, was latin1.

-Rob

Rob Biedenharn http://agileconsultingllc.com Rob@AgileConsultingLLC.com

Thanks Rob! That will come in handy.

-Dave