Multibyte Character References

Mark Dodwell wrote:


I have a load of records in my database which were imported through
processing a YAML file. These original YAML files were created from the
'to_yaml' function of an array of Hash objects.

The YAML file contains multibyte character references such as:

...and between them and today\xE2\x80\x99s College. The scope, r...

When I imported this data into my DB, these character references
changed form, but they are still there in the DB:

...and between them and today\342\200\231s College. The scope, r...

So I have two questions:

1) Are the original characters retrievable from the copy in the DB, or
have they been mangled?

2) If the answer above is yes, then how?

Really appreciate any help on this one. Many thanks in advance.

~ Mark
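
On question 1, one thing that can be checked directly: the two excerpts above use different escape notations (hexadecimal `\xE2\x80\x99` and octal `\342\200\231`), and both spell the same three bytes. The Ruby sketch below compares them and relabels the bytes as UTF-8. The helper names `same_bytes?` and `decode_utf8_bytes` are made up for this example, and it assumes a Ruby version with encoding support (2.0 or later for `String#b`). It only shows that the byte sequence quoted in the post is intact, not what the rest of the table contains.

```ruby
# Sketch only: the helper names below are hypothetical, not from the thread.

# Compare two strings byte for byte, ignoring their encoding labels.
def same_bytes?(a, b)
  a.bytes == b.bytes
end

# Relabel raw bytes as UTF-8 without changing the bytes themselves.
def decode_utf8_bytes(raw)
  raw.dup.force_encoding(Encoding::UTF_8)
end

hex_form   = "today\xE2\x80\x99s".b   # escapes as quoted from the YAML file
octal_form = "today\342\200\231s".b   # escapes as quoted from the DB

puts same_bytes?(hex_form, octal_form)   # true

text = decode_utf8_bytes(octal_form)
puts text.valid_encoding?                # true
puts format("U+%04X", text[5].ord)       # U+2019
```

U+2019 is the right single quotation mark, which fits the apostrophe position in "today's".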

What's the encoding in the YAML file (presumably UTF-8)? What database are you using, and what encoding is your database/table set to?

Mark Dodwell wrote:

Hi Michael,

The DB is 'ISO Latin 1 (latin1)' encoding.

I'm not sure about the original YAML file (do you know the default encoding for .to_yaml?) - but when I open it directly with, say, TextMate, it shows the character reference, *not* the actual character.


~ Mark

MySQL, if that's what you are using, lets you set the character encoding at several levels (server, database, table, column). If so, you could try something like an ALTER TABLE to change the encoding to UTF-8 (which I'm guessing is what the original YAML data is in). You might have to export the data and import it into a table that's already set to UTF-8, though. In that case, if you still have all the YAML data around, it might be easier just to reload it with the table set to the proper encoding.
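
If reloading is not an option, the text can sometimes be repaired on the Ruby side after reading it back. The sketch below covers two cases, and which one applies (if either) depends on how the driver and connection are configured, which the thread does not say. The helper names `relabel_as_utf8` and `undo_double_encoding` are made up for this example. It relies on MySQL's latin1 being based on Windows-1252, and it will raise for the few byte values that Windows-1252 leaves undefined, so treat it as a starting point rather than a complete fix.

```ruby
# Sketch only: hypothetical helpers, assuming the stored bytes are UTF-8
# that was written into a latin1 column.

# Case 1: the driver hands back the raw bytes untouched, but labelled as
# binary or latin1. Only the encoding label needs to change.
def relabel_as_utf8(raw)
  text = raw.dup.force_encoding("UTF-8")
  raise ArgumentError, "bytes are not valid UTF-8" unless text.valid_encoding?
  text
end

# Case 2: the bytes were transcoded from latin1 to UTF-8 on the way out,
# so each original byte became its own character (mojibake). Encoding back
# to Windows-1252 recovers the original bytes, which are then relabelled.
def undo_double_encoding(mojibake)
  mojibake.encode("Windows-1252").force_encoding("UTF-8")
end

raw_from_db = "today\342\200\231s College".b
puts relabel_as_utf8(raw_from_db).valid_encoding?

garbled = "today\u00E2\u20AC\u2122s College"
puts undo_double_encoding(garbled) == "today\u2019s College"
```

Running either helper over a copy of the data first, and spot-checking a few rows, is safer than updating the table in place.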