Encoding issue (I think): All apostrophes have changed to these odd characters after a database switch

I'm wondering if anyone can give any insight into how I could resolve the problem on this website:

http://jdrampage.com/

Basically, all the ’ are supposed to be apostrophes ( ' ), and quotes are messed up too...

Is it possible to run some command in the "rails console production" to fix this?

Really appreciate any help!!

> I'm wondering if anyone can give any insight into how I could resolve the problem on this website:
>
> http://jdrampage.com/
>
> basically, all the ’ are supposed to be apostrophes ( ' ), and quotes are messed up too...
>
> Is it possible to run some command in the "rails console production" to fix this?

That does indeed look like an encoding issue. I assume that your ' were in fact curly quotes. This kind of thing can happen when there is a mismatch between the encoding the database is using and what rails is using.

For example, if Rails is using utf8 but the database connection is set to CP1252, then in order to save the curly quote character, your Ruby script would send the bytes E2 80 99, which is the utf8 sequence for the Unicode right single quotation mark (U+2019). If your db connection is set to latin1 (or any similar single-byte encoding) then it will happily store that byte sequence as it is.
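To make that byte sequence concrete, here is a minimal Ruby sketch (not from the thread; the variable names are illustrative). It prints the UTF-8 bytes of U+2019 and then shows what the same bytes look like when decoded as CP1252, assuming CP1252 is the single-byte encoding involved:

```ruby
# U+2019 RIGHT SINGLE QUOTATION MARK, held as a UTF-8 string.
curly = "\u2019"

# Its UTF-8 encoding is three bytes; format them as hex.
utf8_bytes = curly.bytes.map { |b| format("%02X", b) }.join(" ")
puts utf8_bytes

# Relabel the same three bytes as CP1252 (Windows-1252), then transcode
# the result to UTF-8 so it can be printed.
# Assumption: the single-byte encoding in play is CP1252.
misread = curly.dup.force_encoding("Windows-1252").encode("UTF-8")
puts misread
```

The second line printed is the three-character sequence seen on the site in place of each apostrophe.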

If your app now starts doing the right thing and asks the db for utf8, the database converts what it thinks is latin1 (but is actually already utf8) into utf8 a second time, and so you get garbage (in cp1252, E2 80 99 is ’, which is what I see on your website). In order to fix this you typically want to tell the database to reinterpret the contents of text columns as utf8. How exactly depends on your database, but in mysql something like

  ALTER TABLE foos MODIFY some_column BLOB;
  ALTER TABLE foos MODIFY some_column TEXT CHARACTER SET utf8;

will reinterpret whatever is in some_column as utf8. This might not be exactly what you need: experiment with your data to see exactly what has happened (I once had a case where text was going through this double encoding process twice, so I had to repeat the above commands twice to straighten out the data). Once you've sorted things out, make sure you don't fall into this hole again by making sure that all your databases and tables have their default encoding set to utf8.
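Since the original question asked about the Rails console, the same reinterpretation can also be sketched in Ruby for a single string. This is not the SQL approach described above, only an illustration under assumptions: the damage is exactly one round of "UTF-8 bytes read as CP1252", and `repair_mojibake`, `Article`, and `body` are hypothetical names.

```ruby
# Hypothetical helper: undo one round of "UTF-8 bytes read as CP1252".
# Returns the input unchanged when it does not look double-encoded.
def repair_mojibake(text)
  # Map each character back to its CP1252 byte, then relabel the
  # resulting bytes as UTF-8.
  candidate = text.encode("Windows-1252").force_encoding("UTF-8")
  candidate.valid_encoding? ? candidate : text
rescue EncodingError
  # The text contains characters CP1252 cannot represent, so it was
  # probably not produced by this kind of mix-up; leave it alone.
  text
end

damaged = "It\u00E2\u20AC\u2122s"   # "It" + the three garbage characters + "s"
puts repair_mojibake(damaged)

# In a Rails console this might be applied per record, for example
# (Article and body are assumed names, not taken from the thread):
#   Article.find_each { |a| a.update_column(:body, repair_mojibake(a.body)) }
```

Strings that are pure ASCII or already contain a correct curly quote come back unchanged, which matters if some rows were fixed by hand. Try it on a copy of the data first.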

Fred

So... the old database used what type of encoding? latin1? And the new one uses utf8?

Is it a problem if some articles were already cleaned by doing a search and replace, e.g. swapping all ’ for its corresponding proper symbol?

Really appreciate the help!

> So... the old database used what type of encoding? latin1? And the new one uses utf8?

It's your database, you tell me!

> Is it a problem if some articles were already cleaned by doing a search and replace, e.g. swapping all ’ for its corresponding proper symbol?

Depends. If you have replaced with pure ascii then it's not a problem. If not (i.e. for a given column and table you now have a mix of encodings) then you will have made things worse.

Fred

Oh... thank you for answering me! Basically, some people ran search-and-replaces for these:

"| replaced with :
– replaced with -
’ replaced with '
“ replaced with "
†replaced with "

I apologize for my lack of expertise, but have things been replaced "with pure ascii"? (What exactly is that...?)

One other thing: Does the encoding or interpretation of encoding vary from browser to browser? See, I went to school and checked out the site - only to find this odd symbol located after double-quotes... it looked like two squares on top of each other... Yet, at home, or outside of school, I do not see this symbol anywhere. Why might this be?

(Thank you again!)

If the browser is using a typeface (font) that doesn't include the precise character that your page encoding and HTML require, then you won't see that character. The glyph you describe sounds like the "missing glyph" character, and that's why I'm guessing you're seeing it.

Another layer to this cake.

Walter

But the missing glyph character should be replaced with nothing. In fact, if I could do so easily, I would just delete all these "missing glyph" characters... Is there anything you recommend I do about the missing glyph? I mean I don't even see it on my home computer... only on the computers at school.

The missing glyph character is a feature of many different fonts -- it means literally, "I don't have any glyph by that name in my table". The way you "get rid of it" is by providing an encoding and substitution escapes that convert the wide, wild world of Unicode typography into something that the more limited browser/OS combinations can handle.
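One possible reading of "substitution escapes" is HTML numeric character references, which keep the page source pure ASCII whatever charset is declared. A minimal Ruby sketch, with `to_char_refs` as a hypothetical helper name:

```ruby
# Hypothetical helper: replace every non-ASCII character with a decimal
# HTML character reference such as &#8217;.
def to_char_refs(text)
  text.gsub(/[^[:ascii:]]/) { |ch| format("&#%d;", ch.ord) }
end

puts to_char_refs("It\u2019s")
```

This only removes the dependence on the charset declaration. If the font in use has no glyph for the character, the browser will still show the missing glyph.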

There are fonts that specialize in having suitably large collections of characters to print nearly anything besides Klingon. These will often have the word Unicode in their name. Many, if not most, core Mac fonts are Unicode-aware, and if you are writing out a CSS font-family that you mean to cover the most possible characters, you will add the Microsoft variants of those to your list:

  font-family: "Lucida Grande", "Lucida Sans Unicode", "Lucida Sans", Lucida, Geneva, Verdana, sans-serif;

In order for even this font family to work, the user will have to install a modern version of their operating system, and maybe a modern browser, and you can't know or control that at all. But you can and should declare a character encoding in a meta tag, and your server should send a content-type header that includes a charset attribute. All of these should match the encoding within your database, and within the other content served by your Web server. One charset to rule them all!
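A quick way to check that the declarations agree is to compare the charset named in the Content-Type header with the one in the meta tag. The sketch below is a hedged illustration: `charset_of` is a hypothetical helper, and the two sample values are made up rather than taken from the site.

```ruby
# Hypothetical helper: pull the charset out of a Content-Type value such as
# "text/html; charset=UTF-8". Returns nil when no charset is present.
def charset_of(content_type)
  match = content_type.match(/charset\s*=\s*"?([^\s;"]+)/i)
  match && match[1].downcase
end

header_value = "text/html; charset=UTF-8"   # sample HTTP header value
meta_value   = "text/html; charset=utf-8"   # sample meta content attribute

if charset_of(header_value) == charset_of(meta_value)
  puts "charsets agree: #{charset_of(header_value)}"
else
  puts "charset mismatch"
end
```

The comparison is case-insensitive because charset names are.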

If your data is stuck in a particular charset, and you can't figure out how to convert it into UTF-8, then you need to modify everything -- starting with Rails -- to recognize that the content is in that encoding, and to treat it as such. Then you also need to specify the encoding in the generated HTML, so your /layouts/application.html.erb or local equivalent should have this line in it somewhere:

<meta http-equiv="Content-type" content="text/html; charset=YourEncodingHere" />

Walter

> > > Is it a problem if some articles were already cleaned by doing a
> > > search and replace, e.g. swapping all ’ for its corresponding proper
> > > symbol?

> > Depends. if you have replaced with pure ascii then it's not a problem.
> > if not (ie for a given column and table you have a mix of encodings)
> > then you will have made things worse.

> Oh... thank you for answering me! Basically, some people ran search-and-replaces for these:
>
> "| replaced with :
> – replaced with -
> ’ replaced with '
> “ replaced with "
> †replaced with "
>
> I apologize for my lack of expertise, but have things been replaced "with pure ascii"? (What exactly is that...?)

Those are pure ascii characters (i.e. each can be represented by a 7-bit integer). Latin1, UTF8, etc. all represent these characters in the same way, so you shouldn't get a problem.
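This can be checked in Ruby. The sketch below assumes the four replacement characters listed in the quoted message; the variable names are illustrative.

```ruby
# The replacement characters from the search-and-replace list.
replacements = [":", "-", "'", '"']

replacements.each do |ch|
  latin1_bytes = ch.encode("ISO-8859-1").bytes
  utf8_bytes   = ch.encode("UTF-8").bytes
  # ascii_only? is true when every byte is below 0x80.
  puts format("%p ascii_only=%s same_bytes=%s",
              ch, ch.ascii_only?, latin1_bytes == utf8_bytes)
end

all_ascii = replacements.all?(&:ascii_only?)

# For contrast, a curly apostrophe is not ASCII.
curly_is_ascii = "\u2019".ascii_only?
puts "curly apostrophe ascii_only=#{curly_is_ascii}"
```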

> One other thing: Does the encoding or interpretation of encoding vary from browser to browser? See, I went to school and checked out the site - only to find this odd symbol located after double-quotes... it looked like two squares on top of each other... Yet, at home, or outside of school, I do not see this symbol anywhere. Why might this be?

As Walter says, that sounds like the missing glyph character, i.e. "you've asked me to display a character that I can't display".

Fred