Strange problem with CSV and funny chars

I am using CSV in a rake task (db:seed) on Rails 3.0.3, Ruby 1.9.2 to read a file with some funny chars in it. Breaking into the debugger at a point where the row read by CSV is in the variable row, and the string containing the char is in row['price'], I get the following strange results, which I cannot understand.

    (rdb:1) row['price']
    "\xA32.00"
    (rdb:1) row['price'][0]
    "\xA3"
    (rdb:1) row['price'][0] == "\xA3"
    false
    (rdb:1) row['price'][0].each_byte{|c| print c, ' '}
    163 "\xA3"
    (rdb:1) "\xA3".each_byte{|c| print c, ' '}
    163 "\xA3"
    (rdb:1) "\xA3".class
    String
    (rdb:1) row['price'][0].class
    String
    (rdb:1) row['price'][0] <=> "\xA3"
    -1
    (rdb:1) "\xA3" <=> row['price'][0]
    1
    (rdb:1) row['price'][0].length
    1
    (rdb:1) "\xA3".length
    1

So it appears that "\xA3" and row['price'][0] are both strings of length 1 and both contain the byte value 163, yet "\xA3" is definitely greater than row['price'][0]. If I do c1 = row['price'][0] and c2 = "\xA3" I still get the same effect: the variables c1 and c2 contain the same data but are different when compared.

No doubt I am doing something stupid. If someone could point out what, I would be most grateful.

Colin

Are you reading from xml to csv?

Line feed (newline) is &#10; and its hexadecimal representation is &#xA;

I woke up in the middle of the night and realised that this must be an encoding issue.

If I check the encoding of the two strings, I see that '\xa3' is UTF-8 but the data read by CSV is ASCII-8BIT.

    (rdb:1) '\xa3'.encoding.name
    "UTF-8"
    (rdb:1) row['price'].encoding.name
    "ASCII-8BIT"

This makes sense as CSV is reading an ASCII text file. So it appears that in Ruby 1.9.2 two strings that have the same contents and display the same, but have different encodings, do not compare equal. Whether they should compare equal or not I do not know.
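
Here is a minimal sketch that reproduces the effect outside of Rails and CSV. It assumes a UTF-8 source file, and the helper name same_bytes? is made up for the illustration; the ASCII-8BIT string only stands in for what CSV handed back.

```ruby
# encoding: utf-8

# Hypothetical helper for this illustration: compares raw bytes only,
# ignoring the encoding each string is tagged with.
def same_bytes?(a, b)
  a.bytes.to_a == b.bytes.to_a
end

# A literal in a UTF-8 source file is tagged UTF-8.
literal = "\xA3"
# Stand-in for the CSV data: the same byte, tagged ASCII-8BIT (binary).
from_csv = "\xA3".dup.force_encoding('ASCII-8BIT')

puts literal.encoding.name             # encoding of the literal
puts from_csv.encoding.name            # encoding of the simulated CSV data
puts same_bytes?(literal, from_csv)    # byte-for-byte comparison
puts literal == from_csv               # String#== also takes the encoding into account

# Retagging a copy of the literal so both sides share an encoding
# lets the comparison succeed.
puts literal.dup.force_encoding('ASCII-8BIT') == from_csv
```

The point is that force_encoding only changes the label on the string, not its bytes, so the two strings can hold byte 163 and still be treated as different.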

Colin

Just in case anybody has a similar problem and finds this in the future, here is what I had to do to sort out the issue. I needed to convert \xA3 chars in the ASCII data read by CSV into UK Pound signs. I had the same encoding issues with the regular expression, and this is what I had to do to achieve the desired effect.

At the top of the file (seeds.rb):

    #encoding: utf-8
    ...
    regex = Regexp.new( "\xA3".force_encoding('ASCII-8BIT') )

Then to do the sub:

    row['price'] = row['price'].gsub( regex, '£'.force_encoding('ASCII-8BIT') )

Then when it came to updating the ActiveRecord object with the data read by CSV I had to force it to UTF-8:

    model.price = row['price'].force_encoding('UTF-8')

This all works but I have to say that I am not sure that I fully understand all the encoding issues, so there may well be better ways.
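
One possible alternative, sketched under the assumption that the file is really ISO-8859-1 (where byte 0xA3 is the pound sign), is to tag the data with its true encoding and let Ruby transcode it to UTF-8, instead of substituting with a regex. The helper name latin1_to_utf8 is hypothetical, and I have not tried this in the rake task itself.

```ruby
# encoding: utf-8

# Hypothetical helper, not part of CSV or Rails.
# Assumption: the incoming bytes are ISO-8859-1; change the name of the
# source encoding if the file turns out to use something else.
def latin1_to_utf8(raw)
  # dup first so the caller's string is not retagged in place
  raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

# Stand-in for row['price'] as read by CSV: byte 0xA3 followed by "2.00".
price = "\xA32.00".dup.force_encoding('ASCII-8BIT')

converted = latin1_to_utf8(price)
puts converted.encoding.name     # encoding of the transcoded string
puts converted == "£2.00"        # compare against a UTF-8 literal
```

The difference from force_encoding('UTF-8') on its own is that encode rewrites the bytes, so the result should be a valid UTF-8 string rather than a relabelled one.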

Colin