Strange problem with CSV and funny chars

I am using CSV in a rake task (db:seed) on Rails 3.0.3, Ruby 1.9.2 to
read a file with some funny chars in it. I broke into the debugger at
a point where the row read by CSV is in the variable row and the string
containing the char is in row['price']. There I get the following
strange results, which I cannot understand.

(rdb:1) row['price']
(rdb:1) row['price'][0]
(rdb:1) row['price'][0] == "\xA3"
(rdb:1) row['price'][0].each_byte{|c| print c, ' '}
163 "\xA3"
(rdb:1) "\xA3".each_byte{|c| print c, ' '}
163 "\xA3"
(rdb:1) "\xA3".class
(rdb:1) row['price'][0].class
(rdb:1) row['price'][0] <=> "\xA3"
(rdb:1) "\xA3" <=> row['price'][0]
(rdb:1) row['price'][0].length
(rdb:1) "\xA3".length

So it appears that "\xA3" and row['price'][0] are both strings of
length 1 and both contain the byte value 163, yet "\xA3" is definitely
greater than row['price'][0].
If I do c1 = row['price'][0] and c2 = "\xA3" I still get the same
effect. The variables c1 and c2 contain the same data but are
different when compared.

No doubt I am doing something stupid, if someone could point out what,
then I would be most grateful.


Are you reading from xml to csv?

Line feed (newline) is &#10; and its hexadecimal representation is &#xA;

I woke up in the middle of the night and realised that this must be an
encoding issue.

If I check the encoding of the two strings then I see that '\xa3' is
utf-8 but the data read by csv is ascii-8bit.

(rdb:1) '\xa3'
(rdb:1) row['price']

This makes sense as CSV is reading an ascii text file. So it appears
that in Ruby 1.9.2 two strings that contain the same bytes and display
the same, but have different encodings, do not compare equal.
Whether they should compare equal or not I do not know.
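
For anyone who wants to reproduce this outside the debugger, here is a minimal sketch. It assumes nothing about the CSV file itself: the variable names are made up for illustration, and the byte is simply tagged by hand with the two encodings mentioned above.

```ruby
# encoding: utf-8

# The same single byte (163, i.e. 0xA3) tagged with two different encodings.
# Both variable names are hypothetical stand-ins for the values in the thread.
binary = "\xA3".dup.force_encoding('ASCII-8BIT')  # like the data read by CSV
utf8   = "\xA3".dup.force_encoding('UTF-8')       # like a literal in a UTF-8 source file

# The raw bytes are identical...
same_bytes = (binary.bytes.to_a == utf8.bytes.to_a)

# ...but String#== also takes the encoding into account when the
# strings contain non-ASCII bytes.
equal = (binary == utf8)

# Re-tagging one side so that both encodings match makes them compare equal.
aligned_equal = (binary.dup.force_encoding('UTF-8') == utf8)

puts "encodings:     #{binary.encoding} / #{utf8.encoding}"
puts "same bytes:    #{same_bytes}"
puts "equal:         #{equal}"
puts "aligned equal: #{aligned_equal}"
```

Note that force_encoding only changes the label on the string, not the bytes, which is why the byte comparison still succeeds.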


Just in case anybody has a similar problem and finds this in the
future, here is what I had to do to sort out the issue. I needed to
convert \xA3 chars in the ascii data read by CSV into UK Pound signs.
I had the same encoding issues with the regular expression, and this is
what I had to do to achieve the desired effect.

At the top of the file (seeds.rb)
#encoding: utf-8
regex = "\xA3".force_encoding('ASCII-8BIT')
Then to do the sub
row['price'] = row['price'].gsub( regex, '£'.force_encoding('ASCII-8BIT') )

Then when it came to updating the ActiveRecord object with the data
read by CSV I had to force it to utf-8
model.price = row['price'].force_encoding('UTF-8')
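
Put together as one runnable snippet, the steps above might look like the sketch below. The row hash and its sample value are hypothetical stand-ins for what CSV returned, and the result is kept in a local variable rather than assigned to the ActiveRecord model.

```ruby
# encoding: utf-8

# Hypothetical stand-in for a row read by CSV: the byte 0xA3 followed by
# "9.99", tagged as ASCII-8BIT like the data in the thread.
row = { 'price' => ("\xA3" + "9.99").force_encoding('ASCII-8BIT') }

# Pattern and replacement are both tagged ASCII-8BIT so that gsub sees
# compatible encodings on every argument.
pound_byte = "\xA3".dup.force_encoding('ASCII-8BIT')  # the single byte 0xA3
pound_utf8 = '£'.dup.force_encoding('ASCII-8BIT')     # the UTF-8 bytes of the pound sign

# Swap the lone 0xA3 byte for the two-byte UTF-8 sequence.
row['price'] = row['price'].gsub(pound_byte, pound_utf8)

# The bytes are now valid UTF-8, so relabel the string before handing it on.
price = row['price'].force_encoding('UTF-8')

puts price
puts price.encoding
puts price.valid_encoding?
```

The dup calls are only there so the literals can be relabelled safely on Ruby versions that freeze or warn about mutating string literals.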

This all works but I have to say that I am not sure that I fully
understand all the encoding issues, so there may well be better ways.
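
One possible alternative, offered only as a sketch: if the file is actually ISO-8859-1 (Latin-1), where byte 0xA3 is the pound sign, the string can be transcoded with String#encode instead of patched with gsub. That the file is Latin-1 is an assumption here, not something established in this thread, and the sample value is made up.

```ruby
# encoding: utf-8

# Hypothetical stand-in for the value read by CSV (tagged ASCII-8BIT).
raw = ("\xA3" + "9.99").force_encoding('ASCII-8BIT')

# ASSUMPTION: the source file is ISO-8859-1. First label the bytes with the
# encoding they are believed to be in, then transcode them to UTF-8.
price = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts price
puts price.encoding
```

Unlike force_encoding, encode rewrites the bytes, so every non-ASCII character in the field is converted, not just the pound sign.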