Strange problem with CSV and funny chars

Colin_Law1 · December 5, 2010, 9:02pm

I am using CSV in a rake task (db:seed) on Rails 3.0.3, Ruby 1.9.2 to read a file with some funny chars in it. Breaking into the debugger at a point where the row read by CSV is in the variable row, and the string containing the char is in row['price'], I get the following strange results, which I cannot understand.

(rdb:1) row['price']
"\xA32.00"
(rdb:1) row['price'][0]
"\xA3"
(rdb:1) row['price'][0] == "\xA3"
false
(rdb:1) row['price'][0].each_byte{|c| print c, ' '}
163 "\xA3"
(rdb:1) "\xA3".each_byte{|c| print c, ' '}
163 "\xA3"
(rdb:1) "\xA3".class
String
(rdb:1) row['price'][0].class
String
(rdb:1) row['price'][0] <=> "\xA3"
-1
(rdb:1) "\xA3" <=> row['price'][0]
1
(rdb:1) row['price'][0].length
1
(rdb:1) "\xA3".length
1

So it appears that "\xA3" and row['price'][0] are both strings of length 1 and both contain the byte value 163, yet "\xA3" is definitely greater than row['price'][0]. If I do c1 = row['price'][0] and c2 = "\xA3" I still get the same effect. The variables c1 and c2 contain the same data but are different when compared.

No doubt I am doing something stupid; if someone could point out what, I would be most grateful.

Colin

11155 · December 6, 2010, 3:13am

Are you reading from xml to csv?

Line feed (newline) is &#10;; the hexadecimal representation is &#xA;

Colin_Law1 · December 6, 2010, 9:30am

I woke up in the middle of the night and realised that this must be an encoding issue.

If I check the encoding of the two strings then I see that '\xa3' is UTF-8 but the data read by CSV is ASCII-8BIT.

(rdb:1) '\xa3'.encoding.name
"UTF-8"
(rdb:1) row['price'].encoding.name
"ASCII-8BIT"

This makes sense as CSV is reading an ascii text file. So it appears that in Ruby 1.9.2 two strings that have the same contents and display the same, but have different encodings, do not compare equal. Whether they should compare equal or not I do not know.
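A minimal sketch of the comparison behaviour described above. The variable names are illustrative only, and the byte is built with pack so the result does not depend on the encoding of the source file:

```ruby
# Two strings holding the same single byte (0xA3 = 163) under different encodings.
binary = [0xA3].pack('C')                         # pack('C') yields an ASCII-8BIT string
utf8   = [0xA3].pack('C').force_encoding('UTF-8') # same byte, relabelled as UTF-8

same_bytes = binary.bytes.to_a == utf8.bytes.to_a # true: both are [163]
equal      = binary == utf8                       # false: encodings differ and the byte is non-ASCII

# Relabelling one side so that both encodings match makes the comparison succeed.
equal_as_binary = binary == utf8.dup.force_encoding('ASCII-8BIT') # true

puts same_bytes, equal, equal_as_binary
```

Note that force_encoding only changes the label attached to the bytes; it does not convert them. A lone 0xA3 byte is not valid UTF-8, which can be checked with valid_encoding?.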

Colin

Colin_Law1 · December 6, 2010, 4:16pm

Just in case anybody has a similar problem and finds this in the future, here is what I had to do to sort out the issue. I needed to convert the \xA3 chars in the ascii data read by CSV into UK pound signs. I had the same encoding issues with the regular expression, and this is what I had to do to achieve the desired effect.

At the top of the file (seeds.rb):

#encoding: utf-8
...
regex = Regexp.new( "\xA3".force_encoding('ASCII-8BIT') )

Then to do the sub:

row['price'] = row['price'].gsub( regex, '£'.force_encoding('ASCII-8BIT') )

Then, when it came to updating the ActiveRecord object with the data read by CSV, I had to force it to UTF-8:

model.price = row['price'].force_encoding('UTF-8')

This all works but I have to say that I am not sure that I fully understand all the encoding issues, so there may well be better ways.
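One possible alternative, sketched under an assumption the thread does not confirm: that the file is really ISO-8859-1 (Latin-1), where byte 0xA3 is the pound sign. In that case the bytes can be labelled with their real encoding and transcoded to UTF-8 in one step, with no regular expression. The raw value below is a stand-in for row['price'], not data from the actual file:

```ruby
# Stand-in for the value read by CSV: the bytes of "\xA32.00", tagged ASCII-8BIT.
raw = [0xA3, 0x32, 0x2E, 0x30, 0x30].pack('C*')

# Assumption: the source file is ISO-8859-1.
# force_encoding relabels the bytes without changing them;
# encode then transcodes them, turning 0xA3 into the UTF-8 bytes 0xC2 0xA3.
price = raw.force_encoding('ISO-8859-1').encode('UTF-8')

puts price.encoding        # UTF-8
puts price.valid_encoding? # true
puts price.bytes.to_a.inspect
```

If the assumption about the file's encoding holds, the CSV library's :encoding option (for example 'ISO-8859-1:UTF-8') may also be worth checking in the CSV documentation, since it lets the transcoding happen while the file is read.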

Colin
