temp.each do |row|
  @newhash << { :var1 => row[0], :var2 => row[1] }
end
Finally I create a new record out of the @newhash above, but before that
I get an error whenever there is a special character in a row:
"invalid byte sequence in UTF-8"
I have German special characters: ä, ö, ü.
Without these characters my code works!
How can I avoid the error by using the right encoding?
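One common cause of "invalid byte sequence in UTF-8" is data that is actually Latin-1 (or CP850) being treated as UTF-8. A minimal sketch, assuming the source data is ISO-8859-1 (adjust to whatever your file really uses), is to re-encode each string before building the hashes:

```ruby
# Sketch: transcode a Latin-1 string to UTF-8 before using it.
# The sample string is hypothetical; the source encoding is an assumption.
latin1 = "M\xFCsli".force_encoding("ISO-8859-1")  # "Müsli" in Latin-1
utf8   = latin1.encode("UTF-8")

puts utf8            # => Müsli
puts utf8.encoding   # => UTF-8
```

`force_encoding` only relabels the bytes, while `encode` actually converts them, which is why the two are combined here.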
If I use in my controller:
#encoding: CP850
or
#encoding: iso-8859-1
then the error message doesn't appear, but the special character ü is
replaced by a question mark.
It looks like this: M�sli
I thought UTF-8 was able to handle German special characters.
It took me the whole day and I still haven't found a solution. I
really hope that someone can help me.
I personally haven't had to deal with encoding issues yet, but I remember reading a couple of posts from Yehuda Katz (of Merb fame and a core contributor to Rails) on that.
Maybe these can help you identify and fix your problem:
The articles are a little long, but if you already know a good deal about encodings, you can skip towards the end of the posts, where he writes about how to deal with conversions.
Thank you for the links. I will read them and see if there is
something that can help me.
I found out that the main problem was that my gvim editor was not
saving my *.rb files in UTF-8. I edited them with Notepad and saved
them explicitly as UTF-8, and then the German special characters
worked in my controllers.
There is still a problem with the CSV class, which I need to import
a CSV file: it cannot read the special characters.
I just opened my CSV file with Notepad and saved it with UTF-8
encoding; now my original code works perfectly and the special
characters are shown normally.
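Instead of re-saving the file by hand, Ruby's CSV class can transcode on read when you give it an "external:internal" encoding pair. A sketch, with made-up sample data and assuming the source file is ISO-8859-1:

```ruby
require "csv"
require "tempfile"

# Sketch: read a Latin-1 CSV directly as UTF-8 rows.
# The file contents here are fabricated for illustration.
Tempfile.create(["products", ".csv"]) do |f|
  f.binmode
  f.write("M\xFCsli;2.50\n")      # "Müsli;2.50" encoded as ISO-8859-1
  f.flush

  # "ISO-8859-1:UTF-8" means: bytes on disk are Latin-1, give me UTF-8 strings.
  CSV.foreach(f.path, col_sep: ";", encoding: "ISO-8859-1:UTF-8") do |row|
    puts row[0]   # => Müsli
  end
end
```

The same `encoding:` option works with `CSV.read` and `CSV.open`, since it is passed through to the underlying `File.open`.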
file.temp is an object. I have a form where a CSV can be uploaded, but
it is never stored; that's why I use a tempfile. That means I
probably have no path to use in that method.
BUT the open and foreach methods of the CSV class work with an
object whenever I don't have a German special character in my CSV file,
or when my CSV file is already in UTF-8 encoding.
It worked perfectly once I made sure that my CSV file is in UTF-8
encoding format.
I deleted some of my program, so I had to write a lot of stuff again.
If I now upload a CSV file that is in UTF-8 format, then every time
the first three characters of the first row are: \xEF\xBB\xBF
That's a UTF BOM: a magic Unicode character that tells whoever is
reading the stream what the byte order is, and also allows telling
UTF-8 apart from UTF-16.
You can safely strip it from the file.
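Two ways to strip it in Ruby, as a sketch (the sample data is made up): remove the BOM character from a string you already have, or let the IO layer consume it by opening the file with the `"bom|utf-8"` external encoding:

```ruby
require "tempfile"

bom = "\uFEFF"                       # encodes to the bytes EF BB BF in UTF-8

# 1) In memory: drop the BOM from the front of a string.
data = bom + "k\u00FChler"           # "kühler" with a leading BOM
puts data.delete_prefix(bom)         # => kühler

# 2) At the IO layer: "bom|utf-8" tells Ruby to consume a BOM on open.
Tempfile.create("bom_demo") do |f|
  f.binmode
  f.write("\xEF\xBB\xBFk\xC3\xBChler\n")
  f.flush
  puts File.read(f.path, encoding: "bom|utf-8")  # => kühler (no BOM)
end
```

The `"bom|utf-8"` form is harmless when no BOM is present, so it is safe as a default for files of unknown origin.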
I read that this is something about Unicode and byte ordering, but I don't
know where these hex chars come from.
Every German special character is also shown in this hex code,
e.g. "k\xC3\xBChler" should be "kühler"
That is probably just an output artifact if you are seeing this in a
terminal window: \xC3\xBC is the UTF-8 byte sequence for ü.
Stripping the first chars is possible of course, but I don't
understand why these chars are there.
It was working before! I could just upload the UTF-8 CSV and everything
worked great. I don't really know what I changed so that now
these chars are appearing.
Unicode uses them to indicate to the application reading the text file which order the following bytes are in. UTF-16 stores each code unit in two bytes, so a text file may be little-endian or big-endian, and unless you know what order to expect, you can't really decode it. UTF-8 has a fixed byte order, so there the BOM serves only as a signature marking the stream as UTF-8.
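You can see this directly in Ruby: the same BOM character, U+FEFF, encodes to different byte sequences in each encoding, which is how a reader can infer both the encoding family and the byte order:

```ruby
# The BOM is one character whose bytes differ per encoding.
bom = "\uFEFF"

p bom.encode("UTF-8").bytes      # => [239, 187, 191]  (EF BB BF)
p bom.encode("UTF-16BE").bytes   # => [254, 255]       (FE FF)
p bom.encode("UTF-16LE").bytes   # => [255, 254]       (FF FE)
```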
Thank you for your reply! In the meantime I figured out why this was
working without errors in my first code!
There I had some regex checks before saving each row into the
database. That means the first row always got skipped, because the
Unicode identifiers didn't match the regex.
Now I know where my fault is, but I don't really know how to solve it.
If the source CSV is in UTF-8, I can of course strip the first three
chars. But if it is in another encoding, that means I strip off chars
that I need. How can I check which encoding the file has? I tried this
here, but it always gives me CP850 as the encoding:
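There is no fully reliable way to detect a file's encoding, and a result of CP850 usually just means the platform default is leaking through. A common heuristic, sketched below (the helper name is made up, not from this thread): if the bytes form valid UTF-8, treat them as UTF-8, otherwise fall back to a known legacy encoding.

```ruby
# Hypothetical helper: guess between UTF-8 and one assumed legacy encoding.
# This is a heuristic, not real detection -- the fallback is an assumption.
def guess_encoding(bytes, fallback: "ISO-8859-1")
  utf8 = bytes.dup.force_encoding("UTF-8")
  utf8.valid_encoding? ? "UTF-8" : fallback
end

p guess_encoding("k\xC3\xBChler".b)  # valid UTF-8 sequence => "UTF-8"
p guess_encoding("k\xFChler".b)      # lone \xFC is invalid UTF-8 => "ISO-8859-1"
```

This works because almost no Latin-1 text happens to form valid multi-byte UTF-8 sequences, so a successful `valid_encoding?` check on UTF-8 is strong (though not conclusive) evidence.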