Won't display characters following '\267'

Nik · June 23, 2009, 8:03pm

Hello!

I use MySQL and have made sure it is set to UTF-8, and the character set in my view is also UTF-8. But when I display, in a textarea, text that came from either antiword.exe or WIN32OLE output of an MS Word document, the text stops showing immediately after a strange character that shows up in the Rails console as \267. I went back to Word to see what it is (I looked it up by its position): it is a dot floating in the middle of the line, like the one used when citing chapters of the Bible, e.g. 12-7[dot]Matthew.

For example:

Rails console: >> doc = "This is a pipe, but \267 this is not a pipe"

HTML: This is a pipe, but

It just STOPS rendering the rest of the text.

I can't possibly ask my clients to remove that character for my convenience. I have been on a 38-hour hunt for a solution.

Some say to remove all [^[:print:]] matches, which I can do if I find a way to at least preserve the \n and \r characters. But I also want to preserve as much of the original document as possible. I mean, what if they use umlauts, like the o with two dots on top?
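
On the [^[:print:]] idea, here is a minimal sketch. It assumes Ruby 1.9 or later and text that is already valid UTF-8 (on Ruby 1.8, regexes work on bytes unless given the /u flag, so results differ). Under those assumptions the POSIX bracket classes are Unicode-aware, so umlauts and the middle dot count as printable and survive. The sample string is made up for illustration.

```ruby
# Hypothetical sample: o-umlaut, middle dot, e-acute, a bell control
# character, then a CR/LF line ending.
text = "Sch\u00F6n \u00B7 caf\u00E9\a\r\n"

# Remove everything that is not printable, but keep \r and \n.
cleaned = text.gsub(/[^[:print:]\r\n]/, '')
```

Only the bell character is removed here; the accented letters, the dot, and the line ending are kept.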

Any ideas?

Thank You!

Philip_Hallstrom · June 23, 2009, 8:21pm

You could try...

require 'iconv'

clean_str = Iconv.new('UTF-8//Ignore', 'UTF-8').iconv(messy_str)

It doesn't always work though... you might need to catch
Iconv::InvalidCharacter...

Worth a try though and has gotten me out of some of this mess with bad
source data.
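
For readers on newer Rubies, where Iconv is no longer in the standard library (it was removed in Ruby 2.0), a rough equivalent is sketched below. It assumes Ruby 2.1+ for String#scrub and swaps that in for Iconv. Like //IGNORE, it silently drops the bytes it cannot interpret, so the dot itself is lost rather than converted.

```ruby
# Bytes as antiword might return them: "\xB7" is "\267" written in hex.
messy_str = "This is a pipe, but \xB7 this is not a pipe"

# Treat the bytes as UTF-8 and drop any invalid sequences
# (roughly what Iconv's //IGNORE does).
clean_str = messy_str.dup.force_encoding('UTF-8').scrub('')
```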

Nik · June 23, 2009, 10:46pm

Thanks, Philip, for your help!

I just tried it and it works great! It displays that dot thing. But my regular expressions did not account for these characters, and some of them now fail where these characters appear.

1 - I don't even know what the right question to ask is, but what do you call \267? Is it hex, octal, or decimal?

And 2 - How can I match a character like \267, or 'dot' as I call it? And does it have a class name?

Lastly, 3 - By what character code or other means can I systematically identify accented characters, such as the accent grave in French?

Thank You!

Matt_Jones · June 24, 2009, 3:48pm

You really need to translate the character encoding of that data. Rails is assuming that it's UTF-8, when (from your description of the character) it's either Windows-1252 or (possibly) ISO-8859-1. Your previous problem was the default UTF-8 parser giving up, as \267 (B7 hex) is only valid in UTF-8 as a continuation byte inside a multibyte sequence.
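
A small sketch of what that translation could look like, assuming Ruby 1.9+ (String#encode) and assuming the source bytes really are Windows-1252, which is a guess based on the description above:

```ruby
# "\267" is an octal escape for the single byte 0xB7; .b keeps it as raw bytes.
raw = "This is a pipe \267 but this is not a pipe".b

# Read as UTF-8, a lone 0xB7 is a continuation byte with no lead byte.
valid_as_utf8 = raw.dup.force_encoding('UTF-8').valid_encoding?

# Label the bytes with their assumed real encoding, then transcode to UTF-8.
converted = raw.dup.force_encoding('Windows-1252').encode('UTF-8')
```

After the conversion the dot is stored as the two-byte UTF-8 sequence for U+00B7 and the whole string is valid UTF-8.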

--Matt Jones

Philip_Hallstrom · June 24, 2009, 3:52pm

> 1 - I don't even know what the right question to ask is, but what do you call \267? Is it hex, octal, or decimal?

It's an octal escape for one byte: 267 octal is B7 hex. The dot it stands for is a Unicode character (U+00B7), multi-byte in UTF-8 but still a single character.

> And 2 - How can I match a character like \267, or 'dot' as I call it? And does it have a class name?

By matching \267 in the regex yourself. This might give some
insight... Unidecode!

> Lastly, 3 - By what character code or other means can I systematically identify accented characters, such as the accent grave in French?

If the character code is over 127, then it's not simple ASCII...

You might also find this plugin useful - http://github.com/rsl/stringex/tree - It will try to turn all that stuff into simple ASCII. You'll lose
the accents, etc., but that might be okay for what you're doing.
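
To make the three answers concrete, here is a hedged sketch. It assumes Ruby 2.2+ (for String#unicode_normalize) and text that has already been converted to UTF-8. The sample strings are invented for illustration.

```ruby
# 1 - "\267" is written in octal: 267 octal = 183 decimal = B7 hex.
code = "267".to_i(8)

# U+00B7 MIDDLE DOT has that same code point.
dot = "\u00B7"
same_code = (dot.ord == code)

# 2 - Match the dot in a regex by its code point.
verse = "12-7\u00B7Matthew"
parts = verse.split(/\u00B7/)

# 3 - Anything outside ASCII (code above 127) can be picked out with a class.
sample = "d\u00E9j\u00E0 vu"
non_ascii = sample.scan(/[^[:ascii:]]/)

# Accents specifically: decompose each letter into base + combining mark,
# then collect the combining marks.
marks = sample.unicode_normalize(:nfd).scan(/\p{Mn}/)
```

Matching by code point only works once the string is real UTF-8. On the raw Windows-1252 bytes, the regex would have to match the byte itself.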

Nik · June 24, 2009, 10:06pm

Hey Matt, thanks for your help!

Here's what I do:

work\ruby script/console

doc = `c:\\antiword.exe c:\\test.doc`

=>"\n This is a pipe \267 but this is not a pipe.\n\r"

Bakery.create(:description=>doc)

=> #<Bakery id: 55, created_at: "2009-06-24 18:01:03", updated_at: "2009-06-24 18:01:03", description: "\n This is a pipe \267 but this is not a pipe.\n\r">

Then I go to http://localhost:3000/bakeries/55, where show.html.erb is simply <%= @bakery.description %>, with @bakery = Bakery.find(params[:id]) in BakeriesController.

HTML output:

This is a pipe

That's the entire process. I would like to try your solution of translating the character encoding. Is it the same method Philip suggested above, using Iconv? If so, do I convert UTF-8 to LATIN1, or something else?

Thanks!

Matt_Jones · June 25, 2009, 4:17pm

Actually, after some more digging: you should first try the -m switch to antiword. The docs claim that:

antiword.exe -m utf-8 c:\test.doc

should convert the character set correctly. If nothing else, it should be easy to try out...
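
If you want Ruby to confirm that the switch did its job, something like the following might help. It is only a sketch: checked_utf8 is a hypothetical helper name, it assumes Ruby 1.9+ (where strings carry an encoding), and the antiword call in the comment reuses the paths from earlier in this thread.

```ruby
# Tag command output as UTF-8 and fail loudly if the bytes are not valid UTF-8.
def checked_utf8(output)
  text = output.dup.force_encoding('UTF-8')
  raise ArgumentError, 'output is not valid UTF-8' unless text.valid_encoding?
  text
end

# Intended use (not run here):
#   doc = checked_utf8(`c:\\antiword.exe -m utf-8 c:\\test.doc`)
```

Failing early here means bad bytes are caught before they reach the database or the view.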

--Matt Jones

Nik · June 26, 2009, 8:38am

Hey Matt!

That saved the day for me. I am terribly sorry to have brought this trouble up here. I did look for the documentation/reference/manual/instructions on Google, but only some obscure links turned up. If I learned anything at all from you all, it is to look first in the app's directory from now on.

Thank You! Case closed

Dan_Sharp · July 7, 2009, 9:45pm

Wow!

Someone else dealing with the exact same thing as me!

Matt: your suggestion to use the "-m utf-8" flag for antiword was exactly the right solution. Conceptually it makes the most sense, too. I.e.: "Convert this Word doc to UTF-8 and parse it into text" as the first step. Much much nicer!

It's good to know that Iconv could probably do the same thing later in the process, but it's nice to just handle it up-front and the resulting String object is already UTF-8. Whee!

Thank you! (My solution took much less than 38 hours, primarily thanks to this thread.)

-Danimal

Nik · July 27, 2009, 2:39am

Hey, I am glad that my little post helped!!
