Won't display characters following '\267'

Hello!

I use MySQL and have made sure it is set to UTF-8, and the character set in my view is also UTF-8. But when I display text in a textarea that came from either antiword.exe or WIN32OLE output of an MS Word document, the text fails to show immediately after a strange character that shows up in the Rails console as \267. I went back to Word to see what it is (I looked it up by its position). It is a dot floating in the middle of the line, like the one used in Bible chapter references, e.g. 12-7[dot]Matthew.

For example:

Rails Console:

    >> doc = "This is a pipe, but \267 this is not a pipe"

HTML:

    <p>
      This is a pipe, but
    </p>

It just sort of STOPS rendering the rest of the text.

I can't possibly ask my clients to remove that character just for my convenience. I have been on a 38-hour hunt for a solution.

Some say to remove all [^[:print:]] matches, which I can do while finding a way to at least preserve the \n and \r characters. But I also want to preserve as much of the original document as possible. What if they use umlauts, like an o with two dots on top?

Any ideas?

Thank You!

You could try...

require 'iconv'

clean_str = Iconv.new('UTF-8//Ignore', 'UTF-8').iconv(messy_str)

It doesn't always work though... you might need to catch
Iconv::InvalidCharacter...

Worth a try though and has gotten me out of some of this mess with bad
source data.
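
For readers on a newer Ruby: Iconv was removed from the standard library in Ruby 2.0, so the require above will fail there. Below is a minimal sketch of a comparable clean-up, assuming Ruby 2.1 or later and using String#scrub. The helper name drop_invalid_utf8 is made up for illustration. Note that it discards the offending bytes rather than translating them, so the dot itself is lost.

```ruby
# Hypothetical helper: strip byte sequences that are not valid UTF-8.
# Assumes Ruby 2.1+, where String#scrub is available.
def drop_invalid_utf8(messy_str)
  messy_str.dup.force_encoding('UTF-8').scrub('')
end

# "\xB7" is the same byte that the console shows as \267.
messy = "This is a pipe, but \xB7 this is not a pipe"
puts drop_invalid_utf8(messy)
```

This only makes the string safe to render. If the bytes are really Windows-1252 or ISO-8859-1, transcoding them keeps more of the original document than dropping them.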

Thanks Phillip for your help!

I just tried it and it works great! It displays that dot thing. But my regular expressions do not account for these characters, and some of them now fail where these characters appear.

1 - I'm not even sure what the right question to ask is, but what do you call \267? Is it hex, octal, or decimal?

2 - How can I match a character like \267 (the 'dot', as I call it)? Does it belong to a character class?

3 - Lastly, by what character code or other means can I systematically identify accented characters, such as the grave accent in French?

Thank You!

You really need to translate the character encoding on that data. Rails is assuming that it's UTF-8, when (from your description of the character) it's either Windows-1252 or (possibly) ISO-8859-1. Your previous problem was the default UTF-8 parser giving up, as \267 (B7 hex) is only valid in UTF-8 as a continuation byte inside a multibyte sequence.
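
A rough sketch of that translation, assuming the source bytes are Windows-1252 and a Ruby version with built-in string encodings (1.9 or later, where String#encode replaces Iconv). The helper name cp1252_to_utf8 is hypothetical.

```ruby
# Hypothetical helper: label the raw bytes as Windows-1252, then transcode to UTF-8.
# Assumption: the input really is Windows-1252. Bytes that the code page leaves
# undefined may raise an Encoding error.
def cp1252_to_utf8(raw)
  raw.dup.force_encoding('Windows-1252').encode('UTF-8')
end

# Raw bytes as the external tool might return them (0xB7 is octal \267).
raw = "This is a pipe, but \xB7 this is not a pipe".b

# Treated as UTF-8, the lone 0xB7 byte makes the string invalid.
puts raw.dup.force_encoding('UTF-8').valid_encoding?

fixed = cp1252_to_utf8(raw)
puts fixed.valid_encoding?
puts fixed
```

The direction matters: the data goes from the legacy code page to UTF-8, not from UTF-8 to Latin-1.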

--Matt Jones

1 - I'm not even sure what the right question to ask is, but what do you call \267? Is it hex, octal, or decimal?

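\267 is an octal escape for a single byte (B7 hex, 183 decimal). In ISO-8859-1 and Windows-1252 that byte is the middle dot, which is Unicode character U+00B7. In UTF-8 the same character is multi-byte: the two bytes \302\267.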
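
To see the different notations for that byte side by side, here is a small sketch. The helper name byte_notations is made up for illustration.

```ruby
# Hypothetical helper: show one byte in decimal, hex, and octal.
def byte_notations(single_byte_str)
  byte = single_byte_str.b.bytes.first
  { decimal: byte, hex: byte.to_s(16), octal: byte.to_s(8) }
end

# "\267" is an octal escape for a single byte.
p byte_notations("\267")

# The middle dot as a UTF-8 character is two bytes, shown here in octal.
p "\u00B7".bytes.map { |b| b.to_s(8) }
```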

2 - How can I match a character like \267 (the 'dot', as I call it)? Does it belong to a character class?

By matching the character via \267 yourself. This might give some insight... Unidecode!

3 - Lastly, by what character code or other means can I systematically identify accented characters, such as the grave accent in French?

If the character code is over 127, then it's not plain ASCII.
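
A minimal sketch of both checks, assuming the text has already been converted to valid UTF-8 (matching against raw Windows-1252 bytes would need different patterns). The helper names are hypothetical.

```ruby
# Hypothetical helpers; all assume the input string is valid UTF-8.

# Every character outside the 7-bit ASCII range (code point above 127).
def non_ascii_chars(text)
  text.scan(/[^\x00-\x7F]/)
end

# True if the text contains the middle dot, U+00B7.
def contains_middle_dot?(text)
  !(text =~ /\u00B7/).nil?
end

# Runs of letters; the \p{L} property also covers accented letters.
def letter_runs(text)
  text.scan(/\p{L}+/)
end

sample = "d\u00E9j\u00E0 vu \u00B7 \u00FCber"
p non_ascii_chars(sample)
p contains_middle_dot?(sample)
p letter_runs(sample)
```

Using \p{L} in existing patterns instead of [a-zA-Z] is one way to keep them from failing on accented characters.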

You might also find this plugin useful - http://github.com/rsl/stringex/tree - it will try to turn all that stuff into simple ASCII. You'll lose the accents, etc., but that might be okay for what you're doing.

Hey Matt, thanks for your help!

Here's what I do:

work\ruby script/console

doc = `c:\\antiword.exe c:\\test.doc`

=>"\n This is a pipe \267 but this is not a pipe.\n\r"

Bakery.create(:description=>doc)

=> #<Bakery id: 55, created_at: "2009-06-24 18:01:03", updated_at: "2009-06-24 18:01:03", description: "\n This is a pipe \267 but this is not a pipe.\n\r">

Then I go to http://localhost:3000/bakeries/55, where show.html.erb is simply <p> <%= @bakery.description %> </p> with @bakery = Bakery.find(params[:id]) in Bakeries_Controller

HTML output: <p>

       This is a pipe

</p>

That's the entire process. I would like to try your suggestion of translating the character encoding. Is it the same method Phillip suggested above, using Iconv? If so, do I convert UTF-8 to LATIN1, or something else?

Thanks!

Actually, after some more digging, you should first try the -m switch to antiword. The docs claim that:

antiword.exe -m utf-8 c:\test.doc

should convert the character set correctly. If nothing else, it should be easy to try out...
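
From Ruby, that call could be wrapped roughly as follows. This is a sketch only: it assumes antiword is on the PATH (pass the full path otherwise), it takes the -m utf-8 argument as given above, and the method names are made up for illustration.

```ruby
require 'open3'

# Hypothetical helper: build the argument list for antiword.
# The "-m utf-8" argument is taken from the command line suggested above.
def antiword_command(doc_path, exe = 'antiword')
  [exe, '-m', 'utf-8', doc_path]
end

# Hypothetical helper: run antiword and return its output tagged as UTF-8.
# Assumption: the executable exists and writes the document text to stdout.
def word_to_utf8(doc_path, exe = 'antiword')
  output, status = Open3.capture2(*antiword_command(doc_path, exe))
  raise "antiword failed for #{doc_path}" unless status.success?
  output.force_encoding('UTF-8')
end

# Example call (needs antiword and a real file):
#   doc = word_to_utf8('c:\\test.doc', 'c:\\antiword.exe')
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.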

--Matt Jones

Hey Matt!

That saved the day for me. I am terribly sorry to have brought this trouble up here. I did look for the documentation/reference/manual/instructions on Google, but only some obscure links turned up. If I learned anything at all from you all, it's to look first in the app's directory from now on.

Thank You! Case closed

Wow!

Someone else dealing with the exact same thing as me!

Matt: your suggestion to use the "-m utf-8" flag for antiword was exactly the right solution. Conceptually it makes the most sense, too. I.e.: "Convert this Word doc to UTF-8 and parse it into text" as the first step. Much much nicer!

It's good to know that Iconv could probably do the same thing later in the process, but it's nice to just handle it up-front and the resulting String object is already UTF-8. Whee!

Thank you! (my solution was much less than 38 hours, primarily thanks to this thread)

-Danimal

Hey, I am glad that my little post helped!!