How to Parse a Microsoft Word Document

Hi People,

I just joined the group and I have a question.
I'm still learning Ruby on Rails, and I have a task to parse
Microsoft Word documents and store their content in a database.

Do you have any suggestions on how to do it?

FYI, I'm developing in a Unix environment, so I can't use
win32ole there, CMIIW.

I have also searched the internet about this, but all I found is that I
need to either use JRuby combined with Apache POI, or use win32ole.
As far as I know, to use JRuby I would need to create the Rails
project with JRuby as well, but unfortunately I already created the
project with plain Ruby.

So I don't know what to do anymore. Does anybody have a clue?

Regards,

Hafiz Badrie Lubis

You can run POI as a separate process and then grab its output.

I did a project in PHP quite a few years ago, and I used some venerable Unix command-line converters to do this. I stored the files as-is, then used these converters to extract their text and stored that in the database for searching. They aren't perfect, but they do a good enough job for search results.

// These translators all write to stdout, which means that
// shell_exec will return their text value.
$translators = array(
  'pdf' => '/usr/local/bin/pdftotext ./pdf/%s.pdf -',
  'ppt' => '/usr/local/bin/catppt -d ascii ./ppt/%s.ppt',
  'xls' => '/usr/local/bin/xls2csv -d ascii ./xls/%s.xls',
  'doc' => '/usr/local/bin/catdoc -d ascii ./doc/%s.doc'
);

Walter
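
Since the question is about Rails, the same approach could look roughly like this in plain Ruby. This is only a sketch: it assumes the same converters are installed and on the PATH, and the CONVERTERS table and both helper names are made up for illustration, not part of any library.

```ruby
require "open3"

# Hypothetical lookup table mirroring the PHP array above.
# Each entry builds an argument list, so no shell quoting is needed.
CONVERTERS = {
  "pdf" => ->(path) { ["pdftotext", path, "-"] },
  "ppt" => ->(path) { ["catppt", "-d", "ascii", path] },
  "xls" => ->(path) { ["xls2csv", "-d", "ascii", path] },
  "doc" => ->(path) { ["catdoc", "-d", "ascii", path] }
}.freeze

# Pick the converter command from the file extension.
def converter_command(path)
  ext = File.extname(path).delete_prefix(".").downcase
  builder = CONVERTERS.fetch(ext) do
    raise ArgumentError, "no converter for .#{ext} files"
  end
  builder.call(path)
end

# Run the converter and return whatever it wrote to stdout.
def extract_text(path)
  stdout, status = Open3.capture2(*converter_command(path))
  raise "conversion of #{path} failed" unless status.success?
  stdout
end
```

In a Rails app, something like extract_text("docs/report.doc") would return a string that can be saved to a text column for searching, while the original file is kept on disk.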

1. Convert .doc to .pdf with PyODConverter
http://www.artofsolving.com/opensource/pyodconverter

2. Convert .pdf to .tiff with ImageMagick

3. Process .tiff through Tesseract OCR and get .txt
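
The three steps above can be sketched as a chain of external commands. Treat this as a rough outline under several assumptions: PyODConverter's DocumentConverter.py is in the working directory, an OpenOffice instance is already running for it to talk to, and ImageMagick's convert and tesseract are on the PATH. The helper names and the 300 dpi setting are my own choices.

```ruby
# Build (but do not run) the commands for the doc -> pdf -> tiff -> txt route.
# converter_script is an assumed location for PyODConverter's script.
def ocr_pipeline_commands(doc_path, converter_script: "DocumentConverter.py")
  base = doc_path.sub(/\.doc\z/i, "")
  pdf  = "#{base}.pdf"
  tiff = "#{base}.tiff"
  [
    ["python", converter_script, doc_path, pdf],  # step 1: .doc to .pdf
    ["convert", "-density", "300", pdf, tiff],    # step 2: .pdf to .tiff
    ["tesseract", tiff, base]                     # step 3: writes "#{base}.txt"
  ]
end

# Run each step in order and return the recognized text.
def run_ocr_pipeline(doc_path)
  ocr_pipeline_commands(doc_path).each do |cmd|
    system(*cmd) or raise "command failed: #{cmd.join(' ')}"
  end
  File.read("#{doc_path.sub(/\.doc\z/i, '')}.txt")
end
```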

Can you show me how to do it? Do you have a reference
for making a Rails project work together with JRuby code?

It has nothing whatsoever to do with JRuby. You can run Java apps from Ruby exactly like any other command-line process. I don't know whether POI is just a library or also comes with a command-line utility. If it's just a library, you'd have to write the program yourself, probably a half-dozen lines of Java.
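
To make that concrete, here is a minimal sketch of calling a Java program from plain Ruby and capturing its output. The jar name word-extractor.jar is hypothetical; it stands in for whatever small POI-based program you end up writing, which is assumed to print the document text to stdout.

```ruby
require "open3"

# Run any external command and return its stdout.
# Raises if the process exits with a non-zero status.
def run_and_capture(*command)
  stdout, stderr, status = Open3.capture3(*command)
  unless status.success?
    raise "#{command.first} exited with #{status.exitstatus}: #{stderr.strip}"
  end
  stdout
end

# Hypothetical wrapper: assumes a jar that takes a .doc path as its
# only argument and prints the extracted text.
def extract_word_text(doc_path, jar: "word-extractor.jar")
  run_and_capture("java", "-jar", jar, doc_path)
end
```

The Rails app itself stays on plain Ruby; only the child process runs on the JVM.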

Wow, talk about a long slow way to potentially lose text flow and introduce errors...

Ok thank you, Scott.
I'll try your advice.

I'm coming to this late and I've partially deleted the thread, so I may be way off base...

An old plugin might be of help:

https://github.com/kete/convert_attachment_to

It makes use of existing command line converter utility programs.

Cheers,
Walter