[HRPM] windoze word parsing

chicks at chicks.net
Thu Oct 26 08:52:18 CDT 2000


On Thu, 26 Oct 2000, Troy E. Webster wrote:

> I need to open up an MS Word 2000 document and parse through it,
> stripping out all the extraneous control characters and
> non-printables. The end result will be an HTML document with custom
> formatting. Has anyone done this before? Does anyone have any ideas
> for approaching this? Any advice besides the usual RTFM?

I use mswordview which I installed as an RPM:

Name        : mswordview
Version     : 0.5.2
Release     : 1
Group       : Utilities/Text
Size        : 2137284
License     : GPL
Vendor      : Caolan McNamara <Caolan.McNamara at ul.ie>
Packager    : Ryan Weaver <ryanw at infohwy.com>
URL         : http://www.csn.ul.ie/~caolan/docs/MSWordView.html
Summary     : MSWord 8 binary file format -> HTML converter
Description :
MSWordView is a program that understands the Microsoft Word 8
binary file format (Office97) and is able to convert Word
documents into HTML, which can then be read with a browser.

It was written for the Word 97 format, so it does OK with some Word 2000
documents and complains about others. YMMV.
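
For the "stripping out control characters and non-printables" part of the
question, here is a minimal sketch of a cleanup pass you could run over
text after it has been converted. It is not part of mswordview; the function
name strip_nonprintables and the choice to keep tab, newline and carriage
return are my own assumptions, and it is written in Python purely for
illustration.

```python
import re

# ASCII control characters, excluding tab (\x09), newline (\x0a) and
# carriage return (\x0d), plus DEL (\x7f). Which characters count as
# "extraneous" is an assumption; adjust the class to taste.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_nonprintables(text: str) -> str:
    """Return text with ASCII control characters removed.

    Tabs, newlines and carriage returns are kept so the line
    structure of the converted document survives.
    """
    return _CONTROL_CHARS.sub("", text)


if __name__ == "__main__":
    sample = "Hello\x07, \x13world\x00!\n"
    print(repr(strip_nonprintables(sample)))
```

The same idea is a one-line substitution in most languages with regular
expressions; the only real decision is which characters you want to keep.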

> PS: nice meeting last night, I learned a lot from Matt's Tk talk

It was very well done.  He's going to turn the slides into HTML after
making some minor corrections, and we'll post them on norfolk.pm.org.

-- 
</chris>

"The number of Unix installations has grown to 10, with more expected." 
           -- The Unix Programmer's Manual, 2nd edition, June '72



