SPUG: Day meeting in Bellevue

John Costello cos at indeterminate.net
Wed Dec 7 15:39:16 PST 2005

On Wed, 7 Dec 2005, DeRykus, Charles E wrote:
> On Wed, Dec 07, 2005 at 12:50:19PM -0800, John Costello wrote:
> > Duane,
> > In advance of the meeting: could someone point me to an app
> > (preferable) or C library (less preferable, but oh well) that decodes
> > MS Word docs?
> > John
> > -----
> > John Costello - cos at indeterminate dot net
> >>As someone who's been recently forced to convert a large manuscript into Word 
> >>(my upcoming Perl book), I suddenly find myself in need of a grep-like utility 
> >>for Word docs.
> >>I'd naturally prefer an Open Source, Perlish solution, but I'd consider other 
> >>options that do the job well. Apart from using regexes to match and extract plain 
> >>text, I'd like to match text by /attributes/ such as style and font in
> >>addition to character patterns. (JGsoft's $149 "powergrep" sounds like "strings
> >>file.doc | grep 'pattern'", which isn't quite good enough.)
> >>I know Word has a built-in "find" utility with its own (lame) regex dialect, but 
> >>I need to automate my searches, not babysit them with mouse in hand.
> May not help, but the open source 'antiword' does a better job than 'strings' at
> yanking text out of Word while preserving formatting. Feeding the stream into Perl
> should be a win in many cases...

Funny you mention that.  I've been looking at antiword's code today, to
see what it can divulge from a Word doc, but didn't pay attention to its
output methods.  I assumed it just imported documents.  Silly of me to 
make that assumption, but I'm blaming my nascent head cold.
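
For what it's worth, here is a rough sketch of the "feed antiword's output
into a script" idea from the quoted message. It is written in Python only
for illustration, and it assumes the antiword binary is installed and on
PATH. The function names (grep_lines, antiword_text) are made up for this
example; they are not part of antiword or of any library.

```python
import re
import subprocess
import sys
from typing import Iterable, Iterator, List, Tuple


def grep_lines(lines: Iterable[str], pattern: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line that matches the regex."""
    regex = re.compile(pattern)
    for number, line in enumerate(lines, start=1):
        if regex.search(line):
            yield number, line.rstrip("\n")


def antiword_text(path: str) -> List[str]:
    """Return antiword's plain-text rendering of a Word file, as lines.

    Assumption: 'antiword <file>' writes the extracted text to stdout.
    """
    result = subprocess.run(
        ["antiword", path], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Usage (hypothetical script name): python docgrep.py PATTERN FILE.doc
    pattern, doc_path = sys.argv[1], sys.argv[2]
    for number, line in grep_lines(antiword_text(doc_path), pattern):
        print(f"{number}: {line}")
```

Note that this only searches the extracted plain text. It does nothing for
matching on style or font attributes, which is the harder part of the
original request.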

> --
> Charles DeRykus

John Costello - cos at indeterminate dot net
"You cannot propel yourself forward by patting yourself on the back."--Unknown
