SPUG: Modifying Word *.doc files using s///g

Tim Maher tim at consultix-inc.com
Tue Mar 28 11:47:55 PST 2006


SPUGsters,

I need to modify the contents of some "index tag" entries
in a large number of MS-Word files. I was worried that this 
might be difficult--or impossible--but it turned out to
be relatively easy. Go Perl!

As you might expect, Word sees a modified document
as corrupted if you do anything to disrupt its internal
consistency checks (changing the byte count, for instance),
but simply moving text from one place to another (at least
within an "Index Tag") doesn't bother it a bit!
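
To make the byte-count point concrete, here is a minimal sketch that
runs the same kind of substitution on a made-up tag held in a string
rather than in a real file. The sample entry text and the helper name
move_marker are invented for illustration; only the \023 XE "..." \025
layout and the <$startrange> marker come from the script below.

```perl
use strict;
use warnings;

# Hypothetical helper: move $marker from the right end of each index
# tag entry to its left end, returning the edited copy.
sub move_marker {
	my ($data, $marker) = @_;
	$data =~ s/(\023 +XE +")([^\023\025]+)\Q$marker\E(" +\025)/$1$marker$2$3/g;
	return $data;
}

# Made-up sample: one index tag whose entry ends with the range marker
my $before = qq{text\023 XE "widgets<\$startrange>" \025more text};
my $after  = move_marker($before, '<$startrange>');

# The marker has moved, but the byte count should be unchanged
printf "before: %d bytes, after: %d bytes\n", length $before, length $after;
```

Because the marker is only relocated inside the tag, the two lengths
printed should match, which is the property the real script depends on.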

Here's the script, for those who might find it useful.

I don't mess with binary files more than once a decade or
so, so I'd be interested in any comments on how I might
have done this better or more portably.  FYI, I've only
run the script on Linux, on *.doc files.

#! /usr/bin/perl -w
# Tim Maher, tim at TeachMePerl.com
# Tue Mar 28 11:36:22 PST 2006
# word_edit: for moving <$startrange>, <$endrange> from right-end of
# MS-Word index tag, where it doesn't hurt index display order, to left
# end, where FrameMaker needs to find it
# NOTE: Word will see output document as corrupted if byte count changes!
# Luckily, I don't need to do that right now ...

use strict;
use File::Basename qw(fileparse);

$/ = undef;	# slurp mode: read each file in a single gulp

foreach my $f (@ARGV) {
	# Write the output next to the input, with "o" prefixed to the file name
	my ($name, $dir) = fileparse($f);
	my $of = "${dir}o$name";

	open my $in,  '<', $f  or die "$0: Failed to open $f: $!\n";
	open my $out, '>', $of or die "$0: Failed to open $of: $!\n";

	binmode $in  or die "$0: binmode error: $!\n";
	binmode $out or die "$0: binmode error: $!\n";

	my $data = <$in>;	# whole file; no record separator complications
	warn "Bytes read: ", length $data, "\n";
	(-s $f) == length $data or die "$0: Short read on $f\n";

	# Index Tag starter is \023, ender is \025
	# Format: \023 XE "stuff" \025

	# Move $X at right-end of "stuff" to its left end
	for my $X ('<$startrange>', '<$endrange>') {
		$data =~
		s/
			(\023\ +XE\ +")	# tag-starting code
			([^\023\025]+)	# tag entry
			\Q$X\E		# range marker to move
			("\ +\025)	# tag-ending code
		/$1$X$2$3/gx;
	}
	warn "Bytes to write: ", length $data, "\n";
	print $out $data or die "$0: Failed to write $of: $!\n";
	close $in  or warn "$0: Failed to close $f: $!\n";
	close $out or warn "$0: Failed to close $of: $!\n";
	warn "Bytes written: ", -s $of, "\n";
}
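
Before editing a batch of files, it can help to see which index tags a
document actually contains. The following is a rough sketch that relies
on the same assumption as the script above (tags look like
\023 XE "stuff" \025). The function name index_entries and the sample
data are hypothetical, and it has not been checked against the full
range of fields Word can produce.

```perl
use strict;
use warnings;

# Hypothetical helper: return the quoted text of every index tag in $data,
# assuming the \023 XE "stuff" \025 layout described above.
sub index_entries {
	my ($data) = @_;
	return $data =~ /\023 +XE +"([^\023\025]+)" +\025/g;
}

# Made-up sample data standing in for the slurped contents of a *.doc file
my $sample = qq{intro\023 XE "apples<\$startrange>" \025body}
           . qq{\023 XE "pears" \025end};

# Print one entry per line
print "$_\n" for index_entries($sample);
```

On a real file, $sample would be replaced by data read in binmode with
$/ undefined, as in the script.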
*-------------------------------------------------------------------*
|  Tim Maher, PhD  (206) 781-UNIX   (866) DOC-PERL  (866) DOC-UNIX  |
|  tim at ( Consultix-Inc, TeachMePerl, or TeachMeUnix ) dot Com    |
*-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-*
|Classes: 4/10 Shell & Utilities  4/25 Object-O Perl  5/15 Perl/CGI |
| Watch for my upcoming book: "Minimal Perl for UNIX/Linux People"  |
|  See MinimalPerl.com for details, ordering, and email-list signup |
*-------------------------------------------------------------------*

