SPUG: Here's my de-contraction-ating program
Tim Maher
tim at consultix-inc.com
Mon Oct 27 18:29:26 CST 2003
NOTE: I thought I submitted this post over an hour ago, but
it hasn't appeared, so maybe I didn't. Here it is again (sortof).
-Tim
> On Sun, Oct 26, 2003 at 05:53:10PM -0800, Jeremy G Kahn wrote:
>
> Yikes! What an exhaustive analysis! Thanks for documenting
> the problem so thoroughly. I will *never* use a contraction
> again!
I dashed off the following program this morning (WARNING: before
any caffeine!), but it seems to work, for my writing at least.
Note that it depends on "ispell" to prompt for manual corrections
in the ambiguous cases, which simplifies the script considerably.
But beware, YMMV!
-Tim
*------------------------------------------------------------*
| Tim Maher (206) 781-UNIX (866) DOC-PERL (866) DOC-UNIX |
| tim(AT)Consultix-Inc.Com TeachMeUnix.Com TeachMePerl.Com |
*+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-*
| UNIX Fundamentals Class: 11/10-13 Perl Class: 12/01-05 |
| Watch for my Book: "Minimal Perl for Shell Programmers" |
*------------------------------------------------------------*
#! /usr/bin/perl -wlp -s
# Usage: $0 [ -s ] [-debug[=23]] F -debug # for on-screen examination
# Usage: $0 [ -s ] [-debug[=23]] F > F2; ispell F2; mv F2 F # to change file
#########################################################+
# Copyright 1997-2003, Tim Maher. All Rights Reserved #+
# POB 70563, Seattle WA 98107 USA tim at consultix-inc.com #+
#########################################################+
# -s: enables processing of "'s" endings;
# they're replaced by " ${prefix}is", to encourage hand editing
# Tricky to handle automatically, because "Let's" means "Let us",
# "Larry's got it" means "Larry has got it", etc.
our($debug,$s); # makes switches "optional"; suppresses warning if missing
BEGIN {
! defined $debug and $debug = 0; # 'cuz used in number ">"
# words missing from contractions are lower-case, so z
# not Z to get better suggestions from ispell
$prefix = 'z';
# Set to process those ending in "'s", other than "It's"
$s_okay = defined $s;
# $prefix causes "I'd" to be replaced by "I ${prefix}would";
# Triggers ispell to suggest "would", and lets user notice
# problem and correct to "I had", etc., if necessary
$on = $off = "";
if ($debug) { # for visible flagging of changes
$on = -t 1 ? `tput smso` : "{["; # terminal vs. file
$off = -t 1 ? `tput rmso` : "]}";
}
$MARKER{e} = $MARKER{c} = "";
# to identify proessing type in debugging messages
if ( $debug > 1 ) {
$MARKER{e} = 'Ending: ';
$MARKER{c} = 'Contraction: ';
}
# Maintainer: Mark word "X" in replacement as
# "?X" to flag as needing author's attention,
# when there are multiple possibilities for
# missing word. ("That's" -> "That is", "that was", etc.)
%endings = (
"s" => '?is', # Can't differentiate from possessive
"d" => '?would', # could also be had, did, etc.
"ll" => 'will',
"m" => 'am',
"re" => 'are',
"re" => 'are',
"t" => 'not',
"t" => 'not',
"ve" => 'have',
);
%contractions = (
# NOTE: "He'll, She'll, They'll" handled through %endings
# "it's" handled as exception, below
# "it's" => 'it is', # Never means possessive, despite common mistyping
"that's" => 'that ?is',
"there's" => 'there ?is',
"he's" => 'he ?is',
"here's" => 'here is',
"they're" => 'they are',
# Whose great idea was won't? Handled as exceptional case
# "won't" => 'will not',
);
}
$old = $_; # Remember original
# Can't tell possessives like "X's" from "is" case,
# so skip them by explicitly avoiding matches with " 's " if requested
# For processing of possessives/contractions ending in "'s", use -s switch
$no_s_regex = '[a-rt-z]';
$s_regex = '[a-z]';
$single_char_ending = $s_okay ? $s_regex : $no_s_regex;
# This one don't need no stinking rule-based processing!
s/\b (i)t's \b/$on$1t is$off/ixg ;
if (
# FORMAT: if ( s/RE/computed replacement/e ) { report stuff }
s/ \b (\w+) ' ($single_char_ending | [a-z]{2,}) \b /
$match = $&; # remember, so can do other matches without trashing
$part1 = $1; # remember, so can do other matches without trashing
$part2 = $2; # remember, so can do other matches without trashing
$debug > 2 and warn "Line $.: 1: $part1, 2: $part2\n";
$debug > 1 and warn "Match: [$match]\n";
if ( $match =~ m|^(c)an't$|i ) { # this is exceptional case
$debug > 1 and warn "Matched \"can't\" case\n";
$replacement = $1 . 'an not' ;
}
elsif ( $match =~ m|^(w)on't$|i ) { # exceptional case
$debug > 1 and warn "Matched \"won't\" case\n";
$replacement = $1 . 'ill not' ;
}
elsif ( defined $endings{$part2} ) {
$debug > 1 and
warn "Processing ending of '$endings{$part2}'\n";
# remove "n't" in "haven't" before adding " not", etc.
$chop = $part1 =~ m|n$| ? 1 : 0 ;
$replacement =
$MARKER{e} .
substr ($part1, 0, (( length $part1) - $chop ) ) .
" " . $endings{$part2} ;
}
elsif ( defined $contractions{$match} ) {
warn "Processing contraction for $match\n";
$replacement = "$MARKER{c}$contractions{$match}" ;
$part1 =~ m|^[A-Z]| and # match case of original
$replacement = "\l$replacement";
}
else {
$debug > 1 and warn "Skipping $match\n";
$replacement = $match ;
}
# Change ? to prefix to get ispell's attention and
# trigger appropriate correcton suggestions
$replacement =~ s|\?|$prefix|g;
$on . $replacement . $off # assert replacement value
/xegi ) {
# Following only shows last sub of group in same record, but
# that's good enough for debugging
# Reverse-video highlighting of substitutions on screen (from
# -debug) actually seems sufficient now
$debug > 1 and warn "Old string: $old\n";
$debug > 1 and warn "Changed to New string: $replacement\n";
$debug > 1 and warn "New string: $_\n";
}
-Tim
*------------------------------------------------------------*
| Tim Maher (206) 781-UNIX (866) DOC-PERL (866) DOC-UNIX |
| tim(AT)Consultix-Inc.Com TeachMeUnix.Com TeachMePerl.Com |
*+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-*
| UNIX Fundamentals Class: 11/10-13 Perl Class: 12/01-05 |
| Watch for my Book: "Minimal Perl for Shell Programmers" |
*------------------------------------------------------------*
More information about the spug-list
mailing list