SPUG: Here's my de-contraction-ating program

Tim Maher tim at consultix-inc.com
Mon Oct 27 18:29:26 CST 2003


NOTE:  I thought I submitted this post over an hour ago,  but
it hasn't appeared, so maybe I didn't.  Here it is again (sortof).
-Tim

> On Sun, Oct 26, 2003 at 05:53:10PM -0800, Jeremy G Kahn wrote:
> 
> Yikes!  What an exhaustive analysis!   Thanks for documenting
> the problem so thoroughly.  I will *never* use a contraction
> again! 

I dashed off the following program this morning (WARNING: before
any caffeine!), but it seems to work, for my writing at least.
Note that it depends on "ispell" to prompt for manual corrections
in the ambiguous cases, which simplifies the script considerably.

But beware, YMMV!

-Tim
*------------------------------------------------------------*
| Tim Maher (206) 781-UNIX  (866) DOC-PERL  (866) DOC-UNIX   |
| tim(AT)Consultix-Inc.Com  TeachMeUnix.Com  TeachMePerl.Com |
*+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-*
|  UNIX Fundamentals Class: 11/10-13   Perl Class: 12/01-05  |
|  Watch for my Book: "Minimal Perl for Shell Programmers"   |
*------------------------------------------------------------*

#! /usr/bin/perl -wlp -s
# Usage: $0 [ -s ] [-debug[=23]] F -debug # for on-screen examination
# Usage: $0 [ -s ] [-debug[=23]] F > F2; ispell F2; mv F2 F # to change file

#########################################################+
#  Copyright 1997-2003, Tim Maher. All Rights Reserved  #+
# POB 70563, Seattle WA 98107 USA tim at consultix-inc.com #+
#########################################################+

# -s: enables processing of "'s" endings;
# they're replaced by " ${prefix}is", to encourage hand editing
# Tricky to handle automatically, because "Let's" means "Let us",
#     "Larry's got it" means "Larry has got it", etc.

our($debug,$s);    # makes switches "optional"; suppresses warning if missing

BEGIN {

    ! defined $debug  and  $debug = 0;    # 'cuz used in number ">"

    # words missing from contractions are lower-case, so z
    # not Z to get better suggestions from ispell
    $prefix = 'z';

    # Set to process those ending in "'s", other than "It's"
    $s_okay = defined $s;

    # $prefix causes "I'd" to be replaced by "I ${prefix}would";
    # Triggers ispell to suggest "would", and lets user notice
    # problem and correct to "I had", etc., if necessary

    $on = $off = "";
    if ($debug) {                      # for visible flagging of changes
        $on  = -t 1 ? `tput smso` : "{[";    # terminal vs. file
        $off = -t 1 ? `tput rmso` : "]}";
    }

    $MARKER{e} = $MARKER{c} = "";
    # to identify proessing type in debugging messages
    if ( $debug > 1 ) {
        $MARKER{e} = 'Ending: ';
        $MARKER{c} = 'Contraction: ';
    }

    # Maintainer: Mark word "X" in replacement as
    # "?X" to flag as needing author's attention,
    # when there are multiple possibilities for
    # missing word. ("That's" -> "That is", "that was", etc.)

    %endings = (
        "s"  => '?is',       # Can't differentiate from possessive
        "d"  => '?would',    # could also be had, did, etc.
        "ll" => 'will',
        "m"  => 'am',
        "re" => 'are',
        "re" => 'are',
        "t"  => 'not',
        "t"  => 'not',
        "ve" => 'have',
    );

    %contractions = (
        # NOTE: "He'll, She'll, They'll" handled through %endings

        # "it's" handled as exception, below
        # "it's" => 'it is',  # Never means possessive, despite common mistyping
        "that's"  => 'that ?is',
        "there's" => 'there ?is',
        "he's"    => 'he ?is',
        "here's"    => 'here is',
        "they're"   => 'they are',

        # Whose great idea was won't?  Handled as exceptional case
        # "won't"   => 'will not',
    );
}

$old = $_;    # Remember original

# Can't tell possessives like "X's" from "is" case,
# so skip them by explicitly avoiding matches with " 's " if requested

# For processing of possessives/contractions ending in "'s", use -s switch

$no_s_regex = '[a-rt-z]';
$s_regex    = '[a-z]';

$single_char_ending = $s_okay ? $s_regex : $no_s_regex;

# This one don't need no stinking rule-based processing!
s/\b (i)t's \b/$on$1t is$off/ixg ;

if (

  # FORMAT: if ( s/RE/computed replacement/e ) { report stuff }

  s/ \b (\w+) ' ($single_char_ending | [a-z]{2,}) \b /
        $match = $&;    # remember, so can do other matches without trashing
        $part1 = $1;    # remember, so can do other matches without trashing
        $part2 = $2;    # remember, so can do other matches without trashing

        $debug > 2  and  warn "Line $.: 1: $part1, 2: $part2\n";
        $debug > 1  and  warn "Match:  [$match]\n";

        if ( $match =~ m|^(c)an't$|i ) {    # this is exceptional case
            $debug > 1  and  warn "Matched \"can't\" case\n";
            $replacement = $1 . 'an not' ;
        }
        elsif ( $match =~ m|^(w)on't$|i ) { # exceptional case
            $debug > 1  and  warn "Matched \"won't\" case\n";
            $replacement = $1 . 'ill not' ;
        }
        elsif ( defined $endings{$part2} ) {
            $debug > 1  and
                warn "Processing ending of '$endings{$part2}'\n";
            # remove "n't" in "haven't" before adding " not", etc.
            $chop = $part1 =~ m|n$| ? 1 : 0 ;
            $replacement =
                    $MARKER{e} .
                    substr ($part1, 0, (( length $part1) - $chop ) ) .
                    " " .  $endings{$part2} ;
        }
        elsif ( defined $contractions{$match} ) {
            warn "Processing contraction for $match\n";
            $replacement = "$MARKER{c}$contractions{$match}" ;
            $part1 =~ m|^[A-Z]|  and  # match case of original
                $replacement = "\l$replacement";
        }
        else {
            $debug > 1  and  warn "Skipping $match\n";
            $replacement = $match ;
        }
        # Change ? to prefix to get ispell's attention and
        # trigger appropriate correcton suggestions
        $replacement =~ s|\?|$prefix|g;

        $on . $replacement . $off   # assert replacement value
    /xegi ) {

    # Following only shows last sub of group in same record, but
    # that's good enough for debugging

    # Reverse-video highlighting of substitutions on screen (from
    # -debug) actually seems sufficient now

    $debug > 1  and  warn "Old string: $old\n";
    $debug > 1  and  warn "Changed to New string: $replacement\n";
    $debug > 1  and  warn "New string: $_\n";
}

-Tim
*------------------------------------------------------------*
| Tim Maher (206) 781-UNIX  (866) DOC-PERL  (866) DOC-UNIX   |
| tim(AT)Consultix-Inc.Com  TeachMeUnix.Com  TeachMePerl.Com |
*+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-*
|  UNIX Fundamentals Class: 11/10-13   Perl Class: 12/01-05  |
|  Watch for my Book: "Minimal Perl for Shell Programmers"   |
*------------------------------------------------------------*



More information about the spug-list mailing list