[Melbourne-pm] grep independant of newlines (Was Re: case insensitive REs)

Wed May 7 03:35:59 PDT 2008

On Wed, 7 May 2008, Tim Connors wrote:

> G'day.
> 
> I want the user to be able to supply a -i flag to my program to make 
> global case insensitive searching.

Yeehaw.

My day job always seems to come back to LaTeX code.  grepping for stuff 
that has been nicely folded at the 72 column mark is a pain, because grep 
usually looks at just the one line.  The sed & awk book had a recipe for 
phrasegrep, looking over two consequetive lines at once.  But had a few 
bugs that I worked around over the years, and if your regexp ought to 
match things over 3 lines, you were out of luck.  Well, it now works :)

Feel free to appropriate as you choose (or this is where I usually get 
told what program I should have been using instead of reinventing the 
wheel :-):

#!/usr/bin/perl -w
# -*- Mode: perl -*-

# $Revision: 1.10 $ $Date: 2008/05/07 10:27:35 $
# $Id: phrasegrep,v 1.10 2008/05/07 10:27:35 tconnors Exp $
# $Header: /home/ssi/tconnors/cvsroot/bin/phrasegrep,v 1.10 2008/05/07 10:27:35 tconnors Exp $
# $RCSfile: phrasegrep,v $

# greps for a re in files without regards for newlines.

use strict;
use warnings;
use Carp::Assert;
use Getopt::Long;
Getopt::Long::Configure ("bundling");
use Pod::Usage;

my $verbose=0;
my $debug=0;
my $colour="tty";
my $case=0;
my $greedy=0;

my $VERSION='$Revision: 1.10 $';
$VERSION=~s/\$[R]evision: ([^ ].*[^ ]) *\$/$1/;
my $DATE='$Date: 2008/05/07 10:27:35 $';
$DATE=~s/\$[D]ate: ([^ ].*[^ ]) *\$/$1/;
my $FILE='$RCSfile: phrasegrep,v $';
$FILE=~s/\$[R]CSfile: ([^ ].*[^ ]),v *\$/$1/;
my $WHAT="greps for a re in files without regards for newlines";

my (@SAVEARGV)=@ARGV;

sub isNum($) {
  ($_[0] =~ /^[+-]?\d+$/);
}

my $getOptVerbose = sub {
  my ($junk, $v)=(@_);
  $v=$verbose+1 if ($v eq "");
  die "verbosity level is not a number: $v\n" if (!isNum $v);
  $verbose=$v;
};

my $getOptDebug = sub {
  my ($junk, $d)=(@_);
  $d=$debug+1 if ($d eq "");
  die "debug level is not a number: $d\n" if (!isNum $d);
  $debug=$d;
};

my $getOptColour = sub {
  my ($junk, $c)=(@_);
  $c=1 if ($c eq "");   #could also be "tty"
  $colour=$c;
};

my ($opt_help, $opt_man, $opt_version);
my $result = GetOptions ('colour:s' => $getOptColour,
                         'debug:s' => $getOptDebug,
                         'verbose:s' => $getOptVerbose,
                         'c' => sub { $colour=1 },
                         'd' => sub { $debug++},
                         'v' => sub { $verbose++ },
                         'nocolour' => sub { $colour = 0 },
                         'i|case!' => \$case,
                         'g|greedy!' => \$greedy,
                         'help|?|h' => \$opt_help,
                         'man' => \$opt_man,
                         'version|V' => \$opt_version,
                        ) || pod2usage(2);

pod2usage(1) if ($opt_help);
pod2usage(-verbose => 2) if ($opt_man);
#pod2usage(-verbose => 0) if ($opt_version);
if ($opt_version) {
  print "$FILE ($WHAT) $VERSION ($DATE)\n";
  print "Copyright Tim Connors (2002-2008)\n";
  print "License: GPL\n";
  print "Author(s): Tim Connors <twc+nospam\@thanks+aaocbn.aao.gov.au\n";
  exit 1;
}
## Check for not enough args
pod2usage("$0: Not enough parameters.  Supply at least a regexp\n")  if (@ARGV == 0);

my $re=shift;
my $manyfiles=(@ARGV > 1);
@ARGV='-' if (!@ARGV);

my $colopen="";
my $colclose="";
if ($colour eq "tty") {
  if (-t STDOUT) {
    $colour=1 ;
  } else {
    $colour=0;
  }
}
if ($colour) {
  $colopen="\033[1;31m";
  $colclose="\033[0m";
}

print STDERR "transforming match re from '$re' to " if $verbose;
$re =~ s/ /\\s+/g;                        #spaces in the match always get 
                                          #  transformed into whitespace matches
$re =~ s/([*+])/$1?/g if !$greedy;        #use non greedy matches by default
$re = "(?i)$re" if !$case;                #case insensitive by default
$re = "($re)";
print STDERR "'$re'\n" if $verbose;

foreach my $file (@ARGV) {
  my $incfilename= $manyfiles ? "$file:" : "";
  if (!open(FH, $file)) {
    warn "can't open $file for read";
    next;
  }
  local $/;
  undef $/;  #slurp input files

  my $input = <FH>;
  $_=$input;

  #to be able to match occurences on overlapping lines, log the start
  #and end of the line where each match occurs, as well as, for
  #colouring purposes, where the matches themselves start and end
  my @nlmatch=();
  my @eolmatch=();
  my @startmatch=();
  my @endmatch=();
  while (/$re/goms) {
    #man perlretut(1): "@-" and "@+"
    push @startmatch, $-[0];
    push @endmatch, $+[0];

    my $curpos=$-[0];
    while ($curpos > 0) {
      if ((substr $input, $curpos, 1) eq "\n") {
        $curpos++;
        last;
      }
      $curpos--;
    }
    push @nlmatch, $curpos;
    $curpos=$+[0];
    while ($curpos < length($input)-1) {
      if ((substr $input, $curpos, 1) eq "\n") {
        $curpos--;
        last;
      }
      $curpos++;
    }
    push @eolmatch, $curpos;
  }

  print "nl=@nlmatch\n" if $verbose;
  print "eol=@eolmatch\n" if $verbose;
  print "s=@startmatch\n" if $verbose;
  print "e=@endmatch\n" if $verbose;

  my $curpos;
  my $length;
  $curpos=$nlmatch[0];
  foreach my $i (0.. at nlmatch) {
    #iterate through each of the matches of the regexp, and if a new
    #line, then print the start of the line to the start of the next
    #re, print the colours and that re, then print the line to the
    #next re if same line...

    if (($i>0) && (($i==@nlmatch) || ($nlmatch[$i] != $nlmatch[$i-1]))) {
      print STDERR "new line: $i " if $verbose;
      $length = $eolmatch[$i-1] - $endmatch[$i-1] + 2; #+1 to get the nl
      print STDERR "length: $length\n" if $verbose;
      print substr($input, $curpos, $length);

      $curpos=$nlmatch[$i];
    }
    last if ($i == @nlmatch);

    $length = $startmatch[$i] - $curpos;
    print substr($input, $curpos, $length);
    $curpos += $length;

    print $colopen;
    $length = $endmatch[$i] - $startmatch[$i];
    print substr($input, $curpos, $length);
    $curpos += $length;
    print $colclose;
  }

##/m -- ^/$ becomes start/end of any line
##/s may also be necessary

  #previous attempts:

  #   #this doesn't yet match multiple occurences on overlapping sets of
  #   #lines.  This makes me sad.  The third bracket somehow has to
  #   #exclude the second

  #   while (/(\n?[^\n]*?)($re)([^\n]*?\n?)/msg) {  #$case
  # #  while (/^([^\n]*?)($re)([^\n]*?)$/msg) {  #$case
  #     my $match = "$incfilename$1$colopen$2$colclose$3";
  #     print "$match";
  #   }
}

# $Log: phrasegrep,v $
# Revision 1.10  2008/05/07 10:27:35  tconnors
# licence information
#
# Revision 1.9  2008/05/07 10:24:33  tconnors
# non-greedy match by default
#
# Revision 1.8  2008/05/07 10:14:20  tconnors
# no need to transform \n in a temporary string -- already knew about matching /sm modifiers, but in this iteration of the code, couldnt quite see what I was doing
#
# Revision 1.7  2008/05/07 06:54:06  tconnors
# port to perl, and suck in the entire files at once so can compare over more than 2 lines at a time
#

__END__

=head1 NAME

phrasegrep - greps for a re in files without regards for newlines

=head1 SYNOPSIS

phrasegrep [options] <re> <files>

Options:
[--help|-?|-h]
[--man]
[--version|-V]
[--colour <yes|auto|no>|--nocolour|-c]
[--debug <level>|-d]
[--verbose <level>|-v]
[--case|-i]
[--greedy|-g]

=head1 OPTIONS

=over 8

=item B<--help|-h|-?>

Print a brief help message and exits.

=item B<--man>

Prints the manual page and exits.

=item B<--version|-V>

Prints version information and exits.

=item B<--colour {yes|auto|no}|--nocolour|-c>

STDIO uses colour always, only when STDOUT is a terminal, or never

=item B<--debug {level}|-d>

Sets or increments the debug level.  Current level is 1

=item B<--verbose {level}|-d>

Sets or increments the verbosity level.  Current level is 1

=item B<--case|-i>

Performs a case sensetive regexp search

=item B<--greedy|-g>

Performs the default perl greedy match instead of non greedy

=back

=head1 DESCRIPTION

B<phrasegrep> greps for a re in files without regards for newlines

=cut

-- 
Tim Connors