[Melbourne-pm] grep independant of newlines (Was Re: case insensitive REs)

Tim Connors tconnors at astro.swin.edu.au
Wed May 7 03:35:59 PDT 2008

On Wed, 7 May 2008, Tim Connors wrote:

> G'day.
> I want the user to be able to supply a -i flag to my program to make 
> global case insensitive searching.


My day job always seems to come back to LaTeX code.  grepping for stuff 
that has been nicely folded at the 72 column mark is a pain, because grep 
usually looks at just the one line.  The sed & awk book had a recipe for 
phrasegrep, looking over two consequetive lines at once.  But had a few 
bugs that I worked around over the years, and if your regexp ought to 
match things over 3 lines, you were out of luck.  Well, it now works :)

Feel free to appropriate as you choose (or this is where I usually get 
told what program I should have been using instead of reinventing the 
wheel :-):

#!/usr/bin/perl -w
# -*- Mode: perl -*-

# $Revision: 1.10 $ $Date: 2008/05/07 10:27:35 $
# $Id: phrasegrep,v 1.10 2008/05/07 10:27:35 tconnors Exp $
# $Header: /home/ssi/tconnors/cvsroot/bin/phrasegrep,v 1.10 2008/05/07 10:27:35 tconnors Exp $
# $RCSfile: phrasegrep,v $

# greps for a re in files without regards for newlines.

use strict;
use warnings;
use Carp::Assert;
use Getopt::Long;
Getopt::Long::Configure ("bundling");
use Pod::Usage;

my $verbose=0;
my $debug=0;
my $colour="tty";
my $case=0;
my $greedy=0;

my $VERSION='$Revision: 1.10 $';
$VERSION=~s/\$[R]evision: ([^ ].*[^ ]) *\$/$1/;
my $DATE='$Date: 2008/05/07 10:27:35 $';
$DATE=~s/\$[D]ate: ([^ ].*[^ ]) *\$/$1/;
my $FILE='$RCSfile: phrasegrep,v $';
$FILE=~s/\$[R]CSfile: ([^ ].*[^ ]),v *\$/$1/;
my $WHAT="greps for a re in files without regards for newlines";


sub isNum($) {
  ($_[0] =~ /^[+-]?\d+$/);

my $getOptVerbose = sub {
  my ($junk, $v)=(@_);
  $v=$verbose+1 if ($v eq "");
  die "verbosity level is not a number: $v\n" if (!isNum $v);

my $getOptDebug = sub {
  my ($junk, $d)=(@_);
  $d=$debug+1 if ($d eq "");
  die "debug level is not a number: $d\n" if (!isNum $d);

my $getOptColour = sub {
  my ($junk, $c)=(@_);
  $c=1 if ($c eq "");   #could also be "tty"

my ($opt_help, $opt_man, $opt_version);
my $result = GetOptions ('colour:s' => $getOptColour,
                         'debug:s' => $getOptDebug,
                         'verbose:s' => $getOptVerbose,
                         'c' => sub { $colour=1 },
                         'd' => sub { $debug++},
                         'v' => sub { $verbose++ },
                         'nocolour' => sub { $colour = 0 },
                         'i|case!' => \$case,
                         'g|greedy!' => \$greedy,
                         'help|?|h' => \$opt_help,
                         'man' => \$opt_man,
                         'version|V' => \$opt_version,
                        ) || pod2usage(2);

pod2usage(1) if ($opt_help);
pod2usage(-verbose => 2) if ($opt_man);
#pod2usage(-verbose => 0) if ($opt_version);
if ($opt_version) {
  print "$FILE ($WHAT) $VERSION ($DATE)\n";
  print "Copyright Tim Connors (2002-2008)\n";
  print "License: GPL\n";
  print "Author(s): Tim Connors <twc+nospam\@thanks+aaocbn.aao.gov.au\n";
  exit 1;
## Check for not enough args
pod2usage("$0: Not enough parameters.  Supply at least a regexp\n")  if (@ARGV == 0);

my $re=shift;
my $manyfiles=(@ARGV > 1);
@ARGV='-' if (!@ARGV);

my $colopen="";
my $colclose="";
if ($colour eq "tty") {
  if (-t STDOUT) {
    $colour=1 ;
  } else {
if ($colour) {

print STDERR "transforming match re from '$re' to " if $verbose;
$re =~ s/ /\\s+/g;                        #spaces in the match always get 
                                          #  transformed into whitespace matches
$re =~ s/([*+])/$1?/g if !$greedy;        #use non greedy matches by default
$re = "(?i)$re" if !$case;                #case insensitive by default
$re = "($re)";
print STDERR "'$re'\n" if $verbose;

foreach my $file (@ARGV) {
  my $incfilename= $manyfiles ? "$file:" : "";
  if (!open(FH, $file)) {
    warn "can't open $file for read";
  local $/;
  undef $/;  #slurp input files

  my $input = <FH>;

  #to be able to match occurences on overlapping lines, log the start
  #and end of the line where each match occurs, as well as, for
  #colouring purposes, where the matches themselves start and end
  my @nlmatch=();
  my @eolmatch=();
  my @startmatch=();
  my @endmatch=();
  while (/$re/goms) {
    #man perlretut(1): "@-" and "@+"
    push @startmatch, $-[0];
    push @endmatch, $+[0];

    my $curpos=$-[0];
    while ($curpos > 0) {
      if ((substr $input, $curpos, 1) eq "\n") {
    push @nlmatch, $curpos;
    while ($curpos < length($input)-1) {
      if ((substr $input, $curpos, 1) eq "\n") {
    push @eolmatch, $curpos;

  print "nl=@nlmatch\n" if $verbose;
  print "eol=@eolmatch\n" if $verbose;
  print "s=@startmatch\n" if $verbose;
  print "e=@endmatch\n" if $verbose;

  my $curpos;
  my $length;
  foreach my $i (0.. at nlmatch) {
    #iterate through each of the matches of the regexp, and if a new
    #line, then print the start of the line to the start of the next
    #re, print the colours and that re, then print the line to the
    #next re if same line...

    if (($i>0) && (($i==@nlmatch) || ($nlmatch[$i] != $nlmatch[$i-1]))) {
      print STDERR "new line: $i " if $verbose;
      $length = $eolmatch[$i-1] - $endmatch[$i-1] + 2; #+1 to get the nl
      print STDERR "length: $length\n" if $verbose;
      print substr($input, $curpos, $length);

    last if ($i == @nlmatch);

    $length = $startmatch[$i] - $curpos;
    print substr($input, $curpos, $length);
    $curpos += $length;

    print $colopen;
    $length = $endmatch[$i] - $startmatch[$i];
    print substr($input, $curpos, $length);
    $curpos += $length;
    print $colclose;

##/m -- ^/$ becomes start/end of any line
##/s may also be necessary

  #previous attempts:

  #   #this doesn't yet match multiple occurences on overlapping sets of
  #   #lines.  This makes me sad.  The third bracket somehow has to
  #   #exclude the second

  #   while (/(\n?[^\n]*?)($re)([^\n]*?\n?)/msg) {  #$case
  # #  while (/^([^\n]*?)($re)([^\n]*?)$/msg) {  #$case
  #     my $match = "$incfilename$1$colopen$2$colclose$3";
  #     print "$match";
  #   }

# $Log: phrasegrep,v $
# Revision 1.10  2008/05/07 10:27:35  tconnors
# licence information
# Revision 1.9  2008/05/07 10:24:33  tconnors
# non-greedy match by default
# Revision 1.8  2008/05/07 10:14:20  tconnors
# no need to transform \n in a temporary string -- already knew about matching /sm modifiers, but in this iteration of the code, couldnt quite see what I was doing
# Revision 1.7  2008/05/07 06:54:06  tconnors
# port to perl, and suck in the entire files at once so can compare over more than 2 lines at a time


=head1 NAME

phrasegrep - greps for a re in files without regards for newlines


phrasegrep [options] <re> <files>

[--colour <yes|auto|no>|--nocolour|-c]
[--debug <level>|-d]
[--verbose <level>|-v]

=head1 OPTIONS

=over 8

=item B<--help|-h|-?>

Print a brief help message and exits.

=item B<--man>

Prints the manual page and exits.

=item B<--version|-V>

Prints version information and exits.

=item B<--colour {yes|auto|no}|--nocolour|-c>

STDIO uses colour always, only when STDOUT is a terminal, or never

=item B<--debug {level}|-d>

Sets or increments the debug level.  Current level is 1

=item B<--verbose {level}|-d>

Sets or increments the verbosity level.  Current level is 1

=item B<--case|-i>

Performs a case sensetive regexp search

=item B<--greedy|-g>

Performs the default perl greedy match instead of non greedy



B<phrasegrep> greps for a re in files without regards for newlines


Tim Connors

More information about the Melbourne-pm mailing list