[Melbourne-pm] grep independant of newlines (Was Re: case insensitive REs)
Tim Connors
tconnors at astro.swin.edu.au
Wed May 7 03:35:59 PDT 2008
On Wed, 7 May 2008, Tim Connors wrote:
> G'day.
>
> I want the user to be able to supply a -i flag to my program to make
> global case insensitive searching.
Yeehaw.
My day job always seems to come back to LaTeX code. grepping for stuff
that has been nicely folded at the 72 column mark is a pain, because grep
usually looks at just the one line. The sed & awk book had a recipe for
phrasegrep, looking over two consequetive lines at once. But had a few
bugs that I worked around over the years, and if your regexp ought to
match things over 3 lines, you were out of luck. Well, it now works :)
Feel free to appropriate as you choose (or this is where I usually get
told what program I should have been using instead of reinventing the
wheel :-):
#!/usr/bin/perl -w
# -*- Mode: perl -*-
# $Revision: 1.10 $ $Date: 2008/05/07 10:27:35 $
# $Id: phrasegrep,v 1.10 2008/05/07 10:27:35 tconnors Exp $
# $Header: /home/ssi/tconnors/cvsroot/bin/phrasegrep,v 1.10 2008/05/07 10:27:35 tconnors Exp $
# $RCSfile: phrasegrep,v $
# greps for a re in files without regards for newlines.
use strict;
use warnings;
use Carp::Assert;
use Getopt::Long;
Getopt::Long::Configure ("bundling");
use Pod::Usage;
my $verbose=0;
my $debug=0;
my $colour="tty";
my $case=0;
my $greedy=0;
my $VERSION='$Revision: 1.10 $';
$VERSION=~s/\$[R]evision: ([^ ].*[^ ]) *\$/$1/;
my $DATE='$Date: 2008/05/07 10:27:35 $';
$DATE=~s/\$[D]ate: ([^ ].*[^ ]) *\$/$1/;
my $FILE='$RCSfile: phrasegrep,v $';
$FILE=~s/\$[R]CSfile: ([^ ].*[^ ]),v *\$/$1/;
my $WHAT="greps for a re in files without regards for newlines";
my (@SAVEARGV)=@ARGV;
sub isNum($) {
($_[0] =~ /^[+-]?\d+$/);
}
my $getOptVerbose = sub {
my ($junk, $v)=(@_);
$v=$verbose+1 if ($v eq "");
die "verbosity level is not a number: $v\n" if (!isNum $v);
$verbose=$v;
};
my $getOptDebug = sub {
my ($junk, $d)=(@_);
$d=$debug+1 if ($d eq "");
die "debug level is not a number: $d\n" if (!isNum $d);
$debug=$d;
};
my $getOptColour = sub {
my ($junk, $c)=(@_);
$c=1 if ($c eq ""); #could also be "tty"
$colour=$c;
};
my ($opt_help, $opt_man, $opt_version);
my $result = GetOptions ('colour:s' => $getOptColour,
'debug:s' => $getOptDebug,
'verbose:s' => $getOptVerbose,
'c' => sub { $colour=1 },
'd' => sub { $debug++},
'v' => sub { $verbose++ },
'nocolour' => sub { $colour = 0 },
'i|case!' => \$case,
'g|greedy!' => \$greedy,
'help|?|h' => \$opt_help,
'man' => \$opt_man,
'version|V' => \$opt_version,
) || pod2usage(2);
pod2usage(1) if ($opt_help);
pod2usage(-verbose => 2) if ($opt_man);
#pod2usage(-verbose => 0) if ($opt_version);
if ($opt_version) {
print "$FILE ($WHAT) $VERSION ($DATE)\n";
print "Copyright Tim Connors (2002-2008)\n";
print "License: GPL\n";
print "Author(s): Tim Connors <twc+nospam\@thanks+aaocbn.aao.gov.au\n";
exit 1;
}
## Check for not enough args
pod2usage("$0: Not enough parameters. Supply at least a regexp\n") if (@ARGV == 0);
my $re=shift;
my $manyfiles=(@ARGV > 1);
@ARGV='-' if (!@ARGV);
my $colopen="";
my $colclose="";
if ($colour eq "tty") {
if (-t STDOUT) {
$colour=1 ;
} else {
$colour=0;
}
}
if ($colour) {
$colopen="\033[1;31m";
$colclose="\033[0m";
}
print STDERR "transforming match re from '$re' to " if $verbose;
$re =~ s/ /\\s+/g; #spaces in the match always get
# transformed into whitespace matches
$re =~ s/([*+])/$1?/g if !$greedy; #use non greedy matches by default
$re = "(?i)$re" if !$case; #case insensitive by default
$re = "($re)";
print STDERR "'$re'\n" if $verbose;
foreach my $file (@ARGV) {
my $incfilename= $manyfiles ? "$file:" : "";
if (!open(FH, $file)) {
warn "can't open $file for read";
next;
}
local $/;
undef $/; #slurp input files
my $input = <FH>;
$_=$input;
#to be able to match occurences on overlapping lines, log the start
#and end of the line where each match occurs, as well as, for
#colouring purposes, where the matches themselves start and end
my @nlmatch=();
my @eolmatch=();
my @startmatch=();
my @endmatch=();
while (/$re/goms) {
#man perlretut(1): "@-" and "@+"
push @startmatch, $-[0];
push @endmatch, $+[0];
my $curpos=$-[0];
while ($curpos > 0) {
if ((substr $input, $curpos, 1) eq "\n") {
$curpos++;
last;
}
$curpos--;
}
push @nlmatch, $curpos;
$curpos=$+[0];
while ($curpos < length($input)-1) {
if ((substr $input, $curpos, 1) eq "\n") {
$curpos--;
last;
}
$curpos++;
}
push @eolmatch, $curpos;
}
print "nl=@nlmatch\n" if $verbose;
print "eol=@eolmatch\n" if $verbose;
print "s=@startmatch\n" if $verbose;
print "e=@endmatch\n" if $verbose;
my $curpos;
my $length;
$curpos=$nlmatch[0];
foreach my $i (0.. at nlmatch) {
#iterate through each of the matches of the regexp, and if a new
#line, then print the start of the line to the start of the next
#re, print the colours and that re, then print the line to the
#next re if same line...
if (($i>0) && (($i==@nlmatch) || ($nlmatch[$i] != $nlmatch[$i-1]))) {
print STDERR "new line: $i " if $verbose;
$length = $eolmatch[$i-1] - $endmatch[$i-1] + 2; #+1 to get the nl
print STDERR "length: $length\n" if $verbose;
print substr($input, $curpos, $length);
$curpos=$nlmatch[$i];
}
last if ($i == @nlmatch);
$length = $startmatch[$i] - $curpos;
print substr($input, $curpos, $length);
$curpos += $length;
print $colopen;
$length = $endmatch[$i] - $startmatch[$i];
print substr($input, $curpos, $length);
$curpos += $length;
print $colclose;
}
##/m -- ^/$ becomes start/end of any line
##/s may also be necessary
#previous attempts:
# #this doesn't yet match multiple occurences on overlapping sets of
# #lines. This makes me sad. The third bracket somehow has to
# #exclude the second
# while (/(\n?[^\n]*?)($re)([^\n]*?\n?)/msg) { #$case
# # while (/^([^\n]*?)($re)([^\n]*?)$/msg) { #$case
# my $match = "$incfilename$1$colopen$2$colclose$3";
# print "$match";
# }
}
# $Log: phrasegrep,v $
# Revision 1.10 2008/05/07 10:27:35 tconnors
# licence information
#
# Revision 1.9 2008/05/07 10:24:33 tconnors
# non-greedy match by default
#
# Revision 1.8 2008/05/07 10:14:20 tconnors
# no need to transform \n in a temporary string -- already knew about matching /sm modifiers, but in this iteration of the code, couldnt quite see what I was doing
#
# Revision 1.7 2008/05/07 06:54:06 tconnors
# port to perl, and suck in the entire files at once so can compare over more than 2 lines at a time
#
__END__
=head1 NAME
phrasegrep - greps for a re in files without regards for newlines
=head1 SYNOPSIS
phrasegrep [options] <re> <files>
Options:
[--help|-?|-h]
[--man]
[--version|-V]
[--colour <yes|auto|no>|--nocolour|-c]
[--debug <level>|-d]
[--verbose <level>|-v]
[--case|-i]
[--greedy|-g]
=head1 OPTIONS
=over 8
=item B<--help|-h|-?>
Print a brief help message and exits.
=item B<--man>
Prints the manual page and exits.
=item B<--version|-V>
Prints version information and exits.
=item B<--colour {yes|auto|no}|--nocolour|-c>
STDIO uses colour always, only when STDOUT is a terminal, or never
=item B<--debug {level}|-d>
Sets or increments the debug level. Current level is 1
=item B<--verbose {level}|-d>
Sets or increments the verbosity level. Current level is 1
=item B<--case|-i>
Performs a case sensetive regexp search
=item B<--greedy|-g>
Performs the default perl greedy match instead of non greedy
=back
=head1 DESCRIPTION
B<phrasegrep> greps for a re in files without regards for newlines
=cut
--
Tim Connors
More information about the Melbourne-pm
mailing list