[boulder.pm] FW: text extract
Jim Baker
boulder-pm at jim-baker.com
Wed Jul 25 21:01:18 CDT 2001
Justin,
It's perfectly valid to go for a super-lite approach, more or less. My
preferred way for this sort of problem is to digest the text (reduce each
line to a single character), then apply a powerful state machine (a regex)
against the digest. The advantage is that you can specify more complex
needles: say, the first haystack has to have two silver needles in it,
followed by a haystack with a golden needle, bracketed by haystacks
containing red needles. Or any other interesting "sentence" that can be
expressed in Perl's expansive regex grammar. For example, you could use
this for analyzing intrusion detection traces.
- Jim
use strict;
use warnings;

my $needle = shift;
my @data = <>;

# Construct digest of data: one character per input line
my $digest = '';
foreach my $row (@data) {
    if    ($row =~ /^-\s*$/)   { $digest .= '-'; }  # record start
    elsif ($row =~ /^;\s*$/)   { $digest .= ';'; }  # record end
    elsif ($row =~ /^\s*$/)    { $digest .= ' '; }  # blank line
    elsif ($row =~ /$needle/o) { $digest .= 'N'; }  # line containing the needle
    else                       { $digest .= 'x'; }  # plain haystack
}

# Now look for our needle, and any data surrounding it
print STDERR "Looking for '$needle' in digest '$digest'\n";
if ($digest =~ /(-x*Nx*;)/) {   # modify for more complex needles
    my $start = $-[0];          # $-[0] is the offset where the match starts
    my $end   = $+[0] - 1;      # $+[0] is the offset just past the match
    foreach my $i ($start .. $end) {
        print $data[$i];
    }
}
else {
    die "Needle not found\n";
}
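To illustrate the "more complex needles" idea, here is a minimal sketch, not a finished tool. It assumes the same "-" / ";" record delimiters as above. The sub names digest_lines and find_spans, the letters S and G, and the silver/gold patterns are all made up for the example. It also assumes no needle letter collides with the 'x', '-', ';' or ' ' digest characters.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce each line to one digest character.
# $needles maps a single letter to a compiled pattern (illustrative names).
sub digest_lines {
    my ($lines, $needles) = @_;
    my $digest = '';
    foreach my $row (@$lines) {
        my $char = 'x';                          # default: plain haystack
        if    ($row =~ /^-\s*$/) { $char = '-' } # record start
        elsif ($row =~ /^;\s*$/) { $char = ';' } # record end
        elsif ($row =~ /^\s*$/)  { $char = ' ' } # blank line
        else {
            foreach my $letter (sort keys %$needles) {
                if ($row =~ $needles->{$letter}) { $char = $letter; last; }
            }
        }
        $digest .= $char;
    }
    return $digest;
}

# Hypothetical helper: [first, last] line indexes for every match of
# $sentence in the digest. One digest character corresponds to one line.
sub find_spans {
    my ($digest, $sentence) = @_;
    my @spans;
    while ($digest =~ /$sentence/g) {
        push @spans, [ $-[0], $+[0] - 1 ];
    }
    return @spans;
}

my @data   = map { "$_\n" } qw(- hay silver silver ; - hay gold ; - hay ;);
my $digest = digest_lines(\@data, { S => qr/silver/, G => qr/gold/ });

# A record with two silver needles, immediately followed by a record
# with one gold needle.
my $sentence = qr/-x*Sx*Sx*;-x*Gx*;/;
foreach my $span (find_spans($digest, $sentence)) {
    print @data[ $span->[0] .. $span->[1] ];
}
```

Since each digest character stands for exactly one input line, the match offsets can be used directly as line indexes, which is what makes the digest trick work.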
-----Original Message-----
From: owner-boulder-pm-list at pm.org
[mailto:owner-boulder-pm-list at pm.org]On Behalf Of Justin Crawford
Sent: Wednesday, July 25, 2001 5:03 PM
To: 'boulder-pm-list at happyfunball.pm.org'
Subject: RE: [boulder.pm] FW: text extract
Thanks Chip. My first example was misleading. It's more like:
-
haystack
haystack
haystack
;
-
haystack
haystack
NEEDLE!
haystack
;
Still, I can see how that first solution could do it; I'll make that one go.
I was thinking there must be some way to do it using the range operators
(...) in combination with another pattern match, or using undef $/, but I
can't figure that way out if it exists. Probably I'm trying too hard to be
superelite and not trying hard enough to get the thing writ...
Justin
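For what it's worth, the range operator idea can probably be made to work. Below is a minimal sketch, assuming each record starts with a line containing only "-" and ends with a line containing only ";", as in the example above. The sub name records_with_needle is hypothetical. In scalar context the ".." operator acts as a flip-flop, and its return value ends in "E0" on the last line of the range, which is used here to know when a record is complete.

```perl
use strict;
use warnings;

# Hypothetical helper: return every "-" ... ";" record containing $needle.
sub records_with_needle {
    my ($needle, @lines) = @_;
    my (@found, @record);
    foreach (@lines) {
        # True from a "-" line through the next ";" line, false elsewhere.
        my $in_record = /^-\s*$/ .. /^;\s*$/;
        next unless $in_record;
        push @record, $_;
        # The flip-flop's sequence number ends in "E0" on the closing line.
        if ($in_record =~ /E0$/) {
            push @found, [@record] if grep { /$needle/ } @record;
            @record = ();
        }
    }
    return @found;
}

my @lines = map { "$_\n" }
    ('-', 'haystack', 'haystack', ';', '-', 'haystack', 'NEEDLE!', 'haystack', ';');
print @$_ for records_with_needle(qr/NEEDLE/, @lines);
```

One caveat: the flip-flop keeps its state between calls, so input that ends in the middle of a record would leave it "on" for the next call.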
-----Original Message-----
From: Chip Atkinson [mailto:chip at rmpg.org]
Sent: Wednesday, July 25, 2001 3:50 PM
To: 'boulder-pm-list at happyfunball.pm.org'
Subject: Re: [boulder.pm] FW: text extract
While perhaps not the best way, here's a way at least
$start_looking = 0;
while (<>)
{
    if (/2a/)
    {
        $start_looking = 1;
        next;
    }
    if ($start_looking && /NEEDLE/)
    {
        print ("Found it\n");
        exit;
    }
}
Another possibility is to read in the entire file in slurp mode and look
for a pattern like /2a.*NEEDLE/s (the /s modifier lets . match newlines).
Chip
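A minimal sketch of the slurp-mode idea, assuming the "Na" ... "Nb" chunk layout from the original question, where the closing line carries the same number as the opening line. The sub name chunks_with_needle and the sample text are made up for the example. Two details matter: "." does not match a newline unless the /s modifier is used, and a greedy ".*" would run across chunk boundaries, so the sketch matches each chunk non-greedily and then checks it for the needle.

```perl
use strict;
use warnings;

# Hypothetical helper: return every "Na" ... "Nb" chunk containing $needle.
sub chunks_with_needle {
    my ($text, $needle) = @_;
    my @chunks;
    # /m anchors ^ at each line start, /s lets . cross newlines,
    # .*? stops at the first closing line with the same number (\2).
    while ($text =~ /(^(\d+)a\n.*?^\2b\n)/msg) {
        my $chunk = $1;   # copy before running another match
        push @chunks, $chunk if $chunk =~ /$needle/;
    }
    return @chunks;
}

my $sample = "1a\nhaystack\nhaystack\n1b\n\n"
           . "2a\nhaystack\nNEEDLE!!!\nhaystack\n2b\n";

# Slurp from an in-memory filehandle so the example is self-contained.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $text = do { local $/; <$fh> };   # undefined $/ reads everything at once
close $fh;

print for chunks_with_needle($text, qr/NEEDLE/);
```

Slurping is simple but holds the whole file in memory, so for very large files a line-by-line approach may be the better fit.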
On Wed, 25 Jul 2001, Justin Crawford wrote:
> Whoa, got ahead of myself there...
>
> I'm trying to extract multiple lines of data from a text file, only if one
> of the lines contains a string. Picture a file like so:
>
> 1a
> haystack
> haystack
> haystack
> haystack
> 1b
>
> 2a
> haystack
> haystack
> NEEDLE!!!
> haystack
> 2b
>
> I want to cruise the text file getting every chunk that's like the one
> from 2a to 2b.
>
> What's the best way?
>
> Thanks!
>
> Justin Crawford
> Oracle DBA Team
> University of Colorado Management Systems
> 303-492-9083
>