[boulder.pm] FW: text extract

Jim Baker boulder-pm at jim-baker.com
Wed Jul 25 21:01:18 CDT 2001


Justin,

Going for a super-lite approach is certainly valid, more or less.  My
preferred way for this sort of problem is to digest the text (reduce each
line to a single character that says what kind of line it is), then apply a
powerful state machine (regex) against the digest.  Because the digest has
one character per line, a match offset in the digest is also a line index
into the data.  The advantage is that you can specify more complex needles,
such that the first haystack has to have two silver needles in it, followed
by a haystack with a golden needle, bracketed by haystacks containing red
needles.  Or any other interesting "sentence" that can be expressed in
Perl's expansive regex grammar.  For example, you could use this for
analyzing intrusion detection traces.

- Jim

use strict;
use warnings;

my $needle = shift;
my @data = <>;

# Construct digest of data
my $digest = '';  # start empty so empty input doesn't trigger undef warnings
foreach my $row (@data) {
    if ($row =~ /^-\s*$/) { $digest .= '-'; }
    elsif ($row =~ /^;\s*$/) { $digest .= ';'; }
    elsif ($row =~ /^\s*$/) { $digest .= ' '; }
    elsif ($row =~ /$needle/o) { $digest .= 'N'; }
    else { $digest .= 'x'; }
}

# Now look for our needle, and any data surrounding it
print STDERR "Looking for '$needle' in digest '$digest'\n";
if ($digest =~ /(-x*Nx*;)/) { # modify for more complex needles
    my $start = $-[0];   # $-[0] is the offset where the whole match begins
    my $end = $+[0] - 1; # $+[0] is the offset just past where it ends
    # One digest character per row, so these offsets index @data directly
    foreach my $i ($start .. $end) {
        print $data[$i];
    }
}
else {
    die "Needle not found\n";
}
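
As written, the script above prints only the first matching chunk.  A
minimal sketch of how the same digest idea might collect every matching
chunk, using a global match in a while loop, is below.  The helper name
matching_chunks and the sample rows are hypothetical, made up for
illustration; the digest characters and the pattern are the ones used above.

```perl
use strict;
use warnings;

# Hypothetical helper: return every chunk of @$rows whose digest matches
# -x*Nx*; (exactly one needle line between a '-' line and a ';' line).
sub matching_chunks {
    my ($needle, $rows) = @_;
    my $digest = '';
    foreach my $row (@$rows) {
        if    ($row =~ /^-\s*$/)  { $digest .= '-'; }
        elsif ($row =~ /^;\s*$/)  { $digest .= ';'; }
        elsif ($row =~ /^\s*$/)   { $digest .= ' '; }
        elsif ($row =~ /$needle/) { $digest .= 'N'; }
        else                      { $digest .= 'x'; }
    }
    my @chunks;
    # /g in a while loop resumes after the previous match each time around
    while ($digest =~ /-x*Nx*;/g) {
        # One digest character per row, so match offsets are row indices
        push @chunks, [ @{$rows}[ $-[0] .. $+[0] - 1 ] ];
    }
    return @chunks;
}

# Sample data (an assumption) in the '-' ... ';' layout from the thread
my @rows = map { "$_\n" }
    ('-', 'haystack', ';', '', '-', 'haystack', 'NEEDLE!', 'haystack', ';');
print @$_ for matching_chunks('NEEDLE', \@rows);
```

Each returned element is an array reference holding the rows of one chunk,
so the caller can print them, count them, or filter them further.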


-----Original Message-----
From: owner-boulder-pm-list at pm.org
[mailto:owner-boulder-pm-list at pm.org]On Behalf Of Justin Crawford
Sent: Wednesday, July 25, 2001 5:03 PM
To: 'boulder-pm-list at happyfunball.pm.org'
Subject: RE: [boulder.pm] FW: text extract


Thanks Chip.  My first example was misleading.  It's more like:

-
haystack
haystack
haystack
;

-
haystack
haystack
NEEDLE!
haystack
;

Still, I can see how that first solution could do it; I'll make that one go.
I was thinking there must be some way to do it using the range operators
(...) in combination with another pattern match, or using undef $/, but I
can't figure that way out if it exists.  Probably I'm trying too hard to be
superelite and not trying hard enough to get the thing writ...

Justin
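
For what it's worth, the range-operator idea mentioned above can probably
be made to work.  The following is only a sketch under assumptions: the
helper name chunks_with_needle and the sample lines are invented here, and
the input is assumed to be well formed (every '-' line is eventually closed
by a ';' line).  It relies on the documented behaviour that the scalar
range operator returns a sequence number with "E0" appended on the last
line of the range.

```perl
use strict;
use warnings;

# Hypothetical helper: collect lines from a '-' line through the next ';'
# line with the range operator, keeping only chunks that contain $needle.
sub chunks_with_needle {
    my ($needle, @lines) = @_;
    my (@chunk, @found);
    foreach my $line (@lines) {
        # In scalar context ... is a flip-flop: false until the left
        # pattern matches, then true through the line matching the right.
        if (my $state = ($line =~ /^-\s*$/ ... $line =~ /^;\s*$/)) {
            push @chunk, $line;
            # The sequence number ends in "E0" on the range's last line
            if ($state =~ /E0$/) {
                push @found, [@chunk] if grep { /$needle/ } @chunk;
                @chunk = ();
            }
        }
    }
    return @found;
}

# Sample data (an assumption) in the '-' ... ';' layout shown above
my @lines = map { "$_\n" }
    ('-', 'haystack', ';', '', '-', 'haystack', 'NEEDLE!', 'haystack', ';');
print @$_ for chunks_with_needle('NEEDLE', @lines);
```

The chunk is buffered until its closing line because the needle may appear
anywhere inside it, so nothing can be printed until the whole chunk has
been seen.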

-----Original Message-----
From: Chip Atkinson [mailto:chip at rmpg.org]
Sent: Wednesday, July 25, 2001 3:50 PM
To: 'boulder-pm-list at happyfunball.pm.org'
Subject: Re: [boulder.pm] FW: text extract


While perhaps not the best way, here's a way at least:

$start_looking = 0;

while (<>)
{
   if (/2a/)
   {
     $start_looking = 1;
     next;
   }

   if ($start_looking && /NEEDLE/)
   {
     print ("Found it\n");
     exit;
   }
}

Another possibility is to read in the entire file in slurp mode and look
for a pattern like /2a.*NEEDLE.*/.

Chip
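
A rough sketch of the slurp-mode possibility follows.  Note that a literal
/2a.*NEEDLE.*/ would need the /s modifier before . can cross newlines, and
a greedy .* could run past the end of the chunk.  The version below is one
way around both, under assumptions: the helper name slurped_chunks and the
sample text are hypothetical, and chunks are assumed to open with a line
like "2a" and close with the same number followed by "b".

```perl
use strict;
use warnings;

# Hypothetical helper: given the whole file as one string, return each
# "Na ... Nb" chunk that contains $needle.
sub slurped_chunks {
    my ($needle, $text) = @_;
    my @found;
    # /s lets . cross newlines, /m anchors ^ at each line start, and the
    # non-greedy .*? stops at the first closing marker with the same number.
    while ($text =~ /^((\d+)a\n.*?^\2b\n)/msg) {
        my $chunk = $1;
        push @found, $chunk if $chunk =~ /$needle/;
    }
    return @found;
}

# With a real file the string could come from slurp mode, for example:
#   my $text = do { local $/; <> };
# Sample text (an assumption) in the 1a/1b layout from the question
my $text = "1a\nhaystack\nhaystack\n1b\n\n"
         . "2a\nhaystack\nNEEDLE!!!\nhaystack\n2b\n";
print slurped_chunks('NEEDLE', $text);
```

Splitting the work into "find each chunk" and "check it for the needle"
keeps the regex short; doing both in one pattern is possible but harder to
read.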

On Wed, 25 Jul 2001, Justin Crawford wrote:

> Whoa, got ahead of myself there...
>
> I'm trying to extract multiple lines of data from a text file, only if one
> of the lines contains a string.  Picture a file like so:
>
> 1a
> haystack
> haystack
> haystack
> haystack
> 1b
>
> 2a
> haystack
> haystack
> NEEDLE!!!
> haystack
> 2b
>
> I want to cruise the text file getting every chunk that's like the one from
> 2a to 2b.
>
> What's the best way?
>
> Thanks!
>
> Justin Crawford
> Oracle DBA Team
> University of Colorado Management Systems
> 303-492-9083
>



