From Justin.Crawford at cusys.edu  Wed Jul 25 17:35:12 2001
From: Justin.Crawford at cusys.edu (Justin Crawford)
Date: Wed Aug  4 23:58:33 2004
Subject: [boulder.pm] text extract
Message-ID: <A1329B9A3F28D411AFAB00A0C9E3B002F82E3A@exchange.cusys.edu>

Hi all-

I'm trying to extract multiple lines of data from a text file, only if one
of the lines contains a string.  Picture a file like so:

a
haystack
haystack
b

a
rough
DIAMOND!
From Justin.Crawford at cusys.edu  Wed Jul 25 17:39:02 2001
From: Justin.Crawford at cusys.edu (Justin Crawford)
Date: Wed Aug  4 23:58:33 2004
Subject: [boulder.pm] FW: text extract
Message-ID: <A1329B9A3F28D411AFAB00A0C9E3B002F82E3B@exchange.cusys.edu>

Whoa, got ahead of myself there...

I'm trying to extract multiple lines of data from a text file, only if one
of the lines contains a string.  Picture a file like so:

1a
haystack
haystack
haystack
haystack
1b

2a
haystack
haystack
NEEDLE!!!
haystack
2b

I want to cruise the text file getting every chunk that's like the one from
2a to 2b.

What's the best way?

Thanks!

Justin Crawford
Oracle DBA Team
University of Colorado Management Systems
303-492-9083
From chip at rmpg.org  Wed Jul 25 17:50:23 2001
From: chip at rmpg.org (Chip Atkinson)
Date: Wed Aug  4 23:58:33 2004
Subject: [boulder.pm] FW: text extract
In-Reply-To: <A1329B9A3F28D411AFAB00A0C9E3B002F82E3B@exchange.cusys.edu>
Message-ID: <Pine.LNX.4.10.10107251645001.24897-100000@lilpup.pupman.com>

While perhaps not the best way, here's a way at least:

$start_looking = 0;

while (<>)
{
   if (/2a/)
   {
     $start_looking = 1;
     next;
   }

   if ($start_looking && /NEEDLE/)
   {
     print ("Found it\n");
     exit;
   }
}

Another possibility is to read in the entire file in slurp mode and look
for a pattern like /2a.*NEEDLE.*/s (the /s lets . match across newlines).

Chip



From jvanslyk at matchlogic.com  Wed Jul 25 17:52:55 2001
From: jvanslyk at matchlogic.com (Jason Van Slyke)
Date: Wed Aug  4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
Message-ID: <5FE9B713CCCDD311A03400508B8B30130AB7FB4B@bdr-xcln.corp.matchlogic.com>

that first statement is a true Perlism!
jvs


From Justin.Crawford at cusys.edu  Wed Jul 25 18:02:38 2001
From: Justin.Crawford at cusys.edu (Justin Crawford)
Date: Wed Aug  4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
Message-ID: <A1329B9A3F28D411AFAB00A0C9E3B002F82E3C@exchange.cusys.edu>

Thanks Chip.  My first example was misleading.  It's more like:

-
haystack
haystack
haystack
;

-
haystack
haystack
NEEDLE!
haystack
;

Still, I can see how that first solution could do it; I'll make that one go.
I was thinking there must be some way to do it using the range operators
(...) in combination with another pattern match, or using undef $/, but I
can't figure that way out if it exists.  Probably I'm trying too hard to be
superelite and not trying hard enough to get the thing writ...
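
For what it's worth, the range operator idea can be made to work. The following is only a sketch, assuming the "-" and ";" markers each sit alone on a line as in the sample above; the sub name chunks_with_needle and the in-memory test data are made up for illustration. It buffers lines while the range operator is true and keeps the buffer only if one of its lines holds the needle.

```perl
use strict;
use warnings;

# Return every "-" ... ";" chunk read from $fh that contains $needle.
# (chunks_with_needle is a hypothetical name, not from the thread.)
sub chunks_with_needle {
    my ($fh, $needle) = @_;
    my (@found, @chunk);
    while (my $line = <$fh>) {
        # In scalar context the range operator is true from a line matching
        # the left pattern through the next line matching the right one.
        # Its value ends in "E0" on that closing line.
        my $in_chunk = ($line =~ /^-\s*$/) ... ($line =~ /^;\s*$/);
        next unless $in_chunk;
        push @chunk, $line;
        if ($in_chunk =~ /E0$/) {
            push @found, join('', @chunk) if grep { /\Q$needle\E/ } @chunk;
            @chunk = ();
        }
    }
    return @found;
}

# Small demonstration on in-memory data shaped like the sample file.
my $sample = "-\nhaystack\nhaystack\n;\n\n-\nhaystack\nNEEDLE!\nhaystack\n;\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print chunks_with_needle($fh, 'NEEDLE');
close $fh;
```

Because only one chunk is held at a time, this should cope with large files better than slurping would.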

Justin


From Jay.Kominek at colorado.edu  Wed Jul 25 18:23:36 2001
From: Jay.Kominek at colorado.edu (Jay Kominek)
Date: Wed Aug  4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
In-Reply-To: <A1329B9A3F28D411AFAB00A0C9E3B002F82E3C@exchange.cusys.edu>
Message-ID: <Pine.GSO.4.33.0107251716530.1656-100000@ucsub.colorado.edu>


On Wed, 25 Jul 2001, Justin Crawford wrote:

> Thanks Chip.  My first example was misleading.  It's more like:
>
> -
> haystack
> haystack
> haystack
> ;
>
> -
> haystack
> haystack
> NEEDLE!
> haystack
> ;

undef $/;
$data = <>;
$data =~ /^-.+?NEEDLE!.+?;$/sm;

Hopefully you can modify that to narrow down what matched, or where it was
matched, as needed.
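
One thing to watch when narrowing it down: with /s, the .+? is free to run past the ";" of an earlier chunk that has no needle, so the match can begin at the wrong "-". A possible variant, offered as a sketch only, walks the text line by line inside the pattern and refuses to cross a closing ";" line before the needle turns up. The sub name needle_chunks is made up, and the pattern assumes the markers sit alone on their lines.

```perl
use strict;
use warnings;

# Return every "-" ... ";" chunk of $data that contains $needle.
# (needle_chunks is a hypothetical name, not from the thread.)
sub needle_chunks {
    my ($data, $needle) = @_;
    return $data =~ /
        ( ^-[ \t]*\n                    # opening "-" line
          (?: (?!;[ \t]*\n) .*\n )*?    # lines that are not a closing line
          .*\Q$needle\E.*\n             # the line holding the needle
          (?: (?!;[ \t]*\n) .*\n )*     # more lines that are not a closing line
          ;[ \t]*$                      # closing line
        )
    /mgx;
}

my $data = "-\nhaystack\nhaystack\n;\n\n-\nhaystack\nNEEDLE!\nhaystack\n;\n";
print "$_\n" for needle_chunks($data, 'NEEDLE!');
```

With /g in list context this returns all matching chunks rather than only the first.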

The other possibility appears as though it might be to split the entire
file on \n\n and then grep each element of the returned array for NEEDLE!
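
A rough sketch of that split-and-grep idea, assuming the chunks really are separated by blank lines and never contain one themselves (the sub name paragraphs_with_needle is made up for illustration):

```perl
use strict;
use warnings;

# Split $data on blank lines and keep the pieces that contain $needle.
# (paragraphs_with_needle is a hypothetical name, not from the thread.)
sub paragraphs_with_needle {
    my ($data, $needle) = @_;
    return grep { /\Q$needle\E/ } split /\n{2,}/, $data;
}

my $data = "-\nhaystack\nhaystack\n;\n\n-\nhaystack\nNEEDLE!\nhaystack\n;\n";
print paragraphs_with_needle($data, 'NEEDLE!');
```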

- Jay Kominek <jay.kominek@colorado.edu>
  If you can't do it in Perl,
  it probably isn't worth doing.


From jvanslyk at matchlogic.com  Wed Jul 25 18:37:33 2001
From: jvanslyk at matchlogic.com (Jason Van Slyke)
Date: Wed Aug  4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
Message-ID: <5FE9B713CCCDD311A03400508B8B30130AB7FB4D@bdr-xcln.corp.matchlogic.com>

Justin,

I was thinking you might want to use an array to hold each input line and
set the array index on the start flag and examine each value of the array
after the end flag so you could truly keep track of every line between the
flags:

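while (<>)
{
	if (/^-\s*$/)            # start flag: reset the buffer
	{
		@inray = () ;
		$index = 0 ;
		next ;
	}
	elsif (/^;\s*$/)         # end flag: scan everything collected so far
	{
		foreach (@inray)
		{
			if (/NEEDLE/)
			{
				# NEWFILE is assumed to be open for writing already
				print NEWFILE @inray ;
				last ; # jumps out of the foreach but stays in the while (<>) loop
			}
		}
		next ;
	}

	$inray[$index] = $_ ;    # $_ holds the current line ($/ is the record separator)
	$index++ ;
}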

Sorry, I'm at home and don't have my normal access to Learning Perl or the
Cookbook, so I might have screwed up some of the syntax 'cause I don't get
to write nearly enough Perl. But I think the logic would work.

jvs

From boulder-pm at jim-baker.com  Wed Jul 25 21:01:18 2001
From: boulder-pm at jim-baker.com (Jim Baker)
Date: Wed Aug  4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
In-Reply-To: <A1329B9A3F28D411AFAB00A0C9E3B002F82E3C@exchange.cusys.edu>
Message-ID: <MNELJHCOIJFCJNJNIADCEEDBCMAA.boulder-pm@jim-baker.com>

Justin,

It's certainly perfectly valid to go for a super lite approach.  More or
less.  My preferred way for this sort of problem is to digest the text, then
apply a powerful state machine (regex) against the digest.  The advantage is
you can specify more complex needles, such that the first haystack has to
have two silver needles in it, followed by a haystack with a golden needle,
bracketed by haystacks containing red needles.  Or any other interesting
"sentence" that can be expressed in Perl's expansive regex grammar.  For
example, you could use this for analyzing intrusion detection traces.

- Jim

use strict;
use warnings;

my $needle = shift;
my @data = <>;

# Construct digest of data
my $digest;
foreach my $row (@data) {
    if ($row =~ /^-\s*$/) { $digest .= '-'; }
    elsif ($row =~ /^;\s*$/) { $digest .= ';'; }
    elsif ($row =~ /^\s*$/) { $digest .= ' '; }
    elsif ($row =~ /$needle/o) { $digest .= 'N'; }
    else { $digest .= 'x'; }
}

# Now look for our needle, and any data surrounding it
print STDERR "Looking for '$needle' in digest '$digest'\n";
if ($digest =~ /(-x*Nx*;)/) { # modify for more complex needles
    my $start = $-[0];   # @- is the beginning offsets of the captures
    my $end = $+[0] - 1; # @+ and this is the end
    foreach my $i ($start .. $end) {
	print $data[$i];
    }
}
else {
    die "Needle not found";
}




From porterje at us.ibm.com  Thu Jul 26 07:25:13 2001
From: porterje at us.ibm.com (Jessee Porter)
Date: Wed Aug  4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
Message-ID: <OF5DE2A493.C2838413-ON87256A95.0043CBA4@boulder.ibm.com>


Hi, Justin.

If you know that your end-of-record delimiter is always going to be a lone
semicolon, I'd do something similar to the following. Changing $/ to
semicolon + newline ensures that we read one record at a time...

{
    local $/=";\n";
    while (<FH>) {
        print $_,"\n" if /NEEDLE!/;
    }
}

Some people have also mentioned reading the entire file into an array or
scalar, which is fine, too. TMTOWTDI and all. Be careful of doing this on
exceptionally large files, though, as perl will eat all your memory.
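
For anyone wanting to try the record-separator approach without a file on disk, here is a self-contained sketch. It assumes every record ends in ";" followed directly by a newline; the sub name records_with_needle and the sample data are made up for illustration. It also trims the blank line that ends up at the front of every record after the first.

```perl
use strict;
use warnings;

# Read $fh one ";"-terminated record at a time and return those with $needle.
# (records_with_needle is a hypothetical name, not from the thread.)
sub records_with_needle {
    my ($fh, $needle) = @_;
    local $/ = ";\n";             # one "-" ... ";" record per read
    my @hits;
    while (my $record = <$fh>) {
        $record =~ s/\A\s+//;     # drop the blank line left between records
        push @hits, $record if $record =~ /\Q$needle\E/;
    }
    return @hits;
}

my $sample = "-\nhaystack\nhaystack\n;\n\n-\nhaystack\nNEEDLE!\nhaystack\n;\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print records_with_needle($fh, 'NEEDLE!');
close $fh;
```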

     regards,
     Jesse Porter



From Justin.Crawford at cusys.edu  Thu Jul 26 11:24:54 2001
From: Justin.Crawford at cusys.edu (Justin Crawford)
Date: Wed Aug  4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
Message-ID: <A1329B9A3F28D411AFAB00A0C9E3B002F82E3E@exchange.cusys.edu>

Thanks for all the suggestions, everyone.  I knew the problem had probably
been considered before by better coders than me.  I have a script to get the
chunks I'm after now.

Side note (new-b ?):

Jim, I couldn't get your solution to work.  It looks like fun though.  These
are the 2 lines that lose me:

    my $start = $-[0];   # @- is the beginning offsets of the captures
    my $end = $+[0] - 1; # @+ and this is the end

I just can't figure out what's going on.  Output of the script is like:

Looking for 'NEEDLE' in digest '-xxxxxxxxxxx;
-xxxxxxNxxxx;-xxxxxxNxxxx;-xxxxxxxxxx;-xN;'
Use of uninitialized value at fileR.pl line 23.
Use of uninitialized value at fileR.pl line 24.
Use of uninitialized value at fileR.pl line 24.

@+ isn't initialized.  I've never seen a regular array named like that
before, so I guessed that it's a special variable (along with @-).  But
neither's listed in my perl books, so maybe they're just regular arrays that
I need to fill up?  What's their story, where do they come from, what should
they be initialized to in this context?

Thanks again,

Justin

-----------
use strict;
use warnings;

my $needle = shift;
my @data = <>;

# Construct digest of data
my $digest;
foreach my $row (@data) {
    if ($row =~ /^-\s*$/) { $digest .= '-'; }
    elsif ($row =~ /^;\s*$/) { $digest .= ';'; }
    elsif ($row =~ /^\s*$/) { $digest .= ' '; }
    elsif ($row =~ /$needle/o) { $digest .= 'N'; }
    else { $digest .= 'x'; }
}

# Now look for our needle, and any data surrounding it
print STDERR "Looking for '$needle' in digest '$digest'\n";
if ($digest =~ /(-x*Nx*;)/) { # modify for more complex needles
    my $start = $-[0];   # @- is the beginning offsets of the captures
    my $end = $+[0] - 1; # @+ and this is the end
    foreach my $i ($start .. $end) {
	print $data[$i];
    }
}
else {
    die "Needle not found";
}

From boulder-pm at jim-baker.com  Thu Jul 26 12:23:51 2001
From: boulder-pm at jim-baker.com (Jim Baker)
Date: Wed Aug  4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
In-Reply-To: <A1329B9A3F28D411AFAB00A0C9E3B002F82E3E@exchange.cusys.edu>
Message-ID: <MNELJHCOIJFCJNJNIADCKEDLCMAA.boulder-pm@jim-baker.com>

Justin,

Checking the CHANGES log for 5.6.0, @+ and @- appeared, courtesy of Ilya, in
PATCH 5.004_76, and it appeared in Perl 5.005_52.  @+ and @- are documented
in "The Camel", 3rd Ed., which I *highly* recommend if you want to have fun
with those funky things called Perl regexes.
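
To make the two arrays concrete, here is a tiny sketch. The digest string is made up, and it assumes a perl recent enough to have @- and @+ (5.6.0 or later). After a successful match, $-[0] is the offset where the whole match starts and $+[0] is the offset just past its end, so nothing needs to be filled in by hand.

```perl
use strict;
use warnings;

# A made-up digest: three chunks, only the middle one holds a needle row.
my $digest = '-xxx; -xNx; -xx;';

my @spans;
while ($digest =~ /-x*Nx*;/g) {
    # $-[0]: offset where the whole match starts
    # $+[0]: offset just past where it ends
    push @spans, [ $-[0], $+[0] - 1 ];
}

print "rows $_->[0] through $_->[1]\n" for @spans;   # rows 6 through 10
```

Looping with /g picks up every chunk holding a needle, not only the first.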

- Jim

