From Justin.Crawford at cusys.edu Wed Jul 25 17:35:12 2001
From: Justin.Crawford at cusys.edu (Justin Crawford)
Date: Wed Aug 4 23:58:33 2004
Subject: [boulder.pm] text extract
Message-ID:

Hi all-

I'm trying to extract multiple lines of data from a text file, only if one of the lines contains a string. Picture a file like so:

a
haystack
haystack
b

a
rough
DIAMOND!

From Justin.Crawford at cusys.edu Wed Jul 25 17:39:02 2001
From: Justin.Crawford at cusys.edu (Justin Crawford)
Date: Wed Aug 4 23:58:33 2004
Subject: [boulder.pm] FW: text extract
Message-ID:

Whoa, got ahead of myself there...

I'm trying to extract multiple lines of data from a text file, only if one of the lines contains a string. Picture a file like so:

1a
haystack
haystack
haystack
haystack
1b

2a
haystack
haystack
NEEDLE!!!
haystack
2b

I want to cruise the text file getting every chunk that's like the one from 2a to 2b.

What's the best way?

Thanks!

Justin Crawford
Oracle DBA Team
University of Colorado Management Systems
303-492-9083

From chip at rmpg.org Wed Jul 25 17:50:23 2001
From: chip at rmpg.org (Chip Atkinson)
Date: Wed Aug 4 23:58:33 2004
Subject: [boulder.pm] FW: text extract
In-Reply-To:
Message-ID:

While perhaps not the best way, here's a way at least

    $start_looking = 0;
    while (<>) {
        if (/2a/) {
            $start_looking = 1;
            next;
        }
        if ($start_looking && /NEEDLE/) {
            print ("Found it\n");
            exit;
        }
    }

Another possibility is to read in the entire file in slurp mode and look for a pattern like /2a.*NEEDLE.*/s (the /s modifier is needed so that . also matches newlines).

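
For what it's worth, the slurp-mode idea just mentioned might look something like the sketch below. It assumes the chunk markers are lines of the form "1a" / "1b", "2a" / "2b" as in the example file; the sub name chunks_with_needle and the sample data are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

# Return every chunk that runs from an "Na" line to the next "Nb" line
# and has NEEDLE somewhere in between. (Hypothetical helper.)
sub chunks_with_needle {
    my ($text) = @_;
    my @hits;
    # /m: ^ and $ match at line boundaries; /s: . also matches newlines.
    # The non-greedy .*? stops at the first closing marker, so a match
    # never runs from one chunk into the next.
    while ($text =~ /^(\d+a\n.*?^\d+b$)/msg) {
        my $chunk = $1;
        push @hits, $chunk if $chunk =~ /NEEDLE/;
    }
    return @hits;
}

# Sample data standing in for the slurped file; with real input this
# would be:  my $text = do { local $/; <> };
my $text = join "\n", qw(1a haystack haystack 1b), '',
                      qw(2a haystack NEEDLE!!! haystack 2b), '';
print "$_\n" for chunks_with_needle($text);
```

Like any slurp approach, this holds the whole file in memory, so it is probably only reasonable for modestly sized input.
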
Chip

From jvanslyk at matchlogic.com Wed Jul 25 17:52:55 2001
From: jvanslyk at matchlogic.com (Jason Van Slyke)
Date: Wed Aug 4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
Message-ID: <5FE9B713CCCDD311A03400508B8B30130AB7FB4B@bdr-xcln.corp.matchlogic.com>

that first statement is a true Perlism!

jvs

From Justin.Crawford at cusys.edu Wed Jul 25 18:02:38 2001
From: Justin.Crawford at cusys.edu (Justin Crawford)
Date: Wed Aug 4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
Message-ID:

Thanks Chip. My first example was misleading. It's more like:

-
haystack
haystack
haystack
;

-
haystack
haystack
NEEDLE!
haystack
;

Still, I can see how that first solution could do it; I'll make that one go. I was thinking there must be some way to do it using the range operators (...) in combination with another pattern match, or using undef $/, but I can't figure that way out if it exists. Probably I'm trying too hard to be superelite and not trying hard enough to get the thing writ...

Justin

From Jay.Kominek at colorado.edu Wed Jul 25 18:23:36 2001
From: Jay.Kominek at colorado.edu (Jay Kominek)
Date: Wed Aug 4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
In-Reply-To:
Message-ID:

On Wed, 25 Jul 2001, Justin Crawford wrote:
> Thanks Chip. My first example was misleading. It's more like:
>
> -
> haystack
> haystack
> haystack
> ;
>
> -
> haystack
> haystack
> NEEDLE!
> haystack
> ;

    undef $/;
    $data = <>;
    $data =~ /^-.+?NEEDLE!.+?;$/sm;

Hopefully you can modify that to narrow down what matched, or where it was matched, as needed.

The other possibility appears as though it might be to split the entire file on \n\n and then grep each element of the returned array for NEEDLE!

- Jay Kominek
  If you can't do it in Perl, it probably isn't worth doing.

From jvanslyk at matchlogic.com Wed Jul 25 18:37:33 2001
From: jvanslyk at matchlogic.com (Jason Van Slyke)
Date: Wed Aug 4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
Message-ID: <5FE9B713CCCDD311A03400508B8B30130AB7FB4D@bdr-xcln.corp.matchlogic.com>

Justin,

I was thinking you might want to use an array to hold each input line and set the array index on the start flag and examine each value of the array after the end flag so you could truly keep track of every line between the flags:

    # NEWFILE is assumed to be a filehandle already opened for writing
    while (<>) {
        if (/^-/) {                  # start flag: begin a fresh record
            @inray = ();
            $index = 0;
            next;
        }
        elsif (/^;/) {               # end flag: examine what was collected
            foreach (@inray) {
                if (/NEEDLE/) {
                    print NEWFILE @inray;
                    last;   # jumps out of the foreach but stays in the while (<>) loop
                }
            }
            next;
        }
        $inray[$index] = $_;         # $_ is the current line ($/ is the record separator)
        $index++;
    }

Sorry, I'm at home and don't have my normal access to Learning Perl or the CookBook so I might have screwed up some of the syntax 'cause I don't get to write nearly enough Perl. But I think the logic would work.

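
The line-buffering logic described above can also be driven by the range operator that Justin asked about earlier in the thread. What follows is only a rough sketch: it assumes records start with a lone "-" line and end with a lone ";" line, and the sub name records_with_needle and the sample lines are invented for illustration.

```perl
use strict;
use warnings;

# Collect each "-" ... ";" record and keep it only if one of its lines
# matches $needle. (Hypothetical helper, not from the thread.)
sub records_with_needle {
    my ($needle, @lines) = @_;
    my (@record, @hits);
    for my $line (@lines) {
        # In scalar context ".." is a flip-flop: it turns true on a lone
        # "-" line and stays true through the next lone ";" line.
        if (my $pos = ($line =~ /^-\s*$/ .. $line =~ /^;\s*$/)) {
            push @record, $line;
            # On the last line of a range the sequence number returned
            # by the flip-flop has "E0" appended to it.
            if ($pos =~ /E0$/) {
                push @hits, join('', @record) if grep { /$needle/ } @record;
                @record = ();
            }
        }
    }
    return @hits;
}

# Sample lines standing in for the input file
my @lines = map { "$_\n" } ('-', qw(haystack haystack), ';', '',
                            '-', qw(haystack NEEDLE! haystack), ';');
print records_with_needle('NEEDLE', @lines);
```

Only one record is held in memory at a time, so the same loop body should also work inside a plain while (<>) over a large file.
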
jvs

From boulder-pm at jim-baker.com Wed Jul 25 21:01:18 2001
From: boulder-pm at jim-baker.com (Jim Baker)
Date: Wed Aug 4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
In-Reply-To:
Message-ID:

Justin,

It's certainly perfectly valid to go for a super lite approach. More or less.

My preferred way for this sort of problem is to digest the text, then apply a powerful state machine (regex) against the digest. The advantage is you can specify more complex needles, such that the first haystack has to have two silver needles in it, followed by a haystack with a golden needle, bracketed by haystacks containing red needles. Or any other interesting "sentence" that can be expressed in Perl's expansive regex grammar. For example, you could use this for analyzing intrusion detection traces.

- Jim

    use strict;
    use warnings;

    my $needle = shift;
    my @data = <>;

    # Construct digest of data
    my $digest;
    foreach my $row (@data) {
        if ($row =~ /^-\s*$/) {
            $digest .= '-';
        }
        elsif ($row =~ /^;\s*$/) {
            $digest .= ';';
        }
        elsif ($row =~ /^\s*$/) {
            $digest .= ' ';
        }
        elsif ($row =~ /$needle/o) {
            $digest .= 'N';
        }
        else {
            $digest .= 'x';
        }
    }

    # Now look for our needle, and any data surrounding it
    print STDERR "Looking for '$needle' in digest '$digest'\n";
    if ($digest =~ /(-x*Nx*;)/) { # modify for more complex needles
        my $start = $-[0];      # @- is the beginning offsets of the captures
        my $end = $+[0] - 1;    # @+ and this is the end
        foreach my $i ($start .. $end) {
            print $data[$i];
        }
    }
    else {
        die "Needle not found";
    }

From porterje at us.ibm.com Thu Jul 26 07:25:13 2001
From: porterje at us.ibm.com (Jessee Porter)
Date: Wed Aug 4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
Message-ID:

Hi, Justin.

If you know that your end of record delimiter is always going to be a lone semi-colon, I'd do something similar to the following. Changing $/ to semi-colon + newline ensures that we read one record at a time...

    {
        local $/ = ";\n";
        while (<>) {
            print $_, "\n" if /NEEDLE!/;
        }
    }

Some people have also mentioned reading the entire file into an array or scalar, which is fine, too. TMTOWTDI and all. Be careful of doing this on exceptionally large files, though, as perl will eat all your memory.

regards,
Jesse Porter

From Justin.Crawford at cusys.edu Thu Jul 26 11:24:54 2001
From: Justin.Crawford at cusys.edu (Justin Crawford)
Date: Wed Aug 4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
Message-ID:

Thanks for all the suggestions, everyone. I knew the problem had probably been considered before by better coders than me. I have a script to get the chunks I'm after now.

Side note (new-b ?): Jim, I couldn't get your solution to work. It looks like fun though. These are the 2 lines that lose me:

    my $start = $-[0]; # @- is the beginning offsets of the captures
    my $end = $+[0] - 1; # @+ and this is the end

I just can't figure out what's going on. Output of the script is like:

    Looking for 'NEEDLE' in digest '-xxxxxxxxxxx; -xxxxxxNxxxx;-xxxxxxNxxxx;-xxxxxxxxxx;-xN;'
    Use of uninitialized value at fileR.pl line 23.
    Use of uninitialized value at fileR.pl line 24.
    Use of uninitialized value at fileR.pl line 24.

@+ isn't initialized. I've never seen a regular array named like that before, so I guessed that it's a special variable (along with @-). But neither's listed in my perl books, so maybe they're just regular arrays that I need to fill up? What's their story, where do they come from, what should they be initialized to in this context?

Thanks again,
Justin

-----------

    use strict;
    use warnings;

    my $needle = shift;
    my @data = <>;

    # Construct digest of data
    my $digest;
    foreach my $row (@data) {
        if ($row =~ /^-\s*$/) {
            $digest .= '-';
        }
        elsif ($row =~ /^;\s*$/) {
            $digest .= ';';
        }
        elsif ($row =~ /^\s*$/) {
            $digest .= ' ';
        }
        elsif ($row =~ /$needle/o) {
            $digest .= 'N';
        }
        else {
            $digest .= 'x';
        }
    }

    # Now look for our needle, and any data surrounding it
    print STDERR "Looking for '$needle' in digest '$digest'\n";
    if ($digest =~ /(-x*Nx*;)/) { # modify for more complex needles
        my $start = $-[0]; # @- is the beginning offsets of the captures
        my $end = $+[0] - 1; # @+ and this is the end
        foreach my $i ($start .. $end) {
            print $data[$i];
        }
    }
    else {
        die "Needle not found";
    }

From boulder-pm at jim-baker.com Thu Jul 26 12:23:51 2001
From: boulder-pm at jim-baker.com (Jim Baker)
Date: Wed Aug 4 23:58:34 2004
Subject: [boulder.pm] FW: text extract
In-Reply-To:
Message-ID:

Justin,

Checking the CHANGES log for 5.6.0, @+ and @- appeared, courtesy of Ilya, in PATCH 5.004_76, and they appeared in Perl 5.005_52.

@+ and @- are documented in "The Camel", 3rd Ed., which I *highly* recommend if you want to have fun with those funky things called Perl regexes.

- Jim
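
To make the @- and @+ discussion concrete, here is a small sketch of what the two arrays hold after a successful match. The digest string is a shortened, made-up one in the same style as the script's; perlvar lists both arrays as available from Perl 5.6.0, so on an older perl they behave like ordinary empty arrays, which would fit the "uninitialized value" warnings quoted in the thread.

```perl
use strict;
use warnings;

# Requires Perl 5.6.0 or later for @- and @+.
# One character per input row, in the style of the digest script.
my $digest = '-xxx; -xxNx;-xx;';

my ($start, $end);
if ($digest =~ /-x*Nx*;/) {
    $start = $-[0];        # offset where the whole match begins
    $end   = $+[0] - 1;    # $+[0] is the offset just past the match
    print "match covers digest positions $start..$end\n";
    print "matched text: ", substr($digest, $start, $+[0] - $-[0]), "\n";
}
# prints:
#   match covers digest positions 6..11
#   matched text: -xxNx;
```

Because the digest has exactly one character per input row, those offsets double as indexes into the array of rows, which is why the script can print $data[$start] through $data[$end].
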