From martin_jacobs at optusnet.com.au Thu Apr 12 01:07:47 2007
From: martin_jacobs at optusnet.com.au (Martin Jacobs)
Date: Thu, 12 Apr 2007 18:07:47 +1000
Subject: [Brisbane-pm] Regex Syntax
Message-ID:

Hi folks,

I've got a clumsy way to do regexes, and I'm looking for a better way.

I need to read the data from a file (called $name), which has the following format...

Date                 Rainfall mm
01/01/2000 00:00:00  27.333
01/01/2000 00:05:00  0.0
01/01/2000 02:00:00  29.15
01/01/2000 02:05:30  0.0

The way I've been doing it to date is

open (RF, "<", "$name") or die "PERRMOSS could not open file: $name!";
my @file = <RF>;
chomp @file;

Which cuts off whatever newline character there is on each line, followed by

for $i (1..$#file){
    if ($file[$i] =~ m|(\d+)\/(\d+)\/(\d{4})\s+(\d+):(\d+):(\d+)\s*(\d.\d*)|) {
        ($day,$month,$year,$hour,$minute,$second,$rain) = ($1,$2,$3,$4,$5,$6,$7);
    }
}

To accommodate some variations in the input record, I have expanded this to

for $i (1..$#file){
    if ($file[$i] =~ m|(\d+)\/(\d+)\/(\d{4})\s+(\d+):(\d+):(\d+)\s*(\d.\d*)|) {
        ($day,$month,$year,$hour,$minute,$second,$rain) = ($1,$2,$3,$4,$5,$6,$7);
    }
    elsif ($file[$i] =~ m|(\d+)\/(\d+)\/(\d{4})\s+(\d+):(\d+)\s*(\d.\d*)|) {
        ($day,$month,$year,$hour,$minute,$rain) = ($1,$2,$3,$4,$5,$6);
        $second = 0;
    }
    elsif ($file[$i] =~ m|(\d+)\/(\d+)\/(\d{4})\s+(\d+)\s*(\d.\d*)|) {
        ($day,$month,$year,$hour,$rain) = ($1,$2,$3,$4,$5);
        ($second,$minute) = (0,0);
    }
    elsif ($file[$i] =~ m|(\d+)\/(\d+)\/(\d{4})\s*(\d.\d*)|) {
        ($day,$month,$year,$rain) = ($1,$2,$3,$4);
        ($second,$minute,$hour) = (0,0,0);
    }
    else {
        print_to [$Screen,$summary], "
PERRMOSS cannot read rainfall data file $name near line $i
PERRMOSS aborted at: \t\t$fulltime\n\n";
        exit;
    }
}

The problem is that the first value of $rain should equal 27.333, but it equals 27. So, there's a syntax issue, and I would be grateful for any hints.

In terms of the bigger picture, is using $1,$2 etc the best way to do it?

Regards,
Martin

Visit my website...
http://web.mac.com/martin_jacobs1
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.pm.org/pipermail/brisbane-pm/attachments/20070412/e68c4bc5/attachment.html

From djames at thehub.com.au Thu Apr 12 15:01:43 2007
From: djames at thehub.com.au (Damian James)
Date: Fri, 13 Apr 2007 08:01:43 +1000
Subject: [Brisbane-pm] Regex Syntax
In-Reply-To:
References:
Message-ID: <7ECD6941-3A4A-40FB-92C2-AED5DEDB990C@thehub.com.au>

On 12/04/2007, at 6:07 PM, Martin Jacobs wrote:

> Hi folks,
>
> I've got a clumsy way to do regexes, and I'm looking for a better way.
>
> I need to read the data from a file (called $name), which has the
> following format...
>
> Date                 Rainfall mm
> 01/01/2000 00:00:00  27.333
> 01/01/2000 00:05:00  0.0
> 01/01/2000 02:00:00  29.15
> 01/01/2000 02:05:30  0.0
>

Hmm, you could just use split() first on the whitespace, then on the /s and :s

> Which cuts off whatever newline character there is on each line,
> followed by
>
> for $i (1..$#file){
>     if ($file[$i] =~ m|(\d+)\/(\d+)\/(\d{4})\s+(\d+):(\d+):(\d+)\s*(\d.\d*)|) {
>         ($day,$month,$year,$hour,$minute,$second,$rain) = ($1,$2,$3,$4,$5,$6,$7);
>     }
> }

You make an index counter, but don't really need it. Also note that if your match *fails*, those number variables will still contain their last successful match. This is probably bad. How about:

for my $line ( @file ) {
    my @values = ( $line =~ m|(\d+)\/(\d+)\/(\d{4})\s+(\d+):(\d+):(\d+)\s*(\d.\d*)| );
}

> To accommodate some variations in the input record, I have expanded
> this to
>
> for $i (1..$#file){
>     if ($file[$i] =~ m|(\d+)\/(\d+)\/(\d{4})\s+(\d+):(\d+):(\d+)\s*(\d.\d*)|) {
>         ($day,$month,$year,$hour,$minute,$second,$rain) = ($1,$2,$3,$4,$5,$6,$7);
>     }
>     elsif ($file[$i] =~ m|(\d+)\/(\d+)\/(\d{4})\s+(\d+):(\d+)\s*(\d.\d*)|) {
>         ($day,$month,$year,$hour,$minute,$rain) = ($1,$2,$3,$4,$5,$6);
>         $second = 0;
>     }
>     elsif ($file[$i] =~ m|(\d+)\/(\d+)\/(\d{4})\s+(\d+)\s*(\d.\d*)|) {
>         ($day,$month,$year,$hour,$rain) = ($1,$2,$3,$4,$5);
>         ($second,$minute) = (0,0);
>     }
>     elsif ($file[$i] =~ m|(\d+)\/(\d+)\/(\d{4})\s*(\d.\d*)|) {
>         ($day,$month,$year,$rain) = ($1,$2,$3,$4);
>         ($second,$minute,$hour) = (0,0,0);
>     }
>     else {
>         print_to [$Screen,$summary], "
> PERRMOSS cannot read rainfall data file $name near line $i
> PERRMOSS aborted at: \t\t$fulltime\n\n";
>         exit;
>     }
> }
>

Testing pattern matches is good, since you never assign values from the number variables unless there's really been a match.

> The problem is that the first value of $rain should equal 27.333,
> but it equals 27. So, there's a syntax issue, and I would be
> grateful for any hints.
>

It's the pattern. To match that part of the string, you have

\d.\d*

This matches: 1 digit, followed by one "any character", followed by zero or more digits. Against 27.333 the \d matches "2", the . matches "7", and \d* then stops at the literal dot, so all you capture is 27.

You probably wanted \d+\.?\d* or something like it.

> In terms of the bigger picture, is using $1,$2 etc the best way to
> do it?

I suggest never using them unless you have tested for a successful match, though you are doing that here (just getting an incorrect match).

Directly assigning the matches like I have done is okay, because it doesn't return values unless there was a match.
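A minimal sketch of the two points above, using one sample line from the original post. The variable names $broken, $fixed and @values are only for illustration, and the pattern is shortened to the time and rainfall part of the record:

```perl
use strict;
use warnings;

my $line = "01/01/2000 00:00:00 27.333";

# Original pattern: \d matches "2", . matches "7", \d* stops at the dot.
my ($broken) = $line =~ m|\d+:\d+:\d+\s*(\d.\d*)|;

# Suggested pattern: digits, an optional literal dot, then more digits.
my ($fixed) = $line =~ m|\d+:\d+:\d+\s*(\d+\.?\d*)|;

print "broken: $broken\n";   # broken: 27
print "fixed:  $fixed\n";    # fixed:  27.333

# In list context a failed match returns an empty list,
# so no stale values from an earlier match can leak in.
my @values = "no numbers here" =~ m|(\d+)/(\d+)/(\d{4})|;
print "captures on failure: ", scalar(@values), "\n";   # captures on failure: 0
```

Assigning straight from the match keeps the captures and the success test in one place, which is why it is safer than reading $1, $2 and friends afterwards.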
Of course, you then need to test whether you got the values you were expecting :)

In your case though I'd just use split()

my ($date, $time, $rain) = split ' ', $line;
my ($day, $month, $year) = split '/', $date;
my ($hour, $minute, $second) = split ':', $time;

Though I'd use a hash to hold all those names, then push a reference to it onto an array:

open FILE, '<', $name or die $!;
my @records;
for my $line (<FILE>) {
    my %temp;
    @temp{ qw/ date time rain / } = split ' ', $line;
    @temp{ qw/ day month year / } = split '/', $temp{date};
    @temp{ qw/ hour minute second / } = split ':', $temp{time};
    push @records, \%temp;
}
close FILE or die $!;

Then for any given record $i, $records[$i]->{rain} has the $rain value.

Cheers,
Damian

From jarich at perltraining.com.au Thu Apr 12 18:54:37 2007
From: jarich at perltraining.com.au (Jacinta Richardson)
Date: Fri, 13 Apr 2007 11:54:37 +1000
Subject: [Brisbane-pm] Regex Syntax
In-Reply-To: <7ECD6941-3A4A-40FB-92C2-AED5DEDB990C@thehub.com.au>
References: <7ECD6941-3A4A-40FB-92C2-AED5DEDB990C@thehub.com.au>
Message-ID: <461EE2DD.6080201@perltraining.com.au>

Damian James wrote:

>> Date                 Rainfall mm
>> 01/01/2000 00:00:00  27.333
>> 01/01/2000 00:05:00  0.0
>> 01/01/2000 02:00:00  29.15
>> 01/01/2000 02:05:30  0.0
>>
>
> Hmm, you could just use split() first on the whitespace, then on
> the /s and :s

As Damian has said, split() is definitely the way to go. It's going to be much less error prone and very easy for everyone to understand.
My solution is slightly different from Damian's (untested, but should be close):

open FILE, "<", $name or die $!;

my @records;
while (<FILE>) {
    my ($date, $time, $rainfall) = split (/ /, $_);

    my ($day, $month, $year) = split("/", $date);

    # Sometimes we're not given time at all
    my ($hours, $min, $sec) = (0,0,0);
    if( !$rainfall ) {
        $rainfall = $time;
    }
    else {
        ($hours, $min, $sec) = split(":", $time);
    }

    push @records, {
        day    => $day,
        month  => $month,
        year   => $year,
        hour   => $hours,
        minute => $min || 0,    # sometimes no minutes
        second => $sec || 0,    # sometimes no seconds
        rain   => $rainfall,
    };
}

Using a while loop, as opposed to a foreach loop or slurping the whole file into an array, will use less memory (in most cases), because only one line is held at a time. If you're planning to do something with the records as soon as you have the data, then obviously you don't need to store that information.

> You make an index counter, but don't really need it. How about:
> Also note that if your match *fails*, those number variables will
> still contain their last successful match. This is probably bad.

This doesn't always happen, but it can happen and is bad when it does.

>> In terms of the bigger picture, is using $1,$2 etc the best way to
>> do it?

If you can avoid using $1, $2, $3 etc in favour of your own variable names then that's probably a good idea. For example:

if ($file[$i] =~ m|(\d+)\/(\d+)\/(\d{4})\s*(\d.\d*)|) {
    ($day,$month,$year,$rain) = ($1,$2,$3,$4);
    ($second,$minute,$hour) = (0,0,0);
}

could be rewritten:

if( ($day, $month, $year, $rain) =
        ( $file[$i] =~ m|(\d+)/(\d+)/(\d{4})\s*(\d.\d*)| ) ) {
    ($second,$minute,$hour) = (0,0,0);
}

merely by capturing the matches in list context.

I still recommend split for this kind of problem.
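For what it's worth, here is one way the split approach could cover all four record shapes from the original post (full time, hours and minutes, hours only, no time). It is a sketch under assumptions rather than anyone's actual code: parse_rain_line is a made-up helper name, the sample lines are invented, and it assumes the rainfall value is always the last field on the line.

```perl
use strict;
use warnings;

# parse_rain_line is a hypothetical helper, not code from the thread.
sub parse_rain_line {
    my ($line) = @_;

    # split ' ' splits on runs of whitespace and drops the trailing newline.
    my @fields = split ' ', $line;

    # Anything not starting with a d/m/y date (e.g. the header) is skipped.
    return unless @fields >= 2 && $fields[0] =~ m|^\d+/\d+/\d+$|;

    my $rain = pop @fields;    # assumption: rainfall is the last field
    my ($day, $month, $year) = split '/', $fields[0];

    # Time is optional and may be h, h:m or h:m:s; missing parts become 0.
    my ($hour, $minute, $second)
        = defined $fields[1] ? split(':', $fields[1]) : ();

    return {
        day  => $day,        month  => $month,        year   => $year,
        hour => $hour || 0,  minute => $minute || 0,  second => $second || 0,
        rain => $rain,
    };
}

my @records = map { parse_rain_line($_) } (
    "Date Rainfall mm\n",
    "01/01/2000 00:00:00 27.333\n",
    "01/01/2000 02:05 0.0\n",
    "02/01/2000 29.15\n",
);

printf "%s/%s/%s %02d:%02d:%02d %s\n",
    @{$_}{qw(day month year hour minute second rain)} for @records;
```

Taking the rainfall off the end with pop means the "time is missing" case needs no special handling, at the cost of trusting that nothing ever follows the rainfall value.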
all the best,

    Jacinta

From jarich at perltraining.com.au Thu Apr 12 20:21:15 2007
From: jarich at perltraining.com.au (Jacinta Richardson)
Date: Fri, 13 Apr 2007 13:21:15 +1000
Subject: [Brisbane-pm] Regex Syntax
In-Reply-To: <200704130236.l3D2aMoB011759@wraith.its.griffith.edu.au>
References: <200704130236.l3D2aMoB011759@wraith.its.griffith.edu.au>
Message-ID: <461EF72B.3090801@perltraining.com.au>

Anthony Thyssen wrote:
> Jacinta Richardson on wrote...
> | open FILE, "<", $name or die $!;
> |
> | my @records;
> | while (<FILE>) {
> |     my ($date, $time, $rainfall) = split (/ /, $_);
> |
> |     my ($day, $month, $year) = split("/", $date);
> |
> |     # Sometimes we're not given time at all
> |     my ($hours, $min, $sec) = (0,0,0);
> |     if( !$rainfall ) {
> |         $rainfall = $time;
> |     }
> |     else {
> |         ($hours, $min, $sec) = split(":", $time);
> |     }
> |
> |     push @records, {
> |         day    => $day,
> |         month  => $month,
> |         year   => $year,
> |         hour   => $hours,
> |         minute => $min || 0,    # sometimes no minutes
> |         second => $sec || 0,    # sometimes no seconds
> |         rain   => $rainfall,
> |     };
> | }
> |
> Hmmm... Small bug in the above.
>
> If time was not given, then hours is set to the rainfall,
> and also needs to be reset to 0.

How would that happen? If time is not given then the value for $rainfall will have been put into $time. So we do:

    if( !$rainfall ) {
        $rainfall = $time;
    }

and hours remains untouched (thus set to 0 from the initialiser above). Have I missed something?

Further, the first split should probably be:

    split (/\s+/, $_);

and if the rainfall can ever be zero we should probably write:

    if( not defined $rainfall ) {
        $rainfall = $time;
    }

All the best,

    Jacinta

--
   ("`-''-/").___..--''"`-._          |  Jacinta Richardson         |
    `6_ 6  )   `-.  (     ).`-.__.`)  |  Perl Training Australia    |
    (_Y_.)'  ._   )  `._ `. ``-..-'   |      +61 3 9354 6001        |
  _..`--'_..-_/  /--'_.' ,'           | contact at perltraining.com.au |
 (il),-''  (li),'  ((!.-'             |   www.perltraining.com.au   |

From martin_jacobs at optusnet.com.au Thu Apr 12 21:12:42 2007
From: martin_jacobs at optusnet.com.au (Martin Jacobs)
Date: Fri, 13 Apr 2007 14:12:42 +1000
Subject: [Brisbane-pm] Regex Syntax 2
Message-ID: <126FA247-E6CF-40D3-9F99-1F82CF2DF891@optusnet.com.au>

Thanks for your replies.

Getting rid of the $1, $2 stuff helps a lot. I had also spotted the 'bug' when the hours:minutes:seconds field was missing, but I've got a bit of simple logic to get round it (set $hours = 0, and $rain = $time).

I think that's enough to keep me going for now.

Regards,
Martin

Visit my website...
http://web.mac.com/martin_jacobs1
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.pm.org/pipermail/brisbane-pm/attachments/20070413/95048d00/attachment.html

From martin_jacobs at optusnet.com.au Sun Apr 15 21:33:48 2007
From: martin_jacobs at optusnet.com.au (Martin Jacobs)
Date: Mon, 16 Apr 2007 14:33:48 +1000
Subject: [Brisbane-pm] Still Struggling With Regex Syntax
Message-ID:

Hi folks,

I still don't get it (quite). The story so far...

I have split my rainfall record, and the right-most scalar is $rain. I now want to make sure that $rain holds a numeric value, without any whitespace, newlines or other nasties.

If I try

$rain = ($rain =~ m|\d+\.?\d*|);

It returns 1, which, presumably, is the number of times it gets a match.

What's the right syntax for saying 'make $rain contain the numeric values in $rain'?

Regards,
Martin

Visit my website...
http://web.mac.com/martin_jacobs1
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.pm.org/pipermail/brisbane-pm/attachments/20070416/1ed8d3c8/attachment.html

From David.Bussenschutt at qmtechnologies.com Sun Apr 15 21:49:52 2007
From: David.Bussenschutt at qmtechnologies.com (David Bussenschutt)
Date: Mon, 16 Apr 2007 14:49:52 +1000
Subject: [Brisbane-pm] Still Struggling With Regex Syntax
Message-ID: <19F217C6E2CA304CBDFE4D8CB16CA14B0392DEE7@exch-b01.qmtechnologies.com>

Martin,

How about this:

# this example contains numbers, whitespace and other irrelevant stuff, possibly newlines
my $rain = " 345678 xxx\nblah";

# magic:
$rain =~ s/^.*?(\d+).*$/$1/s;

# my explanation:
# the basic substitution operator looks like this: s///; but with extra characters between the slashes as required below....
# ^.*? means match anything, starting from the start of the string (non-greedy, so it stops as soon as the first digit is reached)
# (\d+) means match any continuous run of digits, and the brackets mean remember them as $1
# .*$ means match anything, finishing at the end of the string
# the $1 means "replace the entire matched string with just the bit remembered in $1" (ie the digits).
# the 's' at the end means don't treat newlines as special, so . matches them too ( http://perldoc.perl.org/perlre.html )
# NOTE: this code gives you the left-most string of consecutive digits in any string, e.g. if $rain started as 'xxx1234yyy5678' it would end up as just '1234'. For the same reason '27.333' would end up as just '27'.

print $rain;  # prints the string '345678'

-----Original Message-----
From: brisbane-pm-bounces+david.bussenschutt=qmtechnologies.com at pm.org [mailto:brisbane-pm-bounces+david.bussenschutt=qmtechnologies.com at pm.org]On Behalf Of Martin Jacobs
Sent: Monday, 16 April 2007 2:34 PM
To: Brisbane Perl Group
Subject: [Brisbane-pm] Still Struggling With Regex Syntax

Hi folks,

I still don't get it (quite). The story so far...

I have split my rainfall record, and the right-most scalar is $rain. I now want to make sure that $rain holds a numeric value, without any whitespace, newlines or other nasties.
If I try

$rain = ($rain =~ m|\d+\.?\d*|);

It returns 1, which, presumably, is the number of times it gets a match.

What's the right syntax for saying 'make $rain contain the numeric values in $rain'?

Regards,
Martin

Visit my website...
http://web.mac.com/martin_jacobs1

The message and any attachment is confidential and may be privileged or otherwise protected from disclosure. If you have received it by mistake please let us know by reply and then delete it from your system; you should not copy the message or disclose its contents to anyone.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.pm.org/pipermail/brisbane-pm/attachments/20070416/e93f492b/attachment.html

From jarich at perltraining.com.au Sun Apr 15 21:54:40 2007
From: jarich at perltraining.com.au (Jacinta Richardson)
Date: Mon, 16 Apr 2007 14:54:40 +1000
Subject: [Brisbane-pm] Still Struggling With Regex Syntax
In-Reply-To:
References:
Message-ID: <46230190.8090704@perltraining.com.au>

Martin Jacobs wrote:

> $rain = ($rain =~ m|\d+\.?\d*|);
>
> It returns 1, which, presumably, is the number of times it gets a match.

Almost there:

($rain) = ($rain =~ m|
    (           # start capturing
        \d+     # some digits, 1 or more
        (?:     # open grouping braces (non-capturing)
            \.  # a dot
            \d+ # one or more digits
        )?      # close grouping: make optional
    )           # stop capturing
    |x
);

Just like when you're capturing for $1 you need to use parentheses inside your regular expression.

In scalar context a match returns true (1) or false, which is where your 1 came from; in list context it returns the captured substrings. Thus to get the actual match, you need to assign to a list. So you need the parentheses around $rain on the left-hand side as well.

If you want to allow numbers like: "13."
then change the second \d+ to \d*

If you want to insist that rain _only_ contain this match (so that you can reject invalid lines), then anchor the expression to the start and end:

($rain) = ($rain =~ m|
    ^           # start of string
    (           # start capturing
        \d+     # some digits, 1 or more
        (?:     # open grouping braces (non-capturing)
            \.  # a dot
            \d+ # one or more digits
        )?      # close grouping: make optional
    )           # stop capturing
    \s*$        # optional whitespace, followed by end of string
    |x
);

It's the "x" at the end which allows me to add comments and arbitrary whitespace.

All the best,

    Jacinta

--
   ("`-''-/").___..--''"`-._          |  Jacinta Richardson         |
    `6_ 6  )   `-.  (     ).`-.__.`)  |  Perl Training Australia    |
    (_Y_.)'  ._   )  `._ `. ``-..-'   |      +61 3 9354 6001        |
  _..`--'_..-_/  /--'_.' ,'           | contact at perltraining.com.au |
 (il),-''  (li),'  ((!.-'             |   www.perltraining.com.au   |

From martin_jacobs at optusnet.com.au Mon Apr 16 21:08:02 2007
From: martin_jacobs at optusnet.com.au (Martin Jacobs)
Date: Tue, 17 Apr 2007 14:08:02 +1000
Subject: [Brisbane-pm] Regex Syntax - Context is everything
Message-ID: <92AAA3AD-2E58-4B38-B8AF-152ABB0D5B19@optusnet.com.au>

Thanks for all your replies,

Got it fixed with Jacinta's suggestion, which condenses to...

($rain) = ($rain =~ m|(\d+(?:\.\d+)?)|x);

I had read about list and scalar context, but I must admit that I could not make any sense of it at the time.

Regards,
Martin

Visit my website...
http://web.mac.com/martin_jacobs1
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.pm.org/pipermail/brisbane-pm/attachments/20070417/d3fc260f/attachment.html

From djames at thehub.com.au Mon Apr 16 23:29:07 2007
From: djames at thehub.com.au (Damian James)
Date: Tue, 17 Apr 2007 16:29:07 +1000
Subject: [Brisbane-pm] Regex Syntax - Context is everything
In-Reply-To: <92AAA3AD-2E58-4B38-B8AF-152ABB0D5B19@optusnet.com.au>
References: <92AAA3AD-2E58-4B38-B8AF-152ABB0D5B19@optusnet.com.au>
Message-ID:

On 17/04/2007, at 2:08 PM, Martin Jacobs wrote:

>
> ($rain) = ($rain =~ m|(\d+(?:\.\d+)?)|x);
>
> I had read about list and scalar context, but I must admit that
> I could not make any sense of it at the time.

Can also do:

$rain = ($rain =~ m|(\d+(?:\.\d+)?)|x)[0];

Which is a list slice with one element. Index 0 picks the first capture, which is the same as the behaviour above, since ($rain) = ... also takes the first element of the list. You could make the index -1 to pick the last capture instead, if there are multiple.

Cheers
Damian
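
To see the three forms from this thread side by side, here is a small sketch. The sample value in $raw and the names $scalar_result, $list_result and $slice_result are made up for illustration; the pattern is the condensed one quoted above, with the non-capturing group written as (?:...).

```perl
use strict;
use warnings;

my $raw     = "  27.333 \n";
my $pattern = qr/(\d+(?:\.\d+)?)/;

# Scalar on the left: the match runs in scalar context and yields true/false.
my $scalar_result = ($raw =~ $pattern);

# Parentheses on the left: list context, so the captures are returned.
my ($list_result) = ($raw =~ $pattern);

# List slice: list context again, then element 0 of the captures.
my $slice_result = ($raw =~ $pattern)[0];

print "scalar: $scalar_result\n";   # scalar: 1
print "list:   $list_result\n";     # list:   27.333
print "slice:  $slice_result\n";    # slice:  27.333
```

The only difference between the first two assignments is the pair of parentheses around the variable on the left, which is what switches the match from scalar to list context.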