[Omaha.pm] A less greedy regular expression...

Jay Hannah jay at jays.net
Wed Sep 22 01:31:56 CDT 2004


On Sep 22, 2004, at 12:13 AM, Daniel Linder wrote:
> I have a variable with content that looks like this:
>
> $a = "AAAbbbCCCAAAdddCCCAAAdddCCCAAAdddCCC";
>
> Basically the "AAA" and "CCC" strings are begin and end markers for the
> text I am interested in (specifically the "bbb" or "ddd" strings).
>
> When I use this command to strip off the "markers"
> $a =~ s/AAA(.*)CCC/$1/;
>
> The $a variable ends up containing "bbbCCCAAAdddCCCAAAdddCCCAAAddd"
> (i.e. the first "AAA" and the last "CCC" were removed).  What I had
> hoped for was to have the first "bbb" returned.
>
> I think the cause of this is that the =~ command is 'greedy' and will
> match the longest string it can find.  Since the number and pattern of
> the remaining markers are random, is there a flag I can pass via the
> regexp to have it match on the first/smallest match?

Negative. '=~' isn't greedy; it's just the binding operator. The '*'
quantifier is. If you want "minimal matching", use '*?' instead. Like so:

$a =~ s/AAA(.*?)CCC/$1/;
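
To see the difference side by side, here is a small sketch. The variable names ($text, $greedy, $minimal, $first) are mine, not from the original post; I avoid $a because sort() treats it specially. Note that the non-greedy substitution only strips the first pair of markers and leaves the rest of the string in place, so if all you want is the first "bbb", capturing it in list context is probably closer to the goal:

```perl
use strict;
use warnings;

my $text = "AAAbbbCCCAAAdddCCCAAAdddCCCAAAdddCCC";

# Greedy: .* runs to the LAST "CCC", so only the outermost markers go away.
(my $greedy = $text) =~ s/AAA(.*)CCC/$1/;

# Non-greedy: .*? stops at the FIRST "CCC"; the rest of the string is untouched.
(my $minimal = $text) =~ s/AAA(.*?)CCC/$1/;

# List-context match: returns just the captured text of the first match.
my ($first) = $text =~ /AAA(.*?)CCC/;

print "greedy:  $greedy\n";   # bbbCCCAAAdddCCCAAAdddCCCAAAddd
print "minimal: $minimal\n";  # bbbAAAdddCCCAAAdddCCCAAAdddCCC
print "first:   $first\n";    # bbb
```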

> A work around I am looking at involves the "split" command like this:
> ($foo, $a, $bar) = split ("AAA|CCC", $a);
>
> Other ideas?
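
On the split workaround: it does work for this input, but it is worth knowing what split actually hands back. A rough sketch, with $text, @fields and @payloads as my own names: the leading "AAA" produces an empty first field, each "CCCAAA" boundary produces another empty field, and trailing empty fields are dropped.

```perl
use strict;
use warnings;

my $text = "AAAbbbCCCAAAdddCCCAAAdddCCCAAAdddCCC";

# Split on either marker. The leading empty field is kept;
# trailing empty fields are removed by split.
my @fields = split /AAA|CCC/, $text;

# Keep only the non-empty fields, i.e. the text between the markers.
my @payloads = grep { length } @fields;

print scalar(@fields), " fields, ", scalar(@payloads), " non-empty\n";  # 8 fields, 4 non-empty
print join("|", @payloads), "\n";                                       # bbb|ddd|ddd|ddd
```

One caveat with this approach: the grep would also throw away a payload that is legitimately empty (as in "AAACCC"), so it only fits if that case can't happen in your data.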

1) I've heard Text::Balanced is neat. I've never used it.

2) Use the match operator (m//g) instead of substitution (s///) to get
all your strings in one fell swoop:

$a = "AAAbbbCCCAAAdddCCCAAAdddCCCAAAdddCCC";
@strings = ($a =~ /AAA(.*?)CCC/g);
print join "|", @strings;

3) Go home because it's 01:30 and you're tired of telco crap.
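
A slightly expanded sketch of option 2, under strict and warnings. The names $text, @strings and @offsets are mine. In list context m//g returns every capture at once; in scalar context it steps through the matches one at a time, which is handy if you also want to know where each one sits (here via $-[1], the start offset of the first capture group):

```perl
use strict;
use warnings;

my $text = "AAAbbbCCCAAAdddCCCAAAdddCCCAAAdddCCC";

# List context: all captured strings in one go.
my @strings = $text =~ /AAA(.*?)CCC/g;
print join("|", @strings), "\n";   # bbb|ddd|ddd|ddd

# Scalar context: one match per iteration, resuming where the last one ended.
my @offsets;
while ($text =~ /AAA(.*?)CCC/g) {
    push @offsets, $-[1];          # where this capture starts in $text
    print "found '$1' at offset $-[1]\n";
}
```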

Grin,

j


