[Chicago-talk] Regex and the whitespace before it.

Wed Mar 26 13:22:08 PDT 2008

Mike Fragassi wrote:
 > Mike --
 >
 > You're welcome for the help.  Hopefully this will help with the current
 > problem:
 >
 >     split /(\w+)\s*:/
 >
 > Capturing parentheses in the split regex will return that portion of
 > the match in the output.
 >
 > my $string =<<STRING;
 > subject1 : description of a subject. subject2 : description of a subject.
 > subject3 : description of a subject.  etc etc
 > STRING
 > my @aa = split /(\w+)\s*:\s*/, $string;
 > $,="\n";
 > print @aa;
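
For what it's worth, here is a rough sketch of what that split
hands back and one way to pair the pieces up. The match at offset
zero produces an empty first field, which is dropped before pairing.
The %desc hash and the trailing-whitespace trim are my own additions
for illustration, not part of the code quoted above.

```perl
use strict;
use warnings;

my $string = <<'STRING';
subject1 : description of a subject. subject2 : description of a subject.
subject3 : description of a subject.  etc etc
STRING

# Capturing parens make split return the subject along with the text
# between matches; the match at offset zero yields an empty first field.
my @fields = split /(\w+)\s*:\s*/, $string;
shift @fields if @fields && $fields[0] eq '';

# What is left alternates subject, description, subject, ...
my %desc;

while( my ( $subject, $text ) = splice @fields, 0, 2 )
{
    $text = '' unless defined $text;
    $text =~ s{ \s+ \z }{}x;    # trim the trailing space or newline

    $desc{ $subject } = $text;
}

print "$_ => $desc{$_}\n" for sort keys %desc;
```

This only holds up while no word in a description is followed
directly by a colon, since the regex cannot tell that apart from
a subject.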

If you can force the entries onto single lines then
newlines become reasonable delimiters and you can
break them up that way:

     my @found = ();

     for( split "\n", $string )
     {
         # the limit of 2 keeps any later colons in the contents
         my ( $subject, $contents ) = split m{ \s*:\s* }x, $_, 2;

         # whatever...

         push @found, [ $subject, $contents ];
     }

Otherwise, if the descriptions slop over onto the
following lines then you know that subjects start at
offset zero with a word followed by optional space
and a colon:

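     my @found   = ();
     my $sub     = '';
     my $data    = '';

     for( split "\n", $string )
     {
         # subject at offset zero, optional space around the colon
         if( ( $sub, $data ) = m{ ^ (\S+) \s* : \s* (.+) }x )
         {
             push @found, [ $sub, $data ];
         }
         elsif( @found )
         {
             # continuation line: put back the space that the
             # split on newlines removed
             $found[-1][-1] .= " $_";
         }
     }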

If the stuff is aligned and delimited SOMEhow then
you can use split, regexes, or index + substr to
get the data out (that I can think of sitting here).
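
A minimal sketch of the index + substr route, assuming one entry
per line and that only the first colon on a line is the delimiter.
The sample lines in @lines are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical one-entry-per-line input; only the first colon delimits.
my @lines =
(
    'subject1 : description of a subject.',
    'subject2: times are 10:30 and 11:45',
);

my @found = ();

for my $line ( @lines )
{
    my $i = index $line, ':';

    next if $i < 0;     # no colon, nothing to split on

    my $subject  = substr $line, 0, $i;
    my $contents = substr $line, $i + 1;

    # strip the padding on either side of each piece
    for( $subject, $contents )
    {
        s{ ^ \s+ }{}x;
        s{ \s+ \z }{}x;
    }

    push @found, [ $subject, $contents ];
}

print join( ' => ', @$_ ), "\n" for @found;
```

Because index stops at the first colon, later colons stay in the
contents, much like the limit of 2 does for split.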

If you can't even guarantee that the stuff is left-
aligned or that extra colon characters won't appear in
the data anywhere then you don't have delimited data
and had better find a nice, solid AI parser that can
correctly determine the grammatical context of the
titles in order to pick them out.

-- 
Steven Lembark                                          +1 888 359 3508
Workhorse Computing                                       85-09 90th St
lembark at wrkhors.com                                 Woodhaven, NY 11421