[Chicago-talk] Regex and the whitespace before it.
Steven Lembark
lembark at wrkhors.com
Wed Mar 26 13:22:08 PDT 2008
Mike Fragassi wrote:
> Mike --
>
> You're welcome for the help. Hopefully this will help with the current
> problem:
>
> split /(\w+)\s*:/
>
> Capturing parentheses in the split regex will return that portion of
> the match in the output.
>
> my $string =<<STRING;
> subject1 : description of a subject. subject2 : description of a subject.
> subject3 : description of a subject. etc etc
> STRING
> my @aa = split /(\w+)\s*:\s*/, $string;
> $,="\n";
> print @aa;
If you can force the entries onto single lines then
newlines become reasonable delimiters and you can
break them up that way:
my @found;
for( split "\n", $string )
{
    my ( $subject, $contents ) = split m{ \s*:\s* }x, $_, 2;
    # whatever...
    push @found, [ $subject, $contents ];
}
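The limit of 2 on that split is what keeps any later colons in the
description intact. A quick sketch, with a sample line made up for
illustration:

```perl
use strict;
use warnings;

# Hypothetical input line; the description contains a colon of its own.
my $line = "subject1 : see note: colons here are kept";

# With a limit of 2, split stops after the first delimiter, so the
# remainder of the line comes back as a single field.
my ( $subject, $contents ) = split m{ \s*:\s* }x, $line, 2;

print "$subject => $contents\n";
```

Without the limit the same line would come back as three fields and the
tail of the description would be dropped by the two-element assignment.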
Otherwise, if the descriptions slop over then you
know that subjects start at offset zero with a word
followed by optional space and a colon:
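my @found;
my $sub  = '';
my $data = '';
for( split "\n", $string )
{
    # subject line: word at offset zero, optional space, colon
    if( ( $sub, $data ) = m{ ^ (\S+) \s* : \s* (.+) }x )
    {
        push @found, [ $sub, $data ];
    }
    elsif( @found )
    {
        # continuation line: append to the last description
        $found[-1][-1] .= " $_";
    }
}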
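A possible alternative to the loop, tying this back to the capturing
split Mike quoted: anchor the subject pattern with ^ under /m and let a
single split do the work. This is only a sketch; the sample text and
the cleanup of the descriptions are my assumptions about what the data
looks like.

```perl
use strict;
use warnings;

# Sample input made up for illustration.
my $string = <<'END';
subject1 : description of a subject
that runs onto a second line.
subject2: another description.
END

# /m lets ^ match at the start of every line; the capture keeps each
# subject in the output list. The first field is the empty string
# before the first subject, so it is thrown away with undef.
my ( undef, @pairs ) = split m{ ^ (\w+) \s* : \s* }xm, $string;

my @found;
while( my ( $sub, $data ) = splice @pairs, 0, 2 )
{
    $data =~ s{\s+}{ }g;    # fold newlines into single spaces
    $data =~ s{\s+$}{};     # trim trailing whitespace
    push @found, [ $sub, $data ];
}

print "$_->[0] => $_->[1]\n" for @found;
```

It has the same weakness as the loop: a continuation line that happens
to start with a word and a colon will be taken for a new subject.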
If the stuff is aligned and delimited SOMEhow then
you can use split, regexes, or index + substr to
get the data out (those are the ones I can think of
sitting here).
If you can't even guarantee that the stuff is
left-aligned or that extra colon chars won't appear in
the data anywhere then you don't have delimited data
and had better find a nice, solid AI parser that can
correctly determine the grammatical context of the
titles in order to pick them out.
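For completeness, the index + substr route might look something like
this for one line. It is a sketch that assumes the first colon on the
line is the delimiter; the sample line is made up.

```perl
use strict;
use warnings;

# Hypothetical input line.
my $line = "subject1 : description: with a colon";

my ( $subject, $contents ) = ( $line, '' );

# index returns -1 when there is no colon at all.
my $i = index $line, ':';
if( $i >= 0 )
{
    $subject  = substr $line, 0, $i;
    $contents = substr $line, $i + 1;

    # strip the padding around the delimiter
    s{^\s+|\s+$}{}g for $subject, $contents;
}

print "$subject => $contents\n";
```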
--
Steven Lembark +1 888 359 3508
Workhorse Computing 85-09 90th St
lembark at wrkhors.com Woodhaven, NY 11421