[Pdx-pm] Noob...
James marks
jamarks at jamarks.com
Fri Mar 19 01:35:53 CST 2004
OK. As I said in an earlier post, I downloaded the HTML file just to
make the project a little easier, although I know it's possible to
download the file from the web using a Perl script. I'm just trying to
take things one step at a time.
Here's my script.
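#!/usr/bin/perl
# web page saved to my drive
open FILEIN, "<", "some_web_page.html" or die "Can't open some_web_page.html: $!";
undef $/; # read the whole file at once; otherwise <FILEIN> returns only the first line
$_ = <FILEIN>;
$counter = 0;
while (/<B>.*?<\/B>/) {
    print "$&\n\n"; # this is just to test the success of the match
    $counter += 1;
    $_ = $'; # keep searching in the text after this match
}
print "There were $counter matches.\n";
close FILEIN;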
My understanding is this:
The default scalar variable, $_, takes in the entire text of the HTML
file as a string.
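A note on that first point: in scalar context a read from a filehandle
returns one line at a time, so the variable only holds the whole file if
the input record separator, $/, is undefined first. The sketch below is
only an illustration: it assumes an in-memory filehandle and a made-up
sample string ($sample) standing in for the saved page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the saved web page: three lines of text.
my $sample = "<B>one</B>\n<B>two</B>\nplain text\n";

# Default behaviour: a scalar-context read returns only the first line.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $first_line = <$fh>;
close $fh;

# Slurp mode: with $/ undefined, the same read returns the whole file.
open $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $whole_file = do { local $/; <$fh> };
close $fh;

print "First read alone: ", length($first_line), " characters\n";
print "Slurped file:     ", length($whole_file), " characters\n";
```

Using "local $/" inside a do block restores the usual line-by-line
behaviour as soon as the block ends, which is why it is often preferred
over a bare "undef $/".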
Because the value of $_ is a single long string rather than an array, a
foreach loop wouldn't be appropriate here so I've used a while loop
instead.
I want to step through the long string one regex match at a time,
however, and extract each matched substring.
My solution was to replace the value of $_ with that of $' (everything
after the regex match), then search $_ again for the next match.
I suspect there's a better, or at least more common, way to do this, but
I haven't found it yet.
How would the more experienced programmers do it?
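For comparison, one commonly used idiom is a global match (/g) in a
while loop, which resumes each search where the previous match ended.
The following is a minimal sketch, not a definitive answer: $html is a
made-up sample string standing in for the slurped file, and the /s
modifier is an added assumption so that a bold span may cross a line
break.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the slurped contents of the HTML file.
my $html = "<p><B>first</B> and <B>second</B></p>\n<p><B>third</B></p>\n";

my @matches;
# In scalar context, /g continues from the end of the previous match,
# so the string never has to be replaced with $' between matches.
while ($html =~ /<B>(.*?)<\/B>/gs) {
    push @matches, $1;    # $1 is the text between the tags
    print "$1\n";
}
print "There were ", scalar(@matches), " matches.\n";
```

In list context the same pattern returns every capture at once, as in
"my @all = $html =~ /<B>(.*?)<\/B>/gs;". For HTML that is less regular
than this sample, a parser module such as HTML::Parser from CPAN is
likely to be more robust than a regex.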
Thanks for your help!
James
On Mar 18, 2004, at 12:39 AM, Joshua Keroes wrote:
>
> On Mar 17, 2004, at 11:48 PM, James marks wrote:
>> I want to parse an HTML file that exists on the web.
>
> What URL are you fetching?
> What's the desired datum you want?
> Can you post your script here?
>
> Thanks,
> J