[Pdx-pm] Noob...

Fri Mar 19 01:35:53 CST 2004

OK. As I said in an earlier post, I downloaded the HTML file just to 
make the project a little easier, although I know it's possible to 
download the file from the web using a Perl script. I'm just trying to 
take things one step at a time.

Here's my script.

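#!/usr/bin/perl

open FILEIN, "some_web_page.html"    # web page saved to my drive
    or die "Can't open some_web_page.html: $!\n";
local $/;                # slurp mode: read the whole file, not just line 1
$_       = <FILEIN>;
$counter = 0;

while (/<B>.*?<\/B>/s) {    # /s lets . match newlines inside a tag pair
     print "$&\n\n";    # this is just to test the success of the match
     $counter += 1;
     $_ = $';           # keep only the text after the match
}
print "There were $counter matches.\n";
close FILEIN;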

My understanding is this:

The default scalar variable, $_, takes in the entire text of the HTML 
file as a string.

Because the value of $_ is a single long string rather than an array, a 
foreach loop wouldn't be appropriate here, so I've used a while loop 
instead.

I want to step through the long string one regex match at a time, 
however, and extract the matched substring.

My solution was to replace the value of $_ with that of $' (everything 
after the regex match) then search $_ again for the next match.
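
To make that stepping concrete, here is a minimal sketch of the same idea 
run on a short string instead of the saved page. The sample text and the 
bold_matches subroutine name are made up for illustration, and the /s 
modifier is an assumption that a tag pair might span a line break.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text standing in for the saved web page.
my $sample = "<B>one</B> filler <B>two</B> more <B>three</B>";

# Collect every <B>...</B> substring by matching, saving $&,
# and then continuing the search in $' (the text after the match).
sub bold_matches {
    my ($text) = @_;
    my @found;
    while ($text =~ /<B>.*?<\/B>/s) {
        push @found, $&;    # the matched substring
        $text = $';         # keep only what follows the match
    }
    return @found;
}

my @matches = bold_matches($sample);
print "$_\n" for @matches;
print "There were ", scalar(@matches), " matches.\n";
```

Each pass shortens the string, so the loop ends once no <B>...</B> pair 
is left in the remaining text.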

I suspect there's a better, or at least more common, way to do this, but 
I haven't found it yet.

How would the more experienced programmers do it?

Thanks for your help!

James

On Mar 18, 2004, at 12:39 AM, Joshua Keroes wrote:

>
> On Mar 17, 2004, at 11:48 PM, James Marks wrote:
>> I want to parse an HTML file that exists on the web.
>
> What URL are you fetching?
> What's the desired datum you want?
> Can you post your script here?
>
> Thanks,
> J