[Pdx-pm] Noob...

Thu Mar 18 02:38:23 CST 2004

On Wed, Mar 17, 2004 at 11:48:13PM -0800, James marks wrote:
> Hi,
> 
> I'm wondering if you can give me some help. I'm trying to learn Perl 
> but could use some advice as I go about trying to solve a practice 
> project (I'm just getting started at learning Perl):
> 
> I want to parse an HTML file that exists on the web.
> 
> I know I can have Perl retrieve the file but, to make things simpler, 
> I've downloaded it to my hard drive.
> 

	Sure. I hear the LWP module is handy for that, but as you say, it's
best to make life easy to start off with.

> My attempts so far have involved reading the entire file into a 
> variable (I'm using a demo version of Late Night Software's Affrus, by 
> the way) then trying to parse the string contained by the variable.
> 
	You might consider looking for something from cpan.org, e.g.
HTML::Parser, which would do what you want without reinventing the wheel.
Further, you can usually use the CPAN module to do the work of building and
installing it for you, by running 'perl -MCPAN -e shell'.
	But sometimes it's worthwhile to build your own wheel so you can
get a better feel for what's involved with building one.

> Is it "better" Perl programming to have the Perl script read the file 
> in one line at a time or is reading the entire file into a variable ok?
> 

	If you are reading in a file which is small compared to the available
system RAM, it's usually faster and simpler to read it in all at once. If you
are reading in a huge file, such as a web server log file, it's better to do it
in small chunks or line by line. In general, one big read is faster than a
lot of little ones, unless you are going to starve your machine for RAM.
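
	As a rough sketch of the two approaches: the temp file is only there
so the example runs on its own, and the sub names slurp_file and count_lines
are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small sample file so the sketch is self-contained.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "<html>\n<body>whee</body>\n</html>\n";
close $tmp_fh or die "close failed: $!";

# Way 1: slurp the whole file into one scalar.
# Undefining $/ (the input record separator) makes <$fh> return everything.
sub slurp_file {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    local $/;
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Way 2: read one line at a time; only the current line is held in memory.
sub count_lines {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++;
    }
    close $fh;
    return $count;
}

my $buffer = slurp_file($path);
printf "slurped %d characters, %d lines\n", length($buffer), count_lines($path);
```

	For a downloaded web page, the slurp version is probably the one you
want, since a tag can span more than one line.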

> How do I have the script jump from the first regex match to the next? 
> I've tried several loops but none have worked so far. Should I be 
> trying to work with the string contained in the $' ($POSTMATCH) 
> variable?
> 
	The way _I_ usually do this is to have the regex eat some of the
data in the buffer, using s///, e.g.:

my($buffer);
$buffer = "<html><head><title>my page</title></head><body>whee</body></html>";

my($before_tag, $tag);
# Regex: remove everything up to and including the first <...> tag.
# Everything before the tag goes into $1 and the tag itself into $2,
# because they are in ()s. The /s lets . match newlines too.
while ( $buffer =~ s/^(.*?)(<[^>]+>)//s ) {
  $before_tag = $1;
  $tag = $2;
  # Do stuff with the tag...
  # ...
}

	This probably doesn't do what you intend to do with your parser,
but hopefully it will help you understand the mechanism. As the aged saw
goes, "TMTOWTDI". That's one way.
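
	Another way, as a minimal sketch, is the /g modifier. In scalar
context each match picks up where the previous one stopped, so you can loop
over the tags without eating the buffer and without touching $'. The @tags
array is just a name picked for the example.

```perl
use strict;
use warnings;

my $buffer = "<html><head><title>my page</title></head><body>whee</body></html>";

# In scalar context, /g makes each match start where the previous one ended,
# so the loop walks through the tags without modifying $buffer.
my @tags;
while ( $buffer =~ /(<[^>]+>)/g ) {
    push @tags, $1;
}

print scalar(@tags), " tags found: @tags\n";
```

	Like the s/// loop, this is only meant to show the mechanism. It will
trip over things like a ">" inside an attribute value or a comment.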

> Are Perl scripts normally invoked in the Terminal? I don't see how, 
> other than making an executable, a script would be launched from the 
> GUI.
> 
	See 'perldoc perlrun'.
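
	As a minimal sketch: the first line (the "shebang") tells the shell
which interpreter to use, so after 'chmod +x hello.pl' you can run the script
as './hello.pl', or you can skip that and just run 'perl hello.pl'. The file
name hello.pl and the greeting sub are made up for the example, and the path
to perl may differ on your system.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a greeting from the command-line arguments, or a default if none given.
sub greeting {
    my @names = @_;
    return "Hello, " . (@names ? join(", ", @names) : "world") . "!";
}

print greeting(@ARGV), "\n";
```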

> Again, I'm just starting out in Perl so I'm sure some of these 
> questions seem pretty elementary. I'm coming to Perl from about 3 years 
> using AppleScript so I think I have a handle on the concepts of 
> variables, globals, loops, branching, libraries, etc... but I haven't 
> really worked with the UNIX command line before this.
> 

	Yeah, that's great. Any previous programming experience definitely
helps. Welcome to the Perl community.

	Austin