tkil at scrye.com
Thu Mar 18 03:00:45 CST 2004
>>>>> "James" == James marks <jamarks at jamarks.com> writes:
James> I know I can have Perl retrieve the file but, to make things
James> simpler, I've downloaded it to my hard drive.
James> My attempts so far have involved reading the entire file into a
James> variable (I'm using a demo version of Late Night Software's
James> Affrus, by the way) then trying to parse the string contained
James> by the variable.
If you just want to get the entire contents into a single variable,
there's LWP::Simple (LWP stands for "libwww-perl"):
    use LWP::Simple qw( get );
    my $page = get "http://blah.whatever.com";
    defined $page or die "could not fetch the page\n";  # get returns undef on failure
James> Is it "better" Perl programming to have the Perl script read
James> the file in one line at a time or is reading the entire file
James> into a variable ok?
Having only the minimal amount of text in memory is a good thing...
but keeping track of what is "minimal" can be difficult. As an
example, I often write HTML that looks like this:
If you read it line-by-line, you will very probably miss that.
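As a rough sketch of the difference, assume the HTML in question has a tag broken across two lines (the sample text and the regex below are made up for illustration). An in-memory filehandle stands in for the downloaded file; for a real file, put the file name where \$html is.

```perl
use strict;
use warnings;

# Hypothetical input: an anchor tag whose attribute wraps onto a second line.
my $html = <<'HTML';
<p>See <a
   href="http://blah.whatever.com/">the page</a> for details.</p>
HTML

# Deliberately simple pattern: "<a", whitespace, then a double-quoted href.
my $link_re = qr/<a\s+href="([^"]*)"/;

# Line-by-line: each line is matched on its own, so the split tag is missed.
my @by_line;
open my $fh, '<', \$html or die "cannot open in-memory file: $!";
while ( my $line = <$fh> ) {
    push @by_line, $1 if $line =~ $link_re;
}
close $fh;

# Slurp: with $/ localized to undef, <$fh> returns the whole file at once.
open $fh, '<', \$html or die "cannot open in-memory file: $!";
my $whole = do { local $/; <$fh> };
close $fh;
my @slurped = $whole =~ /$link_re/g;

print "line-by-line found: ", scalar(@by_line), "\n";
print "slurped found:      ", scalar(@slurped), "\n";
```

The slurped version sees the newline inside the tag as ordinary whitespace, which is why \s+ can match across it.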
James> How do I have the script jump from the first regex match to
James> the next?
The typical idiom is to use \G in conjunction with /gc flags. There
are good examples of this in Friedl's _Mastering Regular Expressions_.
In the free documentation, look at the use of /gc in "perldoc perlre"
(also available online at www.perldoc.com).
(For what it's worth, the example in _MRE_ is exactly that of doing
primitive parsing of HTML.)
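A minimal sketch of the idiom (not the example from the book; the input string and token names are invented here). Each pattern is anchored with \G so it can only match where the previous match stopped, and /c keeps a failed match from resetting that position.

```perl
use strict;
use warnings;

my $html = '<b>bold</b> and <i>italic</i>';
my @tokens;

# /g advances pos($html) after each successful match.
# /c leaves pos($html) alone when a match fails, so the next
# alternative is tried from the same spot.
while (1) {
    if    ( $html =~ /\G<\/(\w+)>/gc )    { push @tokens, "close:$1" }
    elsif ( $html =~ /\G<(\w+)[^>]*>/gc ) { push @tokens, "open:$1" }
    elsif ( $html =~ /\G([^<]+)/gc )      { push @tokens, "text:$1" }
    else                                  { last }   # end of input, or something unrecognized
}

print "$_\n" for @tokens;
```

This is primitive on purpose: the tag pattern stops at the first ">", so it is only good for simple markup.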
Finally, note that there are modules which can parse HTML and do
various things with it, including callbacks in the SAX style (which
basically means that you can have the module do the "hard work" of
parsing, while you subclass it and figure out what to do when it spots
an opening tag, a closing tag, a comment, plain text, etc...)
Shockingly enough, one of them is called "HTML::Parser", although I am
not sure if there are other more-preferred methods these days.
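For a feel of the event style, here is a small sketch using HTML::Parser (a CPAN module, so it has to be installed). It uses the handler interface (api_version 3) rather than subclassing, which is just a choice made for brevity; the sample HTML and the event labels are invented.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $html = '<p class="x">Hi <!-- note --> there</p>';
my @events;

# Each handler is a code ref plus an "argspec" string naming what it receives.
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub { push @events, "start:$_[0]" },   'tagname' ],
    end_h       => [ sub { push @events, "end:$_[0]" },     'tagname' ],
    comment_h   => [ sub { push @events, "comment:$_[0]" }, 'text' ],
    text_h      => [ sub { push @events, "text:$_[0]" },    'dtext' ],
);

$parser->parse($html);
$parser->eof;    # tell the parser there is no more input

print "$_\n" for @events;
```

The module does the tokenizing; the handlers only decide what to do with each event. Note that plain text may arrive in more than one chunk, so a handler that collects text should append rather than assign.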
There are also modules that parse HTML into memory (instead of a
stream of events); HTML::TreeBuilder used to be one of these, if I
recall correctly.
If you can find some cached versions of my perl samples, there are
some that deal with this sort of thing (e.g., extracting links +
descriptions from HTML). The disk they are hosted on has gone totally
dead, however, and I haven't had the opportunity to rebuild it.
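This is not one of those lost samples, just a rough sketch of the links + descriptions idea using HTML::TreeBuilder (a CPAN module; the sample HTML is invented):

```perl
use strict;
use warnings;
use HTML::TreeBuilder ();

my $html = '<p>See <a href="http://blah.whatever.com/">the <b>page</b></a>.</p>';

# Parse the whole document into an in-memory tree.
my $tree = HTML::TreeBuilder->new_from_content($html);

my @links;    # [ href, description ] pairs
for my $anchor ( $tree->look_down( _tag => 'a' ) ) {
    my $href = $anchor->attr('href');
    next unless defined $href;                  # skip named anchors without href
    push @links, [ $href, $anchor->as_text ];   # as_text drops nested markup
}

$tree->delete;    # free the tree explicitly

print "$_->[0] => $_->[1]\n" for @links;
```

The trade-off against the event style is memory: the whole document is held as a tree, but in exchange the code can ask for "all the a elements" directly.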
James> Are Perl scripts normally invoked in the Terminal? I don't see
James> how, other than making an executable, a script would be
James> launched from the GUI.
Depending on which GUI you're talking about (OSX? Win32? X window WM
of some sort?), you can likely set up a "shortcut" that invokes the
command-line environment, then runs the script within it.
I don't remember the details, but there is a magic suffix you can add
to file names in OSX that causes them to be interpreted by the shell.
Also, I suspect that OSX is smart enough to understand that executable
bits + "#! line" means executable, but I've not had the opportunity to
try it myself. Randal is a MacHead; perhaps he'll chirp up.
p.s. In addition to www.perldoc.com, familiarize yourself with
search.cpan.org and the perlfaq. In particular, there is an
entry in the FAQ that deals with stripping out HTML, which might
give you some pointers.
If you want to know why I waffle on so many of the HTML-specific
portions, consider that this is not unreasonable HTML:
<img src="blah.png" alt="<bang!>" />
Finding a regex that can properly handle this is exciting. Read
_Mastering Regular Expressions_ for much, much, much more detail.
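To make the problem concrete, here is a small sketch that strips tags from that line two ways. Both regexes are illustrations written for this message, not recommendations, and the surrounding "before"/"after" text is invented.

```perl
use strict;
use warnings;

my $html = 'before <img src="blah.png" alt="<bang!>" /> after';

# Naive: a tag is "<", anything that is not ">", then ">".
# This stops at the ">" inside the alt attribute.
( my $naive = $html ) =~ s/<[^>]*>//g;

# Quote-aware: inside a tag, a quoted string may contain ">".
( my $quoted = $html ) =~ s/<(?:[^>"']|"[^"]*"|'[^']*')*>//g;

print "naive:  [$naive]\n";     # prints: naive:  [before " /> after]
print "quoted: [$quoted]\n";    # prints: quoted: [before  after]
```

The second pattern copes with this one case, but comments, script blocks, and unbalanced quotes will still trip it up, which is the argument for letting a parsing module do the work.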