[Pdx-pm] Learning HTML Parsing

Fri Mar 19 01:48:11 CST 2004

On Thu, 2004-03-18 at 23:35, James marks wrote:

> The default scalar variable, $_, takes in the entire text of the HTML 
> file as a string.

Unfortunately, no.  You've (innocently and inadvertently) asked Perl for
one thing, and Perl obliges by giving you one line from the file.

If you want the whole thing, use this instead:

	local $/;
	$_ = <FILEIN>;

$/ is a magic variable that contains the "input record separator". 
That's the Perlish way of specifying the characters that come at the end
of a record.  By default, it's a newline (whatever that is on your
platform).

By localizing it without assigning a new value, you're setting it to
undef for the rest of the enclosing block, so Perl has no separator to
stop at and will slurp up the entire file into $_, which is what you
want.
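
Since the localized value lasts until the end of the enclosing block,
it's usually worth keeping that block small so other reads still go
line by line.  Here's a minimal sketch of one way to do that; the
slurp_file name and the lexical filehandle are my own illustration, not
anything from your script:

```perl
use strict;
use warnings;

# Hypothetical helper: return the whole contents of a file as one string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    # $/ is undef only until this sub returns; callers keep the default.
    local $/;
    my $content = <$fh>;
    close $fh;
    return $content;
}
```

You'd call it as $_ = slurp_file('page.html'); where the file name is
whatever you're actually reading.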

As for the actual parsing of HTML, I'm pretty happy with running HTML
through tidy, producing XML, then using XML::Parser's SAX-style callbacks, but
that's not exactly what you asked and it's pretty cruel to inflict both
XML and SAX on someone at this hour.

It'd be easier to give you pointers if you gave us a snippet of real
HTML and told us what you were looking for in it, though.  Sometimes
regexes are pretty quick.
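
For a sense of what "pretty quick" can look like once the whole file is
in one string, here's a rough sketch that pulls out the page title.  The
extract_title name and the sample markup are made up for illustration,
and a pattern like this only holds up on simple, well-behaved HTML:

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of the first <title> element,
# or undef if there isn't one.
sub extract_title {
    my ($html) = @_;
    # /i ignores tag case; /s lets . match newlines inside the title.
    # .*? is non-greedy, so it stops at the first closing tag.
    my ($title) = $html =~ m{<title\b[^>]*>(.*?)</title>}is;
    return undef unless defined $title;
    $title =~ s/\s+/ /g;        # collapse whitespace runs, newlines included
    $title =~ s/^ | $//g;       # trim a leading or trailing space
    return $title;
}

# Made-up sample input standing in for a slurped file.
my $sample = "<html><head><TITLE>Learning\nHTML Parsing</TITLE></head></html>";
print extract_title($sample), "\n";
```

Note that this depends on having slurped the file first; reading line by
line, a title that wraps across two lines would never match.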

-- c