[Pdx-pm] Learning HTML Parsing

James marks jamarks at jamarks.com
Fri Mar 19 02:25:13 CST 2004


On Mar 18, 2004, at 11:48 PM, chromatic wrote:

>> The default scalar variable, $_, takes in the entire text of the HTML
>> file as a string.
>
> Unfortunately, no.  You've (innocently and inadvertently) asked Perl for
> one thing, and Perl obliges by giving you one line from the file.
>
> If you want the whole thing, use this instead:
>
> 	local $/;
> 	$_ = <FILEIN>;

I'm not sure I understand. When I run the script in Affrus (my Perl 
debugger) it shows the value of $_ as containing the entire HTML file. 
The string does contain line breaks, but that doesn't make it something 
other than a string, does it? (Like an array, for example.)

The script I posted does, in fact, loop through the regex matches as I 
wanted, at least on my machine (Mac OS X).

> $/ is a magic variable that contains the "input record separator".

So, the default for $/ is newline, correct? I don't necessarily want 
to limit my regex search to one line, however.
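
A minimal sketch of the difference under discussion, assuming a stand-in 
string in place of the real HTML file (the $html data and the variable 
names are hypothetical, not taken from the posted script). It reads the 
same data twice: once with the default $/ (newline), and once with $/ 
undefined, as in the suggestion quoted above.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the HTML file.
my $html = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>\n";

# With the default $/ ("\n"), a scalar read returns a single line.
open(my $in, '<', \$html) or die "Cannot open in-memory file: $!";
my $first = <$in>;
close($in);

# With $/ undefined (localized to the do block), the same kind of
# read returns everything up to end of file.
open($in, '<', \$html) or die "Cannot open in-memory file: $!";
my $whole = do { local $/; <$in> };
close($in);

print "line read:  ", length($first), " chars\n";
print "slurp read: ", length($whole), " chars\n";
```

If a file contains no "\n" characters at all (for example, one saved 
with old Mac-style "\r" line endings), the first read would also return 
the whole file, which may or may not be what is happening here.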

> As for the actual parsing of HTML, I'm pretty happy with running HTML
> through tidy, producing XML, then using XML::Parser's SAX callbacks, but
> that's not exactly what you asked and it's pretty cruel to inflict both
> XML and SAX on someone at this hour.

:) Thanks!

> It'd be easier to give you pointers if you gave us a snippet of real
> HTML and what you were looking for in it though.  Sometimes regexes are
> pretty quick.

Actually, I'm just trying to understand the best method for looping 
through regex matches in a long text document. I'm not actually 
searching for any substring in particular. I suspect there's a better 
way than the one I've used (swapping $' for $_).
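
One possible alternative, sketched here with made-up data: a while loop 
around a match with the /g modifier. In scalar context each iteration 
resumes where the previous match ended, so there is no need to copy $' 
back into $_. The sample text and the href pattern are assumptions for 
illustration only, not what the posted script searches for.

```perl
use strict;
use warnings;

# Hypothetical sample text; the real input would be the slurped file.
my $text = qq{<a href="one.html">1</a>\n<a href="two.html">2</a>\n};

my @found;

# In scalar context, /g makes each match start after the previous one.
# The loop ends when no further match is found.
while ($text =~ /href="([^"]*)"/g) {
    push @found, $1;    # $1 holds the captured attribute value
}

print "$_\n" for @found;
```

Inside the loop, pos($text) gives the offset at which the next search 
will begin, which can be handy when debugging.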

Thanks,
James



