[Pdx-pm] Learning HTML Parsing
jamarks at jamarks.com
Fri Mar 19 02:25:13 CST 2004
On Mar 18, 2004, at 11:48 PM, chromatic wrote:
>> The default scalar variable, $_, takes in the entire text of the HTML
>> file as a string.
> Unfortunately, no. You've (innocently and inadvertently) asked Perl
> one thing, and Perl obliges by giving you one line from the file.
> If you want the whole thing, use this instead:
> local $/;
> $_ = <FILEIN>;
I'm not sure I understand. When I run the script in Affrus (my Perl
debugger) it shows the value of $_ as containing the entire HTML file.
The string does contain line breaks, but that doesn't make it something
other than a string, does it? (Like an array, for example.)
The script I posted does, in fact, loop through the regex matches as I
wanted, at least on my machine (Mac OS X).
> $/ is a magic variable that contains the "input record separator".
So, the default for $/ is newline, correct? I'm not necessarily wanting
to limit my regex search to one line, however.
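
A minimal sketch of the difference between the two read modes, for
comparison. The sample string and the in-memory filehandle are
assumptions standing in for the real HTML file and FILEIN, and the
variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the HTML file.
my $html = "<html>\n<body>\n<p>Hi</p>\n</body>\n</html>\n";

# Default: $/ is "\n", so a read in scalar context returns one line.
open(my $in, '<', \$html) or die "Cannot open in-memory file: $!";
my $first_line = <$in>;
close($in);

# Slurp mode: with $/ undefined (local leaves it undef inside the block),
# a single read returns everything up to end of file.
open($in, '<', \$html) or die "Cannot open in-memory file: $!";
my $whole_file = do { local $/; <$in> };
close($in);

print "line read:  ", length($first_line), " chars\n";
print "slurp read: ", length($whole_file), " chars\n";
```

Wrapping local $/ in a do block restores the newline default as soon as
the block ends, so later reads in the same script are unaffected.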
> As for the actual parsing of HTML, I'm pretty happy with running HTML
> through tidy, producing XML, then using XML::Parser's SAX callbacks,
> but that's not exactly what you asked and it's pretty cruel to inflict both
> XML and SAX on someone at this hour.
> It'd be easier to give you pointers if you gave us a snippet of real
> HTML and what you were looking for in it though. Sometimes regexes are
> pretty quick.
Actually, I'm just trying to understand the best method for looping
through regex matches in a long text document. I'm not actually
searching for any substring in particular. I suspect there's a better
way than the one I've used (swapping $' for $_).
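
One common alternative to swapping $' for $_ is a while loop over a
match with the /g modifier. This is only a sketch: the sample text and
the href pattern are hypothetical, since no particular substring is
being searched for here.

```perl
use strict;
use warnings;

# Hypothetical input; any long string would do.
my $text = qq{<a href="one.html">One</a>\n<a href="two.html">Two</a>\n};

my @found;
# In scalar context, a /g match resumes where the previous match ended,
# so $text itself is never modified or reassigned.
while ($text =~ /href="([^"]*)"/g) {
    push @found, $1;
    print "matched $1, next search starts at offset ", pos($text), "\n";
}

# In list context, the same /g match returns every capture at once.
my @all = $text =~ /href="([^"]*)"/g;
print "found ", scalar(@all), " matches in total\n";
```

Perl tracks the resume position per string (visible through pos), which
is what makes the loop advance without touching $' at all.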