[Pdx-pm] Noob...
Ovid
curtis_ovid_poe at yahoo.com
Fri Mar 19 10:16:00 CST 2004
Hi James,
> but could use some advice as I go about trying to solve a practice
> project (I'm just getting started at learning Perl):
>
> I want to parse an HTML file that exists on the web
Obviously, with the advice people have given you, it's clear that there
are many ways of approaching this problem. Here's one way that's easy
to understand and use.
use LWP::Simple; # get the web page
use HTML::FormatText::WithLinks; # parse the HTML
# get is from LWP::Simple
my $document = get('http://www.cnn.com/');
# always do sanity checking unless you want difficult-to-fix bugs
die "No document" unless $document;
my $f = HTML::FormatText::WithLinks->new;
my $text = $f->parse($document);
print $text;
That allows for easy formatting. Here's a variation that also extracts
the text. You lose some formatting, but the module gives you greater
control over what you're trying to parse.
use LWP::Simple; # get the web page
use HTML::TokeParser::Simple; # parse the web page
use HTML::Entities; # decode special characters
# get is from LWP::Simple
my $document = get('http://www.yahoo.com/');
die "No document" unless $document;
my $parser = HTML::TokeParser::Simple->new(\$document);
$parser->get_tag('body'); # advance to first body tag
while (my $token = $parser->get_token) {
    next unless $token->is_text;
    # decode_entities is from HTML::Entities
    print decode_entities($token->as_is);
}
That while{} loop is where you would be doing the bulk of your parsing.
Did you only want to print out the HTML comments? It's pretty simple:
while (my $token = $parser->get_token) {
    next unless $token->is_comment;
    # decode_entities is from HTML::Entities
    print decode_entities($token->as_is);
}
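The same loop shape works for other token types. As a rough sketch (the extract_links helper and the sample markup are made up for illustration, and it assumes a version of HTML::TokeParser::Simple recent enough to provide get_attr), here is one way to collect the targets of the links on a page:

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not from the original post): return the href
# value of every <a> start tag found in an HTML string.
sub extract_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new(\$html);
    my @links;
    while (my $token = $parser->get_token) {
        next unless $token->is_start_tag('a');
        my $href = $token->get_attr('href');
        # anchors such as <a name="..."> have no href, so skip them
        push @links, $href if defined $href;
    }
    return @links;
}

# Made-up sample markup, so the sketch runs without network access
my $sample = '<body><a href="/one">One</a> <a name="x">no href</a> '
           . '<a href="/two">Two</a></body>';
print "$_\n" for extract_links($sample);
```

To use it on a live page, you would pass in the $document fetched with get instead of the sample string.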
See the docs on CPAN.org for more information.
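On the sanity-checking point: get from LWP::Simple only tells you that something went wrong, not what. If you want the reason in your error message, one option is LWP::UserAgent. This is only a sketch; the fetch_or_die name and the 10 second timeout are my own choices for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper (not from the original post): fetch a URL and
# die with the response's status line if the request fails.
sub fetch_or_die {
    my ($url) = @_;
    my $ua       = LWP::UserAgent->new( timeout => 10 );
    my $response = $ua->get($url);
    die "Can't get $url: " . $response->status_line . "\n"
        unless $response->is_success;
    # decoded_content undoes any transfer encoding and charset
    return $response->decoded_content;
}

# Only fetches when a URL is given on the command line
print fetch_or_die( $ARGV[0] ) if @ARGV;
```

The returned string can be handed to either parser in place of the $document from get.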
Good luck!
Cheers,
Ovid
=====
Silence is Evil http://users.easystreet.com/ovid/philosophy/indexdecency.htm
Ovid http://www.perlmonks.org/index.pl?node_id=17000
Web Programming with Perl http://users.easystreet.com/ovid/cgi_course/