[Pdx-pm] Noob...

Ovid curtis_ovid_poe at yahoo.com
Fri Mar 19 10:16:00 CST 2004


Hi James,

> but could use some advice as I go about trying to solve a practice 
> project (I'm just getting started at learning Perl):
> 
> I want to parse an HTML file that exists on the web

Obviously, with the advice people have given you, it's clear that there
are many ways of approaching this problem.  Here's one way that's easy
to understand and use.

  use LWP::Simple;                 # get the web page
  use HTML::FormatText::WithLinks; # parse the HTML

  # get is from LWP::Simple
  my $document = get('http://www.cnn.com/');

  # always do sanity checking unless you want difficult-to-fix bugs
  die "No document" unless $document;

  my $f = HTML::FormatText::WithLinks->new;
  my $text = $f->parse($document);
  print $text;

That allows for easy formatting, but here's a variation that also
extracts the text of the page.  You lose some formatting, but the
module gives you greater control over what you're trying to parse:

  use LWP::Simple;              # get the web page
  use HTML::TokeParser::Simple; # parse the web page
  use HTML::Entities;           # decode special characters

  # get is from LWP::Simple
  my $document = get('http://www.yahoo.com/');
  die "No document" unless $document;

  my $parser = HTML::TokeParser::Simple->new(\$document);
  $parser->get_tag('body'); # advance to first body tag

  while (my $token = $parser->get_token) {
    next unless $token->is_text;
    # decode_entities is from HTML::Entities
    print decode_entities($token->as_is);
  }

That while{} loop is where you would be doing the bulk of your parsing.
Did you only want to print out HTML comments?  It's pretty simple:

  while (my $token = $parser->get_token) {
    next unless $token->is_comment;
    # decode_entities is from HTML::Entities
    print decode_entities($token->as_is);
  }
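
If you want several kinds of token at once, the same loop can branch on
the token type.  Here's a minimal sketch of that idea.  To keep it
runnable without a network connection, it parses an inline string
($html, a made-up document for illustration) rather than a fetched page,
and the @text, @comments and @links arrays are just names I picked.  It
also shows is_start_tag and get_attr, which HTML::TokeParser::Simple
provides for looking at tags and their attributes.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple; # parse the HTML
use HTML::Entities;           # decode special characters

# Hypothetical inline document, so the sketch runs without fetching anything
my $html = <<'END_HTML';
<html><body>
<!-- navigation starts here -->
<p>Fish &amp; chips</p>
<a href="http://www.perl.org/">Perl</a>
</body></html>
END_HTML

my $parser = HTML::TokeParser::Simple->new(\$html);

my (@text, @comments, @links);
while (my $token = $parser->get_token) {
    if ($token->is_text) {
        # skip the whitespace-only text that sits between tags
        next unless $token->as_is =~ /\S/;
        push @text, decode_entities($token->as_is);
    }
    elsif ($token->is_comment) {
        push @comments, $token->as_is;
    }
    elsif ($token->is_start_tag('a')) {
        # get_attr returns the value of the named attribute, if there is one
        my $href = $token->get_attr('href');
        push @links, $href if defined $href;
    }
}

print "Text:    $_\n" for @text;
print "Comment: $_\n" for @comments;
print "Link:    $_\n" for @links;
```

Collecting into arrays first, rather than printing inside the loop,
makes it easier to post-process or test what you found.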

See the docs on CPAN.org for more information.

Good luck!

Cheers,
Ovid

=====
Silence is Evil            http://users.easystreet.com/ovid/philosophy/indexdecency.htm
Ovid                       http://www.perlmonks.org/index.pl?node_id=17000
Web Programming with Perl  http://users.easystreet.com/ovid/cgi_course/


