Phoenix.pm: HTML::Parser weirdness

Mon May 1 09:09:58 CDT 2000

Scott Walters wrote:
> 
> Hi all, Doug..
> 
> [Btw, from previous message, Hall Kinion is a headhunter, and not an
> outstanding one at that].
> 
> If 'E0' was showing up, I would have a guess, but I see no significance to
> 'A0'... If you're still battling this, post sources and let us have a go!
> 
> -scott
> 
> On Tue, 25 Apr 2000, Douglas E. Miles wrote:
> 
> > Anyone out there using HTML::Parser?  I'm using it to extract just text
> > from HTML files.  The strange thing is that hex A0 keeps showing up in
> > the extracted text, but does not appear in the original file.  Right
> > now, I'm using a regex to filter them out, but I'd like to understand
> > where they're coming from, and why.  Any ideas?  Thanks.

Sorry this took so long.  I've just been completely buried recently. 
Here is an example that comes with HTML::Parser, that I've hacked to
show the problem.  Attached is the program, htext, and a test html file,
admin.html.  Just type htext admin.html > admin.txt (after making it
executable), and you will see A0s in admin.txt.
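
For reference, the regex workaround mentioned in the original question
can be as small as the sketch below. It assumes the stray character is
the single byte 0xA0; the sample string and the variable names are made
up for illustration and are not taken from htext or admin.html.

```perl
use strict;
use warnings;

# Hypothetical sample: extracted text with a stray 0xA0 between words.
my $text = "User\xA0Admin\xA0Page";

# Copy the text, then replace each 0xA0 with an ordinary space.
# (Use s/\xA0//g instead to drop the character entirely.)
(my $clean = $text) =~ s/\xA0/ /g;

print "$clean\n";
```

This only hides the symptom; it says nothing about where the A0s come
from.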

-- 
- Doug

"A synonym is a word you use when you can't spell the 
word you first thought of."
--Burt Bacharach
-------------- next part --------------
#!/usr/bin/perl -w

# Extract all plain text from an HTML file

use strict;
use HTML::Parser 3.00 ();
my $text;

my %inside;

sub tag
{
   my($tag, $num) = @_;
   $inside{$tag} += $num;
   print " ";  # not for all tags
}

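sub text
{
    # Skip text inside <script> and <style> elements.
    return if $inside{script} || $inside{style};

    # Keep a running copy, but print only the new chunk; printing $text
    # here would re-emit everything accumulated so far on every call.
    $text .= $_[0];
    print $_[0];
}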

HTML::Parser->new(api_version => 3,
		  handlers    => [start => [\&tag, "tagname, '+1'"],
				  end   => [\&tag, "tagname, '-1'"],
				  text  => [\&text, "dtext"],
				 ],
		  marked_sections => 1,
	)->parse_file(shift) || die "Can't open file: $!\n";
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.pm.org/archives/phoenix-pm/attachments/20000501/1471063e/admin.html