Phoenix.pm: HTML::Parser weirdness
Douglas E. Miles
doug.miles at bpxinternet.com
Mon May 1 09:09:58 CDT 2000
Scott Walters wrote:
>
> Hi all, Doug..
>
> [Btw, from previous message, Hall Kinion is a headhunter, and not an
> outstanding one at that].
>
> If 'E0' was showing up, I would have a guess, but I see no significance to
> 'A0'... If you're still battling this, post sources and let us have a go!
>
> -scott
>
> On Tue, 25 Apr 2000, Douglas E. Miles wrote:
>
> > Anyone out there using HTML::Parser? I'm using it to extract just text
> > from HTML files. The strange thing is that hex A0 keeps showing up in
> > the extracted text, but does not appear in the original file. Right
> > now, I'm using a regex to filter them out, but I'd like to understand
> > where they're coming from, and why. Any ideas? Thanks.
Sorry this took so long. I've just been completely buried recently.
Here is an example that comes with HTML::Parser, which I've hacked to
show the problem. Attached are the program, htext, and a test HTML file,
admin.html. Just type htext admin.html > admin.txt (after making htext
executable), and you will see A0s in admin.txt.
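
The regex workaround mentioned in the quoted message could look roughly like the sketch below. It is not the exact filter in use; it assumes the extracted text is held in a scalar and that the stray characters are exactly hex A0. The helper names strip_a0 and find_a0 are hypothetical. The second helper only reports where the A0s fall, which may help line them up against the original HTML.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: replace every 0xA0 character with a plain space.
sub strip_a0 {
    my ($s) = @_;
    $s =~ tr/\xA0/ /;
    return $s;
}

# Hypothetical helper: return the zero-based offsets at which 0xA0 occurs.
sub find_a0 {
    my ($s) = @_;
    my @offsets;
    while ($s =~ /\xA0/g) {
        push @offsets, pos($s) - 1;   # pos() points just past the match
    }
    return @offsets;
}

# Made-up sample string standing in for the extracted text.
my $sample = "Admin\xA0Page\xA0";
print strip_a0($sample), "\n";
print join(",", find_a0($sample)), "\n";
```

Comparing the reported offsets with the same spots in admin.html should show what the parser was reading when each A0 was produced.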
--
- Doug
"A synonym is a word you use when you can't spell the
word you first thought of."
--Burt Bacharach
-------------- next part --------------
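#!/usr/bin/perl -w
# Extract all plain text from an HTML file
use strict;
use HTML::Parser 3.00 ();

my $text;      # all text seen so far
my %inside;    # nesting depth per tag name

# Called for every start tag (+1) and end tag (-1).
sub tag
{
    my($tag, $num) = @_;
    $inside{$tag} += $num;
    print " ";    # separates text across tags; ideally not done for all tags
}

# Called for every chunk of text, with entities already decoded (dtext).
sub text
{
    return if $inside{script} || $inside{style};
    $text .= $_[0];
    # Print only the new chunk; printing $text here would repeat
    # everything accumulated so far on each call.
    print $_[0];
}

HTML::Parser->new(api_version => 3,
                  handlers    => [start => [\&tag, "tagname, '+1'"],
                                  end   => [\&tag, "tagname, '-1'"],
                                  text  => [\&text, "dtext"],
                                 ],
                  marked_sections => 1,
        )->parse_file(shift) || die "Can't open file: $!\n";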
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.pm.org/archives/phoenix-pm/attachments/20000501/1471063e/admin.html