HTML::TreeBuilder, Tidy.exe

Masque masque at pound.perl.org
Thu Nov 16 16:07:51 CST 2000


Majordomo doesn't seem to like subroutine declarations.  :]  I'm commenting out
the line that caused majordomo to reject this and passing the rest on untouched.

Paul.
----- Forwarded message from owner-pdx-pm-list at pm.org -----

Date: Thu, 16 Nov 2000 16:22:05 -0500 (EST)
From: owner-pdx-pm-list at pm.org
To: owner-pdx-pm-list at pm.org
Subject: BOUNCE pdx-pm-list at pm.org:     Admin request of type /^sub\b/i at line 8  

Date: Thu, 16 Nov 2000 13:20:28 -0800
From: Jeff Zucker <jeff at vpservices.com>
X-Mailer: Mozilla 4.7 [en] (Win98; U)
MIME-Version: 1.0
To: Daniel Chetlin <daniel at chetlin.com>
CC: pdx-pm-list at pm.org
Subject: HTML::TreeBuilder, Tidy.exe
References: <sa0ead01.009 at gwsmtp.ohsu.edu> <20001115142918.J314 at surly.eli.net> <20001116022715.A999 at darkstar.chetlin.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Daniel, thanks for a great talk the other night.  I've been
experimenting with TreeBuilder.  Here's a snippet that changes the
base href if one exists, or inserts one if none exists.  (Not that I ever
use base hrefs; I wrote it in response to the clpm user who requested
it, but he was too rude to Randal for me to send it to him.)  Is this
how you'd do it?

# sub insert_base {
    my($html_string,$new_URI) = @_;
    use HTML::TreeBuilder;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html_string);
    $tree->eof;
    my $head = $tree->look_down('_tag','head');
    my $base = $tree->look_down('_tag','base')
            || $head->new('base');
    $base->{href} = $new_URI;
    $head->push_content($base);
    $html_string = $tree->as_HTML;
    $tree->delete;
    return($html_string);
}

Interestingly, this works regardless of whether the original HTML
includes a head tag or not, since TreeBuilder seems to insert an
implicit one if none exists.
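
For illustration, the same idea can be sketched with HTML::Element's
documented accessors (attr and unshift_content) in place of direct hash
access.  The name set_base_href and the example.com URL are placeholders,
not from the snippet above, and this is only one way to do it:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper name; not part of the original snippet.
sub set_base_href {
    my ($html_string, $new_uri) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html_string);
    $tree->eof;

    # With default settings TreeBuilder supplies an implicit <head>
    # when the input has none, so this lookup is expected to succeed.
    my $head = $tree->look_down('_tag', 'head');

    # Reuse an existing <base> inside <head>, or create one at the top.
    my $base = $head->look_down('_tag', 'base');
    unless ($base) {
        $base = HTML::Element->new('base');
        $head->unshift_content($base);
    }
    $base->attr('href', $new_uri);

    my $out = $tree->as_HTML;
    $tree->delete;    # free the tree explicitly
    return $out;
}

# Placeholder input with no <head> at all.
print set_base_href('<p>no head here</p>', 'http://example.com/'), "\n";
```

Putting the new base element first in the head, rather than pushing it
onto the end, keeps it ahead of any relative links that follow it there.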

Also, I wanted to mention a great resource one might want to use in
conjunction with HTML::Parser or HTML::TreeBuilder: the W3C's tidy.exe
program, which does a good job of cleaning up bad HTML, producing XHTML,
and several other tasks.

-- 
Jeff

----- End forwarded message -----
TIMTOWTDI


