[sf-perl] XML Parsing question

Bill Moseley moseley at hank.org
Fri Mar 4 17:45:02 PST 2011


Yeah, best to avoid XML when possible.  For now I just hacked this in to deal
with the entity encoding:

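{
    no warnings 'redefine';

    # One-off override: emit text as CDATA instead of numeric entities.
    sub HTML::Element::_xml_escape {
        for ( @_ ) {
            # Skip (rather than return) so any later arguments still get processed.
            next unless length && /[<&]/;
            # "]]>" cannot appear inside CDATA, and character references are
            # not expanded there, so split it across two CDATA sections.
            s{]]>}{]]]]><![CDATA[>}g;
            $_ = "<![CDATA[$_]]>";
        }
    }
}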

It's likely only good for this one-off, but I'm still curious what the "right"
way to handle this would be.
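
For what it's worth, here is a minimal sketch of two small pieces that speak to
the questions quoted below, using only core Perl.  The helper names cdata_wrap
and with_entity_decls are made up for illustration and are not part of
HTML::Element or XML::TreeBuilder.  The first wraps text in CDATA and splits
any embedded "]]>".  The second prepends an internal DTD subset that declares
entities such as &nbsp;, on the assumption that the parser reads the internal
subset (expat-based parsers normally do, but I have not verified this against
every XML::TreeBuilder version).

```perl
use strict;
use warnings;

# Wrap text in a CDATA section.  A literal "]]>" would end the section early,
# so close the section after "]]" and reopen it before ">".
sub cdata_wrap {
    my ($text) = @_;
    $text =~ s/]]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $text . ']]>';
}

# Return $xml with a DOCTYPE carrying an internal subset that declares the
# given entities.  The DOCTYPE goes after the XML declaration if there is one.
# $root must match the document's root element name.
sub with_entity_decls {
    my ( $xml, $root, %entities ) = @_;

    my $decls = join '',
        map { qq{  <!ENTITY $_ "$entities{$_}">\n} } sort keys %entities;
    my $doctype = sprintf "<!DOCTYPE %s [\n%s]>\n", $root, $decls;

    return $xml if $xml =~ s/\A(<\?xml.*?\?>\s*)/$1$doctype/s;
    return $doctype . $xml;
}

print cdata_wrap('<strong>this is&nbsp;strong</strong>'), "\n";
print with_entity_decls(
    qq{<?xml version="1.0" encoding="UTF-8"?>\n<testResults/>\n},
    'testResults',
    nbsp => '&#160;',
);
```

The string returned by with_entity_decls could then be handed to the parser's
parse method instead of calling parse_file on the original path.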

On Thu, Mar 3, 2011 at 3:41 PM, Bill Moseley <moseley at hank.org> wrote:

> I have a collection of XML files that have one or more <result> elements in
> each file.  The goal is to have a script that takes one or more files on
> the command line, gathers up all the <result> elements from each file, and
> combines them into a single XML output file.
>
> I pulled XML::TreeBuilder out (I find TreeBuilder pretty easy for quick
> scripts) and did the following.  I seem to not work with XML that much
> (which I think is a bit lucky), so there may be easier ways to do this.
>
> #!/usr/bin/perl
> use strict;
> use warnings;
> use XML::TreeBuilder;
> use XML::Element;
> use Encode;
>
>
> my $doc = XML::Element->new( 'testResults' );
>
> for my $path ( @ARGV ) {
>     my $tree = XML::TreeBuilder->new;
>     $tree->parse_file( $path );
>
>     $doc->push_content( $tree->look_down( '_tag', 'result' ) );
>
>     $tree->delete;
> }
>
> print join "\n",
>     '<?xml version="1.0" encoding="UTF-8"?>',
>     encode_utf8( $doc->as_XML );
>
>
> That seems to work ok.  But, then I ended up with a file that had a CDATA
> section (which happened to hold a snippet of HTML).  That's fine, but
> $doc->as_XML then encoded the entities.
>
> With this source file:
>
> $ cat test.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <testResults>
>     <result>
>         <content><![CDATA[<strong>this is&nbsp;strong</strong>]]></content>
>     </result>
> </testResults>
>
> When I run it through the script I get:
>
> $ cat new.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <testResults><result>
>         <content>&#60;strong&#62;this is&nbsp;strong&#60;/strong&#62;</content>
>     </result></testResults>
>
> And if I then run *that* file back through the script I get:
>
> undefined entity at line 3, column 40, byte 101 at /usr/lib/perl5/XML/Parser.pm line 187
>
>
> which is choking at the &nbsp;
>
>
> My questions are:
>
> 1) Is there a better approach to doing this that preserves the CDATA
> sections?
>
> 2) Is there a way to define the &nbsp; entity?  I tried adding a DTD to
> define &nbsp;, but wasn't able to make the parser happy in my attempts.
>
>
>
> --
> Bill Moseley
> moseley at hank.org
>



-- 
Bill Moseley
moseley at hank.org