[sf-perl] XML Parsing question

Bill Moseley moseley at hank.org
Thu Mar 3 15:41:31 PST 2011


I have a collection of XML files that have one or more <result> elements in
each file.  The goal is to have a script that takes one or more files on
the command line, gathers up all the <result> elements from each file, and
combines them into a single XML output file.

I pulled XML::TreeBuilder out (I find TreeBuilder pretty easy for quick
scripts) and did the following.  I seem to not work with XML that much
(which I think is a bit lucky), so there may be easier ways to do this.

#!/usr/bin/perl
use strict;
use warnings;
use XML::TreeBuilder;
use XML::Element;
use Encode;


my $doc = XML::Element->new( 'testResults' );

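for my $path ( @ARGV ) {
    my $tree = XML::TreeBuilder->new;
    $tree->parse_file( $path );

    # In list context look_down returns every matching element.
    $doc->push_content( $tree->look_down( '_tag', 'result' ) );

    # Free what is left of the source tree.
    $tree->delete;
}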

print join "\n",
    '<?xml version="1.0" encoding="UTF-8"?>',
    encode_utf8( $doc->as_XML );


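That seems to work OK.  But then I ended up with a file that had a CDATA
section (which happened to hold a snippet of HTML).  That's fine, but
$doc->as_XML dropped the CDATA wrapper and escaped the < and > inside it as
numeric character references, while leaving the &nbsp; as it was.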

With this source file:

$ cat test.xml
<?xml version="1.0" encoding="UTF-8"?>
<testResults>
    <result>
        <content><![CDATA[<strong>this is&nbsp;strong</strong>]]></content>
    </result>
</testResults>

When I run it through the script I get:

$ cat new.xml
<?xml version="1.0" encoding="UTF-8"?>
<testResults><result>
        <content>&#60;strong&#62;this is&nbsp;strong&#60;/strong&#62;</content>
    </result></testResults>

And if I then run *that* file back through the script I get:

undefined entity at line 3, column 40, byte 101 at /usr/lib/perl5/XML/Parser.pm line 187


which is choking on the &nbsp; entity.
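
The failure can be reproduced on its own.  What follows is only a minimal
sketch, assuming XML::Parser (the module named in the error message) is
called directly; parses_ok is a hypothetical helper, not part of the script
above.  XML predefines only five named entities (amp, lt, gt, quot, apos),
so a bare &nbsp; with no declaration is not well-formed, while the numeric
reference &#160; is.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: true if the string is well-formed XML, false if
# the parser dies on it.
sub parses_ok {
    my ($xml) = @_;
    return eval { XML::Parser->new->parse($xml); 1 } ? 1 : 0;
}

# Two predefined entities, one numeric reference, one undeclared entity.
for my $ref ( '&amp;', '&lt;', '&#160;', '&nbsp;' ) {
    printf "%-7s %s\n", $ref,
        parses_ok("<content>$ref</content>") ? 'ok' : 'rejected';
}
```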


My questions are:

1) Is there a better approach to doing this that preserves the CDATA
sections?

2) Is there a way to define the &nbsp; entity?  I tried adding a DTD to
define &nbsp;, but wasn't able to make the parser happy in my attempts.
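
One direction that might cover both questions, as a rough sketch rather
than a tested solution: it assumes XML::LibXML is installed (it is not what
the script above uses), and merge_results is a name made up for
illustration.  libxml2 keeps CDATA sections as their own node type by
default, so copying <result> subtrees with importNode should carry them
through unescaped.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: merge every <result> element from the given files
# into one <testResults> document and return it as a string.
sub merge_results {
    my @paths = @_;

    my $out  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
    my $root = $out->createElement( 'testResults' );
    $out->setDocumentElement( $root );

    for my $path ( @paths ) {
        # expand_entities replaces entities declared in the file's DTD
        # with their values while parsing.
        my $in = XML::LibXML->load_xml(
            location        => $path,
            expand_entities => 1,
        );

        for my $result ( $in->findnodes( '//result' ) ) {
            # importNode makes a deep copy owned by $out; CDATA nodes
            # stay CDATA nodes.
            $root->appendChild( $out->importNode( $result ) );
        }
    }

    return $out->toString;    # bytes in the document's encoding (UTF-8)
}

print merge_results( @ARGV ) if @ARGV;
```

With the CDATA section kept intact, the &nbsp; inside it is plain text and
never reaches the parser as an entity.  For files that use &nbsp; outside
CDATA, an internal DTD subset placed right after the XML declaration, such
as <!DOCTYPE testResults [ <!ENTITY nbsp "&#160;"> ]>, is the standard way
to declare it.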



-- 
Bill Moseley
moseley at hank.org