[sf-perl] XML Parsing question

Bill Moseley moseley at hank.org
Thu Mar 3 15:41:31 PST 2011

I have a collection of XML files, each with one or more <result> elements.
The goal is to have a script that takes one or more files on the command
line, gathers up all the <result> elements from each file, and combines them
into a single XML output file.

I pulled XML::TreeBuilder out (I find TreeBuilder pretty easy for quick
scripts) and did the following.  I seem to not work with XML that much
(which I think is a bit lucky), so there may be easier ways to do this.

use strict;
use warnings;
use XML::TreeBuilder;
use XML::Element;
use Encode;

my $doc = XML::Element->new( 'testResults' );

for my $path ( @ARGV ) {
    my $tree = XML::TreeBuilder->new;
    $tree->parse_file( $path );

    # Collect every <result> element in this file under the new root.
    $doc->push_content( $tree->look_down( '_tag', 'result' ) );
}

print join "\n",
    '<?xml version="1.0" encoding="UTF-8"?>',
    encode_utf8( $doc->as_XML );

That seems to work ok.  But then I ended up with a file that had a CDATA
section (which happened to hold a snippet of HTML).  That's fine, but
$doc->as_XML then wrote the CDATA contents out as entity-encoded text
instead of as a CDATA section.

With this source file:

$ cat test.xml
<?xml version="1.0" encoding="UTF-8"?>
        <content><![CDATA[<strong>this is&nbsp;strong</strong>]]></content>

When I run it through the script I get:

$ cat new.xml
<?xml version="1.0" encoding="UTF-8"?>

And if I then run *that* file back through the script I get:

undefined entity at line 3, column 40, byte 101 at
/usr/lib/perl5/XML/Parser.pm line 187

which is choking at the &nbsp;
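
The parser rejects &nbsp; because XML predefines only five entities (&lt;,
&gt;, &amp;, &apos; and &quot;); &nbsp; comes from HTML and has to be
declared before use.  Below is a minimal sketch of one way to declare it,
using an internal DTD subset and XML::Parser directly.  The sample document
and the $text accumulator are made up for illustration, and whether
XML::TreeBuilder handles the DOCTYPE the same way is not verified here.

```perl
use strict;
use warnings;
use XML::Parser;

# Illustrative document (not one of the original files): the internal
# DTD subset declares nbsp as U+00A0, so expat can resolve &nbsp;.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE testResults [
  <!ENTITY nbsp "&#160;">
]>
<testResults><result><content>this is&nbsp;strong</content></result></testResults>
XML

# Accumulate all character data; references to entities declared in the
# internal subset arrive already expanded.
my $text = '';
my $parser = XML::Parser->new(
    Handlers => {
        Char => sub { my ( $expat, $string ) = @_; $text .= $string },
    },
);
$parser->parse($xml);

binmode STDOUT, ':encoding(UTF-8)';
print "parsed without error: $text\n";
```

Note that this changes what the text means: the parser now yields a real
U+00A0 character rather than the literal string "&nbsp;" that the CDATA
section originally held.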

My questions are:

1) Is there a better approach to doing this that preserves the CDATA?

2) Is there a way to define the &nbsp; entity?  I tried adding a DTD to
define &nbsp;, but wasn't able to make the parser happy in my attempts.
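
On question 1, one option that may be worth trying is XML::LibXML, which
represents a CDATA section as its own node type and serializes it back out
as CDATA.  The following is a sketch under that assumption rather than a
drop-in replacement; combine_results is a hypothetical helper name, and the
//result XPath assumes <result> elements are not nested inside each other.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: copy every <result> element from the given parsed
# documents under a new <testResults> root and return the serialized XML.
sub combine_results {
    my @docs = @_;

    my $out  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
    my $root = $out->createElement('testResults');
    $out->setDocumentElement($root);

    for my $in (@docs) {
        for my $result ( $in->findnodes('//result') ) {
            # importNode makes a deep copy owned by $out, so CDATA
            # children are copied as CDATA nodes, not as escaped text.
            $root->appendChild( $out->importNode($result) );
        }
    }

    # toString on a document returns bytes in the document's encoding,
    # so no extra encode_utf8 step is needed before printing.
    return $out->toString;
}

# Parse each file named on the command line, then print the merged document.
my @docs = map { XML::LibXML->load_xml( location => $_ ) } @ARGV;
print combine_results(@docs);
```

Because the CDATA section is never turned into escaped text, the &nbsp;
inside it stays literal character data and the output should parse again
without any entity declaration.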

Bill Moseley
moseley at hank.org