I have a collection of XML files, each containing one or more <result> elements. The goal is to have a script that takes one or more files on the command line, gathers all the <result> elements from each file, and combines them into a single XML output file.<div>
<br></div><div>I pulled XML::TreeBuilder out (I find TreeBuilder pretty easy for quick scripts) and did the following. I don't work with XML that much (which I think is a bit lucky), so there may be easier ways to do this.</div>
<div><br></div><div><div><font face="'courier new', monospace">#!/usr/bin/perl</font></div><div><font face="'courier new', monospace">use strict;</font></div>
<div><font face="'courier new', monospace">use warnings;</font></div><div><font face="'courier new', monospace">use XML::TreeBuilder;</font></div><div><font face="'courier new', monospace">use XML::Element;</font></div>
<div><font face="'courier new', monospace">use Encode;</font></div><div><font face="'courier new', monospace"><br></font></div><div><font face="'courier new', monospace"><br>
</font></div><div><font face="'courier new', monospace">my $doc = XML::Element->new( 'testResults' );</font></div><div><font face="'courier new', monospace"><br>
</font></div><div><font face="'courier new', monospace">for my $path ( @ARGV ) {</font></div><div><font face="'courier new', monospace"> my $tree = XML::TreeBuilder->new;</font></div>
<div><font face="'courier new', monospace"> $tree->parse_file( $path );</font></div><div><font face="'courier new', monospace"><br></font></div><div>
<font face="'courier new', monospace"> $doc->push_content( $tree->look_down( '_tag', 'result' ) );</font></div><div><font face="'courier new', monospace"><br>
</font></div><div><font face="'courier new', monospace"> $tree->delete;</font></div><div><font face="'courier new', monospace">}</font></div><div><font class="Apple-style-span" face="'courier new', monospace"><br>
</font></div><div><font face="'courier new', monospace">print join "\n", </font></div>
<div><font face="'courier new', monospace"> '<?xml version="1.0" encoding="UTF-8"?>',</font></div><div><font face="'courier new', monospace"> encode_utf8( $doc->as_XML );</font></div>
<div><br></div><div><br></div><div>That seems to work OK. But then I ended up with a file that had a CDATA section (which happened to hold a snippet of HTML). That's fine, but $doc->as_XML then dropped the CDATA wrapper and wrote the angle brackets out as numeric character references, while leaving the &nbsp; as is.</div>
<div><br></div><div>With this source file:</div><div><br></div><div><div><font face="'courier new', monospace">$ cat test.xml</font></div><div><font face="'courier new', monospace"><?xml version="1.0" encoding="UTF-8"?></font></div>
<div><font face="'courier new', monospace"><testResults></font></div><div><font face="'courier new', monospace"> <result></font></div><div><font face="'courier new', monospace"> <content><![CDATA[<strong>this is&nbsp;strong</strong>]]></content></font></div>
<div><font face="'courier new', monospace"> </result></font></div><div><font face="'courier new', monospace"></testResults></font></div></div>
<div><br></div><div>When I run it through the script I get:</div><div><br></div><div><div><font face="'courier new', monospace">$ cat new.xml</font></div><div><font face="'courier new', monospace"><?xml version="1.0" encoding="UTF-8"?></font></div>
<div><font face="'courier new', monospace"><testResults><result></font></div><div><font face="'courier new', monospace"> <content>&#60;strong&#62;this is&nbsp;strong&#60;/strong&#62;</content></font></div>
<div><font face="'courier new', monospace"> </result></testResults></font></div></div><div><br></div><div>And if I then run *that* file back through the script I get:</div><div><br></div></div><blockquote style="margin:0 0 0 40px;border:none;padding:0px">
<div><div><div>undefined entity at line 3, column 40, byte 101 at /usr/lib/perl5/XML/Parser.pm line 187</div>
</div></div></blockquote><div><div><div><br></div><div>which is choking at the &nbsp;</div><div><br></div><div><br></div><div>My questions are:</div><div><br></div><div>1) Is there a better approach to doing this that preserves the CDATA sections?</div>
<div><br></div><div>2) Is there a way to define the &nbsp; entity? I tried adding a DTD to define &nbsp;, but wasn't able to make the parser happy in my attempts.</div><div><br></div><div><br></div><div><br>
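<div>For question 1, one direction that might be worth trying is XML::LibXML, which keeps CDATA sections as their own node type when parsing and serializing. The following is only a sketch under assumptions: it assumes XML::LibXML is installed, merge_results is a hypothetical helper name, and the '//result' XPath assumes the elements carry no namespace.</div>

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: collect every result element from the given
# files into one new document and return it as serialized XML.
sub merge_results {
    my @paths = @_;

    # Fresh output document with a testResults root element.
    my $out  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
    my $root = $out->createElement('testResults');
    $out->setDocumentElement($root);

    for my $path (@paths) {
        my $in = XML::LibXML->load_xml( location => $path );

        # importNode copies the node (with its children) into the output
        # document; CDATA section nodes stay CDATA sections in the copy.
        for my $result ( $in->findnodes('//result') ) {
            $root->appendChild( $out->importNode($result) );
        }
    }

    # toString on a document includes the XML declaration and returns
    # bytes in the document encoding, so no separate encode step here.
    return $out->toString;
}

print merge_results(@ARGV) if @ARGV;
```

<div>Since the &nbsp; in test.xml sits inside the CDATA section, it should pass through as literal text rather than being parsed as an entity reference.</div><div><br></div>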
-- <br>Bill Moseley<br>
<a href="mailto:moseley@hank.org" target="_blank">moseley@hank.org</a><br>
</div></div></div>