Hi Francisco,<div><br></div><div>This looks nice and clean. Thanks.</div><div><br></div><div>One question: what if you are asked to copy the whole &lt;result&gt; element, but don&#39;t know in advance which child elements might contain a CDATA section? I guess one approach would be to use a SAX parser, or to walk the tree looking for XML_CDATA_SECTION_NODE nodes. I&#39;ll give that a try tomorrow.</div>
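<div>For what it&#39;s worth, here is a minimal sketch of another route, assuming XML::LibXML&#39;s default behaviour of keeping CDATA sections as their own nodes when parsing (that is, the no_cdata parser option is left off). It deep-copies each &lt;result&gt; element into a new document, so the CDATA sections should come across without the script needing to know where they are. The helper names merge_results and count_cdata are made up for illustration and are not part of any module.</div>

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (not from the thread): collect every <result> element
# from the given XML strings under a new <testResults> root and return the
# serialized document as a string.
sub merge_results {
    my @sources = @_;

    my $out  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
    my $root = $out->createElement('testResults');
    $out->setDocumentElement($root);

    for my $xml (@sources) {
        my $doc = XML::LibXML->load_xml( string => $xml );

        for my $result ( $doc->getElementsByTagName('result') ) {
            # A true second argument asks importNode for a deep copy owned
            # by $out; CDATA section nodes in the subtree stay CDATA nodes.
            $root->appendChild( $out->importNode( $result, 1 ) );
        }
    }

    return $out->toString;
}

# Hypothetical helper: count the CDATA sections at or below a node, one way
# to find out whether a subtree contains any.
sub count_cdata {
    my ($node) = @_;
    return scalar grep { $_->nodeType == XML_CDATA_SECTION_NODE }
        $node->findnodes('.//text()');
}
```

<div>To read from files rather than strings, load_xml also accepts location =&gt; $path. The nodeType check in count_cdata is only needed if you want to detect CDATA sections yourself; the copy does not depend on it.</div>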


<div><br></div><div>Thanks,<br><div><br></div><div><div class="gmail_quote">On Fri, Mar 4, 2011 at 8:28 PM, Francisco Obispo <span dir="ltr">&lt;<a href="mailto:fobispo@isc.org">fobispo@isc.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


I&#39;m a big fan of XML; I think it eases the management of data and solves a lot of problems.<br>

<br>

&amp;nbsp; is an HTML entity. HTML collapses runs of whitespace, so &amp;nbsp; is needed there to force extra spaces. In XML, whitespace in the data is preserved, so there&#39;s no need for it.<br>

<br>

I wrote a simple script using XML::LibXML and XML::Writer to achieve what you wanted:<br>

<br>

#!/usr/bin/env perl<br>

use strict;<br>

use warnings;<br>

use XML::LibXML;<br>

use XML::Writer;<br>

<br>

my $parser = XML::LibXML-&gt;new;<br>

<br>

my $writer = XML::Writer-&gt;new();<br>

<br>

$writer-&gt;xmlDecl(&#39;UTF-8&#39;);<br>

$writer-&gt;startTag(&#39;testResults&#39;);<br>

<br>

foreach my $file (@ARGV) {<br>

    my $doc = $parser-&gt;parse_file($file);<br>

    my (@nodes) = $doc-&gt;getElementsByTagName(q{content});<br>

<br>

    foreach my $node (@nodes) {<br>

        $writer-&gt;startTag(&#39;result&#39;);<br>

        $writer-&gt;startTag(&#39;content&#39;);<br>

        $writer-&gt;cdata( $node-&gt;textContent );<br>

        $writer-&gt;endTag(&#39;content&#39;);<br>

        $writer-&gt;endTag(&#39;result&#39;);<br>

    }<br>

}<br>

<br>

$writer-&gt;endTag(&#39;testResults&#39;);<br>

$writer-&gt;end;<br>

<div><div></div><div class="h5"><br>

<br>


On Mar 4, 2011, at 5:45 PM, Bill Moseley wrote:<br>

<br>

&gt; Ya, best to avoid XML when possible. So, I just hacked this in to deal with the entity encoding:<br>

&gt;<br>

&gt; {<br>

&gt;     no warnings &#39;redefine&#39;;<br>

&gt;     sub HTML::Element::_xml_escape {<br>

&gt;         for ( @_ ) {<br>

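&gt;             next unless length &amp;&amp; /&lt;/;  # skip this string, but keep processing the rest<br>

&gt;             s{]]&gt;}{]]]]&gt;&lt;![CDATA[&gt;}g;  # split any &quot;]]&gt;&quot; across two CDATA sections<br>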

&gt;             $_ = &quot;&lt;![CDATA[$_]]&gt;&quot;;<br>

&gt;         }<br>

&gt;     }<br>

&gt; }<br>

&gt;<br>

&gt; Likely only good for this one-off, but I was curious what the &quot;right&quot; way to handle this would be.<br>

&gt;<br>

&gt; On Thu, Mar 3, 2011 at 3:41 PM, Bill Moseley &lt;<a href="mailto:moseley@hank.org">moseley@hank.org</a>&gt; wrote:<br>

&gt; I have a collection of XML files that have one or more &lt;result&gt; elements in each file. The goal is to have a script that takes one or more files on the command line, gathers up all the &lt;result&gt; elements from each file, and combines them into a single XML output file.<br>


&gt;<br>

&gt; I pulled XML::TreeBuilder out (I find TreeBuilder pretty easy for quick scripts) and did the following. I don&#39;t seem to work with XML that much (which I think is a bit lucky), so there may be easier ways to do this.<br>


&gt;<br>

&gt; #!/usr/bin/perl<br>

&gt; use strict;<br>

&gt; use warnings;<br>

&gt; use XML::TreeBuilder;<br>

&gt; use XML::Element;<br>

&gt; use Encode;<br>

&gt;<br>

&gt;<br>

&gt; my $doc = XML::Element-&gt;new( &#39;testResults&#39; );<br>

&gt;<br>

&gt; for my $path ( @ARGV ) {<br>

&gt;     my $tree = XML::TreeBuilder-&gt;new;<br>

&gt;     $tree-&gt;parse_file( $path );<br>

&gt;<br>

&gt;     $doc-&gt;push_content( $tree-&gt;look_down( &#39;_tag&#39;, &#39;result&#39; ) );<br>

&gt;<br>

&gt;     $tree-&gt;delete;<br>

&gt; }<br>

&gt;<br>

&gt; print join &quot;\n&quot;,<br>

&gt;     &#39;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;&#39;,<br>

&gt;     encode_utf8( $doc-&gt;as_XML );<br>

&gt;<br>

&gt;<br>

&gt; That seems to work ok.  But, then I ended up with a file that had a CDATA section (which happened to hold a snippet of HTML).  That&#39;s fine, but $doc-&gt;as_XML then encoded the entities.<br>

&gt;<br>

&gt; With this source file:<br>

&gt;<br>

&gt; $ cat test.xml<br>

&gt; &lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;<br>

&gt; &lt;testResults&gt;<br>

&gt;     &lt;result&gt;<br>

&gt;         &lt;content&gt;&lt;![CDATA[&lt;strong&gt;this is&amp;nbsp;strong&lt;/strong&gt;]]&gt;&lt;/content&gt;<br>

&gt;     &lt;/result&gt;<br>

&gt; &lt;/testResults&gt;<br>

&gt;<br>

&gt; When I run it through the script I get:<br>

&gt;<br>

&gt; $ cat new.xml<br>

&gt; &lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;<br>

&gt; &lt;testResults&gt;&lt;result&gt;<br>

&gt;         &lt;content&gt;&amp;#60;strong&amp;#62;this is&amp;nbsp;strong&amp;#60;/strong&amp;#62;&lt;/content&gt;<br>

&gt;     &lt;/result&gt;&lt;/testResults&gt;<br>

&gt;<br>

&gt; And if I then run *that* file back through the script I get:<br>

&gt;<br>

&gt; undefined entity at line 3, column 40, byte 101 at /usr/lib/perl5/XML/Parser.pm line 187<br>

&gt;<br>

&gt; which is choking at the &amp;nbsp;<br>

&gt;<br>

&gt;<br>

&gt; My questions are:<br>

&gt;<br>

&gt; 1) Is there a better approach to doing this that preserves the CDATA sections?<br>

&gt;<br>

&gt; 2) Is there a way to define the &amp;nbsp; entity? I tried adding a DTD to define &amp;nbsp;, but wasn&#39;t able to make the parser happy in my attempts.<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt; --<br>

&gt; Bill Moseley<br>

&gt; <a href="mailto:moseley@hank.org">moseley@hank.org</a><br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt; --<br>

&gt; Bill Moseley<br>

&gt; <a href="mailto:moseley@hank.org">moseley@hank.org</a><br>

</div></div>&gt; _______________________________________________<br>

&gt; SanFrancisco-pm mailing list<br>

&gt; <a href="mailto:SanFrancisco-pm@pm.org">SanFrancisco-pm@pm.org</a><br>

&gt; <a href="http://mail.pm.org/mailman/listinfo/sanfrancisco-pm" target="_blank">http://mail.pm.org/mailman/listinfo/sanfrancisco-pm</a><br>

<font color="#888888"><br>

Francisco Obispo<br>

Hosted@ Programme Manager<br>

email: <a href="mailto:fobispo@isc.org">fobispo@isc.org</a><br>

Phone: <a href="tel:%2B1%20650%20423%201374">+1 650 423 1374</a> || INOC-DBA *3557* NOC<br>

Key fingerprint = 532F 84EB 06B4 3806 D5FA  09C6 463E 614E B38D B1BE<br>

<br>

<br>

<br>

<br>

</font></blockquote></div><br><br clear="all"><br>-- <br>Bill Moseley<br><a href="mailto:moseley@hank.org" target="_blank">moseley@hank.org</a><br>

</div></div>