Hi Francisco,<div><br></div><div>This looks nice and clean. Thanks.</div><div><br></div><div>One question I have: what if you are asked to copy the whole <result> element but don't know which child elements might contain a CDATA section? I guess one approach would be to use a SAX parser and watch for CDATA sections (XML_CDATA_SECTION_NODE nodes, in DOM terms). I'll give that a try tomorrow.</div>
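<div><br></div><div>A minimal sketch of the "copy the whole <result> element" case, assuming XML::LibXML's DOM interface rather than SAX. The helper name merge_results is hypothetical. It relies on importNode making a deep copy, so any CDATA section nodes below <result> should come along without the script having to know where they are:</div>

```perl
#!/usr/bin/env perl
# Sketch only: gather <result> elements from several parsed documents
# into one <testResults> document, copying each subtree as-is.
use strict;
use warnings;
use XML::LibXML;

# merge_results() is a hypothetical helper, not part of XML::LibXML.
# It takes already-parsed XML::LibXML::Document objects.
sub merge_results {
    my (@docs) = @_;

    my $out  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
    my $root = $out->createElement('testResults');
    $out->setDocumentElement($root);

    for my $doc (@docs) {
        for my $result ( $doc->getElementsByTagName('result') ) {
            # importNode returns a deep copy owned by $out; the source
            # document is left untouched, and CDATA section nodes in
            # the subtree are copied as CDATA sections.
            $root->appendChild( $out->importNode($result) );
        }
    }
    return $out;
}

# Command-line use: pass one or more XML files, merged XML goes to STDOUT.
if (@ARGV) {
    my @docs = map { XML::LibXML->load_xml( location => $_ ) } @ARGV;
    print merge_results(@docs)->toString;
}
```

<div>Because the CDATA section is copied as a node rather than re-serialized as text, the &nbsp; inside it is never entity-encoded, so the merged output should parse again on a second pass.</div>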
<div><br></div><div>Thanks,<br><div><br></div><div><div class="gmail_quote">On Fri, Mar 4, 2011 at 8:28 PM, Francisco Obispo <span dir="ltr"><<a href="mailto:fobispo@isc.org">fobispo@isc.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
I'm a big fan of XML; I think it eases the management of data and solves a lot of problems.<br>
<br>
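&nbsp; is an HTML entity, not an XML one (XML predefines only five entities: lt, gt, amp, apos and quot). HTML collapses runs of whitespace, so &nbsp; is needed there to force extra spaces. In XML, character data is preserved as written, so there's no need for it.<br>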
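<br>
If a file does have to carry &nbsp; outside a CDATA section, one option is to declare the entity in an internal DTD subset. This is a minimal sketch, assuming XML::LibXML with its default entity expansion; the sample document and the helper name content_text are made up for illustration:<br>
<br>

```perl
#!/usr/bin/env perl
# Sketch only: declare nbsp in an internal DTD subset so the parser
# can resolve it.
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document. The internal subset maps nbsp to
# character U+00A0 (no-break space).
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE testResults [
  <!ENTITY nbsp "&#160;">
]>
<testResults><result><content>this is&nbsp;strong</content></result></testResults>
XML

# content_text() is a hypothetical helper: parse a string and return
# the text of the first <content> element.
sub content_text {
    my ($string) = @_;
    my $doc = XML::LibXML->load_xml( string => $string );
    my ($content) = $doc->getElementsByTagName('content');
    return $content->textContent;
}

my $text = content_text($xml);
print $text =~ /\x{A0}/
    ? "nbsp resolved to U+00A0\n"
    : "nbsp was not resolved\n";
```

Writing the numeric character reference &#160; directly in the data avoids the DTD altogether.<br>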
<br>
I wrote a simple script using XML::LibXML and XML::Writer to achieve what you wanted:<br>
<br>
#!/usr/bin/env perl<br>
use strict;<br>
use warnings;<br>
use XML::LibXML;<br>
use XML::Writer;<br>
<br>
my $parser = XML::LibXML->new;<br>
<br>
# With no OUTPUT argument, XML::Writer prints to STDOUT.<br>
my $writer = XML::Writer->new();<br>
<br>
$writer->xmlDecl('UTF-8');<br>
$writer->startTag('testResults');<br>
<br>
foreach my $file (@ARGV) {<br>
    my $doc = $parser->parse_file($file);<br>
    my @nodes = $doc->getElementsByTagName('content');<br>
<br>
    foreach my $node (@nodes) {<br>
        $writer->startTag('result');<br>
        $writer->startTag('content');<br>
        # Re-wrap the element's text in a CDATA section.<br>
        $writer->cdata( $node->textContent );<br>
        $writer->endTag('content');<br>
        $writer->endTag('result');<br>
    }<br>
}<br>
<br>
$writer->endTag('testResults');<br>
$writer->end;<br>
<div><div></div><div class="h5"><br>
On Mar 4, 2011, at 5:45 PM, Bill Moseley wrote:<br>
<br>
> Yeah, best to avoid XML when possible. So I just hacked this in to deal with the entity encoding:<br>
><br>
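> {<br>
> no warnings 'redefine';<br>
> # One-off hack: emit any text containing "<" as a CDATA section.<br>
> sub HTML::Element::_xml_escape {<br>
> for ( @_ ) {<br>
> next unless length && /</;<br>
> # "]]>" cannot appear inside CDATA, so split it across two sections.<br>
> s{]]>}{]]]]><![CDATA[>}g;<br>
> $_ = "<![CDATA[$_]]>";<br>
> }<br>
> }<br>
> }<br>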
><br>
> Likely only good for this one-off, but I was curious what the "right" way to handle this would be.<br>
><br>
> On Thu, Mar 3, 2011 at 3:41 PM, Bill Moseley <<a href="mailto:moseley@hank.org">moseley@hank.org</a>> wrote:<br>
> I have a collection of XML files that have one or more <result> elements in each file. The goal is to have a script that takes one or more files on the command line, gathers up all the <result> elements from each file, and combines them into a single XML output file.<br>
><br>
> I pulled XML::TreeBuilder out (I find TreeBuilder pretty easy for quick scripts) and did the following. I seem to not work with XML that much (which I think is a bit lucky), so there may be easier ways to do this.<br>
><br>
> #!/usr/bin/perl<br>
> use strict;<br>
> use warnings;<br>
> use XML::TreeBuilder;<br>
> use XML::Element;<br>
> use Encode;<br>
><br>
><br>
> my $doc = XML::Element->new( 'testResults' );<br>
><br>
> for my $path ( @ARGV ) {<br>
> my $tree = XML::TreeBuilder->new;<br>
> $tree->parse_file( $path );<br>
><br>
> $doc->push_content( $tree->look_down( '_tag', 'result' ) );<br>
><br>
> $tree->delete;<br>
> }<br>
><br>
> print join "\n",<br>
> '<?xml version="1.0" encoding="UTF-8"?>',<br>
> encode_utf8( $doc->as_XML );<br>
><br>
><br>
> That seems to work ok. But, then I ended up with a file that had a CDATA section (which happened to hold a snippet of HTML). That's fine, but $doc->as_XML then encoded the entities.<br>
><br>
> With this source file:<br>
><br>
> $ cat test.xml<br>
> <?xml version="1.0" encoding="UTF-8"?><br>
> <testResults><br>
> <result><br>
> <content><![CDATA[<strong>this is&nbsp;strong</strong>]]></content><br>
> </result><br>
> </testResults><br>
><br>
> When I run it through the script I get:<br>
><br>
> $ cat new.xml<br>
> <?xml version="1.0" encoding="UTF-8"?><br>
> <testResults><result><br>
> <content>&#60;strong&#62;this is&nbsp;strong&#60;/strong&#62;</content><br>
> </result></testResults><br>
><br>
> And if I then run *that* file back through the script I get:<br>
><br>
> undefined entity at line 3, column 40, byte 101 at /usr/lib/perl5/XML/Parser.pm line 187<br>
><br>
> which is choking at the &nbsp;<br>
><br>
><br>
> My questions are:<br>
><br>
> 1) Is there a better approach to doing this that preserves the CDATA sections?<br>
><br>
> 2) Is there a way to define the &nbsp; entity? I tried adding a DTD to define &nbsp;, but wasn't able to make the parser happy in my attempts.<br>
><br>
><br>
><br>
> --<br>
> Bill Moseley<br>
> <a href="mailto:moseley@hank.org">moseley@hank.org</a><br>
><br>
><br>
><br>
> --<br>
> Bill Moseley<br>
> <a href="mailto:moseley@hank.org">moseley@hank.org</a><br>
</div></div>> _______________________________________________<br>
> SanFrancisco-pm mailing list<br>
> <a href="mailto:SanFrancisco-pm@pm.org">SanFrancisco-pm@pm.org</a><br>
> <a href="http://mail.pm.org/mailman/listinfo/sanfrancisco-pm" target="_blank">http://mail.pm.org/mailman/listinfo/sanfrancisco-pm</a><br>
<font color="#888888"><br>
Francisco Obispo<br>
Hosted@ Programme Manager<br>
email: <a href="mailto:fobispo@isc.org">fobispo@isc.org</a><br>
Phone: <a href="tel:%2B1%20650%20423%201374">+1 650 423 1374</a> || INOC-DBA *3557* NOC<br>
Key fingerprint = 532F 84EB 06B4 3806 D5FA 09C6 463E 614E B38D B1BE<br>
</font></blockquote></div><br><br clear="all"><br>-- <br>Bill Moseley<br><a href="mailto:moseley@hank.org" target="_blank">moseley@hank.org</a><br>
</div></div>