[sf-perl] XML Parsing question

Bill Moseley moseley at hank.org
Fri Mar 4 21:44:57 PST 2011


Hi Francisco,

This looks nice and clean.  Thanks.

One question I have: what if you are asked to copy in the whole <result>
element but don't know in advance whether some child element contains a CDATA
section?  I guess one approach would be to use a SAX parser and watch for the
CDATA events, or to walk the parsed tree and look for nodes of type
XML_CDATA_SECTION_NODE.  I'll give that a try tomorrow.
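
For the archive, here is a minimal sketch of the tree-walking idea, assuming
XML::LibXML with its default parser options (which keep CDATA sections as
separate nodes).  The helper names merge_results and count_cdata are made up
for illustration; importNode, childNodes, nodeType and the
XML_CDATA_SECTION_NODE constant are XML::LibXML's own.  It copies each
<result> subtree as-is, so there is no need to know which children hold CDATA.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML qw(:libxml);    # :libxml exports the XML_*_NODE constants

# Hypothetical helper: copy every <result> element from the given parsed
# documents into a new <testResults> document and return that document.
sub merge_results {
    my @docs = @_;

    my $out  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
    my $root = $out->createElement('testResults');
    $out->setDocumentElement($root);

    for my $doc (@docs) {
        for my $result ( $doc->getElementsByTagName('result') ) {
            # importNode makes a deep copy owned by $out; CDATA section
            # nodes inside the subtree are copied as CDATA section nodes.
            $root->appendChild( $out->importNode($result) );
        }
    }
    return $out;
}

# Hypothetical helper: count the CDATA section nodes at or below a node.
sub count_cdata {
    my ($node) = @_;
    my $count = $node->nodeType == XML_CDATA_SECTION_NODE ? 1 : 0;
    $count += count_cdata($_) for $node->childNodes;
    return $count;
}

# Run as a script: merge the files named on the command line.
unless (caller) {
    my @docs   = map { XML::LibXML->load_xml( location => $_ ) } @ARGV;
    my $merged = merge_results(@docs);
    warn count_cdata( $merged->documentElement ), " CDATA section(s) copied\n";
    binmode STDOUT;             # toString returns already-encoded bytes
    print $merged->toString;
}
```

Usage would be something like "perl merge.pl a.xml b.xml > new.xml", where
merge.pl is whatever you save the sketch as.  The count is only there to show
how the node-type check works; the copy itself does not depend on it.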

Thanks,

On Fri, Mar 4, 2011 at 8:28 PM, Francisco Obispo <fobispo at isc.org> wrote:

> I'm a big fan of XML, I think it eases the management of data, and solves a
> lot of problems.
>
> &nbsp; is an HTML entity; XML does not predefine it. HTML collapses runs of
> whitespace, so &nbsp; is needed there to force extra spaces. In XML,
> whitespace in text content is preserved, so there's no need for it.
>
> I wrote a simple script using XML::LibXML and XML::Writer to achieve what
> you wanted:
>
> #!/usr/bin/env perl
> use strict;
> use warnings;
> use XML::LibXML;
> use XML::Writer;
>
> my $parser = XML::LibXML->new;
>
> # Writes to STDOUT by default; ENCODING makes the writer emit UTF-8 bytes.
> my $writer = XML::Writer->new( ENCODING => 'utf-8' );
>
> $writer->xmlDecl('UTF-8');
> $writer->startTag('testResults');
>
> foreach my $file (@ARGV) {
>    my $doc   = $parser->parse_file($file);
>    my @nodes = $doc->getElementsByTagName('content');
>
>    foreach my $node (@nodes) {
>        $writer->startTag('result');
>        $writer->startTag('content');
>        # Re-wrap the element's text in a CDATA section.
>        $writer->cdata( $node->textContent );
>        $writer->endTag('content');
>        $writer->endTag('result');
>    }
> }
>
> $writer->endTag('testResults');
> $writer->end;
>
> On Mar 4, 2011, at 5:45 PM, Bill Moseley wrote:
>
> > Ya, best to avoid XML when possible.  So, I just hacked in this to deal
> > with the entity encoding:
> >
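> > {
> >     no warnings 'redefine';
> >     # Override HTML::Element's escaper: wrap markup-bearing text in CDATA.
> >     sub HTML::Element::_xml_escape {
> >         for ( @_ ) {
> >             # Skip (don't return) so later arguments are still handled.
> >             next unless length && /[<&]/;
> >             # A literal "]]>" must be split across two CDATA sections.
> >             s{]]>}{]]]]><![CDATA[>}g;
> >             $_ = "<![CDATA[$_]]>";
> >         }
> >     }
> > }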
> >
> > Likely only good for this one-off, but was curious how the "right" way to
> > handle this would be.
> >
> > On Thu, Mar 3, 2011 at 3:41 PM, Bill Moseley <moseley at hank.org> wrote:
> > I have a collection of XML files that have one or more <result> elements
> > in each file.  The goal is to have a script where I can pass one or more
> > files on the command line, which will gather up all the <result> elements
> > from each file and combine them into a single XML output file.
> >
> > I pulled XML::TreeBuilder out (I find TreeBuilder pretty easy for quick
> > scripts) and did the following.  I seem to not work with XML that much
> > (which I think is a bit lucky), so there may be easier ways to do this.
> >
> > #!/usr/bin/perl
> > use strict;
> > use warnings;
> > use XML::TreeBuilder;
> > use XML::Element;
> > use Encode;
> >
> >
> > my $doc = XML::Element->new( 'testResults' );
> >
> > for my $path ( @ARGV ) {
> >     my $tree = XML::TreeBuilder->new;
> >     $tree->parse_file( $path );
> >
> >     $doc->push_content( $tree->look_down( '_tag', 'result' ) );
> >
> >     $tree->delete;
> > }
> >
> > print join "\n",
> >     '<?xml version="1.0" encoding="UTF-8"?>',
> >     encode_utf8( $doc->as_XML );
> >
> >
> > That seems to work ok.  But then I ended up with a file that had a CDATA
> > section (which happened to hold a snippet of HTML).  That's fine, but
> > $doc->as_XML then dropped the CDATA wrapper and escaped the markup as
> > character references (&#60; and &#62;).
> >
> > With this source file:
> >
> > $ cat test.xml
> > <?xml version="1.0" encoding="UTF-8"?>
> > <testResults>
> >     <result>
> >         <content><![CDATA[<strong>this is&nbsp;strong</strong>]]></content>
> >     </result>
> > </testResults>
> >
> > When I run it through the script I get:
> >
> > $ cat new.xml
> > <?xml version="1.0" encoding="UTF-8"?>
> > <testResults><result>
> >         <content>&#60;strong&#62;this is&nbsp;strong&#60;/strong&#62;</content>
> >     </result></testResults>
> >
> > And if I then run *that* file back through the script I get:
> >
> > undefined entity at line 3, column 40, byte 101 at
> > /usr/lib/perl5/XML/Parser.pm line 187
> >
> > which is choking at the &nbsp;
> >
> >
> > My questions are:
> >
> > 1) Is there a better approach to doing this that preserves the CDATA
> > sections?
> >
> > 2) Is there a way to define the &nbsp; entity?  I tried adding a DTD to
> > define &nbsp;, but wasn't able to make the parser happy in my attempts.
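
On question 2, one way this can work is an internal DTD subset in the source
file that declares nbsp as the no-break space character (U+00A0).  The sketch
below is only checked against XML::LibXML, not XML::TreeBuilder or
XML::Parser, so treat it as an assumption that the same DOCTYPE would satisfy
the parser used in the original script.  The sample document is the new.xml
from this thread with the DOCTYPE added.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# An internal DTD subset that declares nbsp as U+00A0 (no-break space).
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE testResults [
  <!ENTITY nbsp "&#160;">
]>
<testResults>
    <result>
        <content>&#60;strong&#62;this is&nbsp;strong&#60;/strong&#62;</content>
    </result>
</testResults>
XML

# expand_entities => 1 asks the parser to substitute the declared value
# for &nbsp; while it builds the tree.
my $doc = XML::LibXML->load_xml( string => $xml, expand_entities => 1 );

my ($content) = $doc->getElementsByTagName('content');
my $text = $content->textContent;    # decoded character string

binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

Without the DOCTYPE the same document fails to parse, since XML itself only
predefines lt, gt, amp, apos and quot.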
> >
> >
> >
> > --
> > Bill Moseley
> > moseley at hank.org
> >
> >
> >
> > --
> > Bill Moseley
> > moseley at hank.org
>
> Francisco Obispo
> Hosted@ Programme Manager
> email: fobispo at isc.org
> Phone: +1 650 423 1374 || INOC-DBA *3557* NOC
> Key fingerprint = 532F 84EB 06B4 3806 D5FA  09C6 463E 614E B38D B1BE
>
>
>
>
>


-- 
Bill Moseley
moseley at hank.org