[sf-perl] XML Parsing question
Bill Moseley
moseley at hank.org
Fri Mar 4 21:44:57 PST 2011
Hi Francisco,
This looks nice and clean. Thanks.
One question I have: what if you are asked to copy the whole <result> element
but don't know which child elements might have a CDATA section? I guess
one approach would be to use a SAX parser and look
for XML_CDATA_SECTION_NODE nodes. I'll give that a try tomorrow.
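
A minimal sketch of that idea, assuming XML::LibXML's DOM interface instead
of SAX (XML_CDATA_SECTION_NODE is a DOM node-type constant that XML::LibXML
exports). With the parser's default options, CDATA sections are kept as their
own nodes, so copying each <result> with importNode should carry them into
the output untouched. The helper names merge_results and count_cdata are made
up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML qw(:libxml);    # :libxml exports the node-type constants

# Hypothetical helper: copy every <result> element from the given parsed
# documents into a new <testResults> document.
sub merge_results {
    my @docs = @_;

    my $out  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
    my $root = $out->createElement('testResults');
    $out->setDocumentElement($root);

    for my $doc (@docs) {
        for my $result ( $doc->getElementsByTagName('result') ) {
            # importNode makes a deep copy owned by $out. CDATA children
            # remain CDATA nodes (assumes the no_cdata parser option is off).
            $root->appendChild( $out->importNode($result) );
        }
    }
    return $out;
}

# Hypothetical helper: count CDATA section nodes at or below $node.
sub count_cdata {
    my ($node) = @_;
    my $n = $node->nodeType == XML_CDATA_SECTION_NODE ? 1 : 0;
    $n += count_cdata($_) for $node->childNodes;
    return $n;
}

# Only do file work when paths were given on the command line.
if (@ARGV) {
    my @docs = map { XML::LibXML->load_xml( location => $_ ) } @ARGV;
    print merge_results(@docs)->toString(1);
}
```

Usage would be something like "perl merge.pl a.xml b.xml > new.xml" (merge.pl
is a hypothetical filename). Since the &nbsp; stays inside the CDATA section,
the output should parse again without the undefined-entity error.
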
Thanks,
On Fri, Mar 4, 2011 at 8:28 PM, Francisco Obispo <fobispo at isc.org> wrote:
> I'm a big fan of XML; I think it eases the management of data and solves a
> lot of problems.
>
> &nbsp; is an HTML entity. HTML ignores additional spaces, so the entity is
> required to add more. In XML, data is preserved, so there's no need for it.
>
> I wrote a simple script using XML::LibXML and XML::Writer to achieve what
> you wanted:
>
> #!/usr/bin/env perl
> use strict;
> use XML::LibXML;
> use XML::Writer;
>
> my $parser = XML::LibXML->new;
>
> my $writer = XML::Writer->new();
>
> $writer->xmlDecl('UTF-8');
> $writer->startTag('testResults');
>
> foreach my $file (@ARGV) {
>     my $doc = $parser->parse_file($file);
>     my (@nodes) = $doc->getElementsByTagName(q{content});
>
>     foreach my $node (@nodes) {
>         $writer->startTag('result');
>         $writer->startTag('content');
>         $writer->cdata( $node->textContent );
>         $writer->endTag();
>         $writer->endTag;
>     }
> }
>
> $writer->endTag;
> $writer->end;
>
> On Mar 4, 2011, at 5:45 PM, Bill Moseley wrote:
>
> > Ya, best to avoid XML when possible. So, I just hacked in this to deal
> with the entity encoding:
> >
> > {
> >     no warnings 'redefine';
> >     sub HTML::Element::_xml_escape {
> >         for ( @_ ) {
> >             return unless length && /</;
> >             # Split any "]]>" so it cannot end the CDATA section early.
> >             s{]]>}{]]]]><![CDATA[>}g;
> >             $_ = "<![CDATA[$_]]>";
> >         }
> >     }
> > }
> >
> > Likely only good for this one-off, but I was curious what the "right" way to
> handle this would be.
> >
> > On Thu, Mar 3, 2011 at 3:41 PM, Bill Moseley <moseley at hank.org> wrote:
> > I have a collection of XML files that have one or more <result> elements
> in each file. The goal is to have a script to which I can pass one or more
> files on the command line, and which will gather up all the <result> elements
> from each file and combine them into a single XML output file.
> >
> > I pulled XML::TreeBuilder out (I find TreeBuilder pretty easy for quick
> scripts) and did the following. I seem to not work with XML that much
> (which I think is a bit lucky), so there may be easier ways to do this.
> >
> > #!/usr/bin/perl
> > use strict;
> > use warnings;
> > use XML::TreeBuilder;
> > use XML::Element;
> > use Encode;
> >
> >
> > my $doc = XML::Element->new( 'testResults' );
> >
> > for my $path ( @ARGV ) {
> >     my $tree = XML::TreeBuilder->new;
> >     $tree->parse_file( $path );
> >
> >     $doc->push_content( $tree->look_down( '_tag', 'result' ) );
> >
> >     $tree->delete;
> > }
> >
> > print join "\n",
> >     '<?xml version="1.0" encoding="UTF-8"?>',
> >     encode_utf8( $doc->as_XML );
> >
> >
> > That seems to work ok. But, then I ended up with a file that had a CDATA
> section (which happened to hold a snippet of HTML). That's fine, but
> $doc->as_XML then encoded the entities.
> >
> > With this source file:
> >
> > $ cat test.xml
> > <?xml version="1.0" encoding="UTF-8"?>
> > <testResults>
> > <result>
> > <content><![CDATA[<strong>this&nbsp;is strong</strong>]]></content>
> > </result>
> > </testResults>
> >
> > When I run it through the script I get:
> >
> > $ cat new.xml
> > <?xml version="1.0" encoding="UTF-8"?>
> > <testResults><result>
> > <content>&lt;strong&gt;this&nbsp;is strong&lt;/strong&gt;</content>
> > </result></testResults>
> >
> > And if I then run *that* file back through the script I get:
> >
> > undefined entity at line 3, column 40, byte 101 at
> /usr/lib/perl5/XML/Parser.pm line 187
> >
> > which is choking at the &nbsp;
> >
> >
> > My questions are:
> >
> > 1) Is there a better approach to doing this that preserves the CDATA
> sections?
> >
> > 2) Is there a way to define the entity? I tried adding a DTD to
> define the &nbsp; entity, but wasn't able to make the parser happy in my attempts.
> >
> >
> >
> > --
> > Bill Moseley
> > moseley at hank.org
> >
> >
> >
> > --
> > Bill Moseley
> > moseley at hank.org
> > _______________________________________________
> > SanFrancisco-pm mailing list
> > SanFrancisco-pm at pm.org
> > http://mail.pm.org/mailman/listinfo/sanfrancisco-pm
>
> Francisco Obispo
> Hosted@ Programme Manager
> email: fobispo at isc.org
> Phone: +1 650 423 1374 || INOC-DBA *3557* NOC
> Key fingerprint = 532F 84EB 06B4 3806 D5FA 09C6 463E 614E B38D B1BE
>
--
Bill Moseley
moseley at hank.org