[sf-perl] XML Parsing question

Francisco Obispo fobispo at isc.org
Fri Mar 4 20:28:09 PST 2011


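I'm a big fan of XML; I think it eases the management of data and solves a lot of problems.

&nbsp; is an HTML entity. HTML collapses runs of whitespace, so &nbsp; is needed there to force extra spaces. XML preserves character data as written, so there's no need for it. XML also predefines only &lt;, &gt;, &amp;, &apos; and &quot;, which is why your parser reports &nbsp; as undefined.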
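
On your second question, one way to make a parser accept &nbsp; is to declare it in an internal DTD subset. Here is a minimal sketch using XML::LibXML, assuming you are free to add a DOCTYPE to the input; the sample strings and variable names are made up for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# No DTD here, so &nbsp; is an undefined entity as far as XML is concerned.
my $undeclared = '<content>this is&nbsp;strong</content>';

# Same element, with an internal DTD subset mapping &nbsp; to U+00A0.
my $declared = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE content [
  <!ENTITY nbsp "&#160;">
]>
<content>this is&nbsp;strong</content>
XML

# Parsing the undeclared version should die; trap the error with eval.
my $undeclared_ok
    = eval { XML::LibXML->load_xml( string => $undeclared ); 1 } ? 1 : 0;

# The declared version parses, and the entity expands to a no-break space.
my $doc  = XML::LibXML->load_xml( string => $declared );
my $text = $doc->documentElement->textContent;

print "undeclared parses: ", ( $undeclared_ok ? "yes" : "no" ), "\n";
print "declared text has U+00A0: ", ( $text =~ /\x{A0}/ ? "yes" : "no" ), "\n";
```

The declaration only helps for documents that carry it, so every input file that uses &nbsp; outside a CDATA section would need the same DOCTYPE.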

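I wrote a simple script using XML::LibXML and XML::Writer to achieve what you wanted. It reads each file, pulls out the <content> elements, and writes their text back out inside CDATA sections: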

#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;
use XML::Writer;

my $parser = XML::LibXML->new;

# With no OUTPUT argument, XML::Writer prints to STDOUT.
my $writer = XML::Writer->new();

$writer->xmlDecl('UTF-8');
$writer->startTag('testResults');

foreach my $file (@ARGV) {
    my $doc = $parser->parse_file($file);
    my (@nodes) = $doc->getElementsByTagName(q{content});

    # Re-wrap each <content> in its own <result>, keeping the text as CDATA.
    foreach my $node (@nodes) {
        $writer->startTag('result');
        $writer->startTag('content');
        $writer->cdata( $node->textContent );
        $writer->endTag('content');
        $writer->endTag('result');
    }
}

$writer->endTag('testResults');
$writer->end;
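
If you want to keep the whole <result> elements rather than just the <content> text, XML::LibXML can do it on its own, since it keeps CDATA sections as separate nodes by default. A rough sketch, where merge_results() is a helper name I made up for this example:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# merge_results() is a hypothetical helper: it takes already-parsed documents
# and returns a new document holding a copy of every <result> element.
sub merge_results {
    my @docs = @_;

    my $out  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
    my $root = $out->createElement('testResults');
    $out->setDocumentElement($root);

    for my $doc (@docs) {
        for my $result ( $doc->getElementsByTagName('result') ) {
            # importNode makes a deep copy owned by $out; CDATA children
            # come along as CDATA section nodes.
            $root->appendChild( $out->importNode($result) );
        }
    }
    return $out;
}

# Parse each file named on the command line and print the merged document.
my @docs = map { XML::LibXML->load_xml( location => $_ ) } @ARGV;
print merge_results(@docs)->toString;
```

Because the nodes are copied rather than re-serialized as text, the <![CDATA[...]]> wrapper should come out the same way it went in, and the output can be fed back through the script.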







On Mar 4, 2011, at 5:45 PM, Bill Moseley wrote:

> Ya, best to avoid XML when possible.  So I just hacked this in to deal with the entity encoding:
> 
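> {
>     no warnings 'redefine';
>     # Wrap any text containing "<" in a CDATA section instead of escaping it.
>     sub HTML::Element::_xml_escape {
>         for ( @_ ) {
>             next unless length && /</;
>             # "]]>" cannot appear inside CDATA, so split it across two sections.
>             s{]]>}{]]]]><![CDATA[>}g;
>             $_ = "<![CDATA[$_]]>";
>         }
>     }
> }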
> 
> Likely only good for this one-off, but was curious how the "right" way to handle this would be.
> 
> On Thu, Mar 3, 2011 at 3:41 PM, Bill Moseley <moseley at hank.org> wrote:
> I have a collection of XML files that have one or more <result> elements in each file.  The goal is to have a script that takes one or more files on the command line, gathers up all the <result> elements from each file, and combines them into a single XML output file.
> 
> I pulled XML::TreeBuilder out (I find TreeBuilder pretty easy for quick scripts) and did the following.  I seem to not work with XML that much (which I think is a bit lucky), so there may be easier ways to do this.
> 
> #!/usr/bin/perl
> use strict;
> use warnings;
> use XML::TreeBuilder;
> use XML::Element;
> use Encode;
> 
> 
> my $doc = XML::Element->new( 'testResults' );
> 
> for my $path ( @ARGV ) {
>     my $tree = XML::TreeBuilder->new;
>     $tree->parse_file( $path );
> 
>     $doc->push_content( $tree->look_down( '_tag', 'result' ) );
> 
>     $tree->delete;
> }
> 
> print join "\n", 
>     '<?xml version="1.0" encoding="UTF-8"?>',
>     encode_utf8( $doc->as_XML );
> 
> 
> That seems to work ok.  But, then I ended up with a file that had a CDATA section (which happened to hold a snippet of HTML).  That's fine, but $doc->as_XML then encoded the entities.
> 
> With this source file:
> 
> $ cat test.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <testResults>
>     <result>
>         <content><![CDATA[<strong>this is&nbsp;strong</strong>]]></content>
>     </result>
> </testResults>
> 
> When I run it through the script I get:
> 
> $ cat new.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <testResults><result>
>         <content>&#60;strong&#62;this is&nbsp;strong&#60;/strong&#62;</content>
>     </result></testResults>
> 
> And if I then run *that* file back through the script I get:
> 
> undefined entity at line 3, column 40, byte 101 at /usr/lib/perl5/XML/Parser.pm line 187
> 
> which is choking at the &nbsp;
> 
> 
> My questions are:
> 
> 1) Is there a better approach to doing this that preserves the CDATA sections?
> 
> 2) Is there a way to define the &nbsp; entity?  I tried adding a DTD to define &nbsp;, but wasn't able to make the parser happy in my attempts.
> 
> 
> 
> -- 
> Bill Moseley
> moseley at hank.org

Francisco Obispo 
Hosted@ Programme Manager
email: fobispo at isc.org
Phone: +1 650 423 1374 || INOC-DBA *3557* NOC
Key fingerprint = 532F 84EB 06B4 3806 D5FA  09C6 463E 614E B38D B1BE





