[sf-perl] XML Parsing question
Francisco Obispo
fobispo at isc.org
Fri Mar 4 20:28:09 PST 2011
I'm a big fan of XML; I think it eases the management of data and solves a lot of problems.
&nbsp; is an HTML entity. HTML collapses runs of whitespace, so the entity is needed to add extra spaces. In XML, whitespace in the data is preserved, so there's no need for it.
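As a rough illustration of the entity point (a sketch, not something from the original thread): XML only predefines five entities, so &nbsp; has to be declared before an XML parser will accept it. The sample documents below are made up, and the internal DTD subset is just one possible way to declare the entity.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample: the internal DTD subset declares nbsp as U+00A0.
my $with_dtd = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE testResults [
  <!ENTITY nbsp "&#160;">
]>
<testResults><result><content>a&nbsp;b</content></result></testResults>
XML

# The same document without the declaration.
my $without_dtd = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<testResults><result><content>a&nbsp;b</content></result></testResults>
XML

# With the declaration in place, the entity resolves to a no-break space.
my $doc  = XML::LibXML->load_xml( string => $with_dtd );
my $text = $doc->findvalue('//content');
printf "middle character: U+%04X\n", ord substr( $text, 1, 1 );

# Without it, the parser is expected to die on the undeclared entity.
my $ok = eval { XML::LibXML->load_xml( string => $without_dtd ); 1 };
print "parse without DTD failed: $@" unless $ok;
```

Writing the character as &#160; in the data avoids the need for any declaration at all.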
I wrote a simple script using XML::LibXML and XML::Writer to achieve what you wanted:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;
use XML::Writer;

my $parser = XML::LibXML->new;
my $writer = XML::Writer->new();    # writes to STDOUT by default

$writer->xmlDecl('UTF-8');
$writer->startTag('testResults');

foreach my $file (@ARGV) {
    my $doc = $parser->parse_file($file);

    # Collect every <content> element in this file
    my (@nodes) = $doc->getElementsByTagName(q{content});

    foreach my $node (@nodes) {
        $writer->startTag('result');
        $writer->startTag('content');

        # Emit the text wrapped in a CDATA section
        $writer->cdata( $node->textContent );

        $writer->endTag();    # closes <content>
        $writer->endTag;      # closes <result>
    }
}

$writer->endTag;    # closes <testResults>
$writer->end;
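If the whole <result> elements need to be carried over rather than just the <content> text, a variation using only XML::LibXML might look like the sketch below. The merge_results helper is a hypothetical name, and the sketch assumes the parser's default settings, which keep CDATA sections as separate nodes so they are written back out as CDATA.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: copy every <result> element from the given files
# into a new <testResults> document and return it as serialized XML.
sub merge_results {
    my @files = @_;

    my $out  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
    my $root = $out->createElement('testResults');
    $out->setDocumentElement($root);

    for my $file (@files) {
        my $doc = XML::LibXML->load_xml( location => $file );

        for my $result ( $doc->getElementsByTagName('result') ) {
            # importNode makes a deep copy that belongs to the new document
            $root->appendChild( $out->importNode($result) );
        }
    }

    return $out->toString;
}

print merge_results(@ARGV);
```

It would be run the same way as the script above, with the input files on the command line and the combined document going to STDOUT.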
On Mar 4, 2011, at 5:45 PM, Bill Moseley wrote:
> Ya, best to avoid XML when possible. So, I just hacked in this to deal with the entity encoding:
>
> {
>     no warnings 'redefine';
>     sub HTML::Element::_xml_escape {
>         for ( @_ ) {
>             return unless length && /</;
>             s{]]>}{]]&gt;}g;
>             $_ = "<![CDATA[$_]]>";
>         }
>     }
> }
>
> Likely only good for this one-off, but was curious how the "right" way to handle this would be.
>
> On Thu, Mar 3, 2011 at 3:41 PM, Bill Moseley <moseley at hank.org> wrote:
> I have a collection of XML files that have one or more <result> elements in each file. The goal is to have a script where I can pass one or more files on the command line, which will gather up all the <result> elements from each file and combine them into a single XML output file.
>
> I pulled XML::TreeBuilder out (I find TreeBuilder pretty easy for quick scripts) and did the following. I seem to not work with XML that much (which I think is a bit lucky), so there may be easier ways to do this.
>
> #!/usr/bin/perl
> use strict;
> use warnings;
> use XML::TreeBuilder;
> use XML::Element;
> use Encode;
>
>
> my $doc = XML::Element->new( 'testResults' );
>
> for my $path ( @ARGV ) {
>     my $tree = XML::TreeBuilder->new;
>     $tree->parse_file( $path );
>
>     $doc->push_content( $tree->look_down( '_tag', 'result' ) );
>
>     $tree->delete;
> }
>
> print join "\n",
>     '<?xml version="1.0" encoding="UTF-8"?>',
>     encode_utf8( $doc->as_XML );
>
>
> That seems to work ok. But, then I ended up with a file that had a CDATA section (which happened to hold a snippet of HTML). That's fine, but $doc->as_XML then encoded the entities.
>
> With this source file:
>
> $ cat test.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <testResults>
> <result>
> <content><![CDATA[<strong>this is strong</strong>]]></content>
> </result>
> </testResults>
>
> When I run it through the script I get:
>
> $ cat new.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <testResults><result>
> <content>&lt;strong&gt;this is strong&lt;/strong&gt;</content>
> </result></testResults>
>
> And if I then run *that* file back through the script I get:
>
> undefined entity at line 3, column 40, byte 101 at /usr/lib/perl5/XML/Parser.pm line 187
>
> which is choking at the &nbsp; entity.
>
>
> My questions are:
>
> 1) Is there a better approach to doing this that preserves the CDATA sections?
>
> 2) Is there a way to define the entity? I tried adding a DTD to define &nbsp;, but wasn't able to make the parser happy in my attempts.
>
>
>
> --
> Bill Moseley
> moseley at hank.org
Francisco Obispo
Hosted@ Programme Manager
email: fobispo at isc.org
Phone: +1 650 423 1374 || INOC-DBA *3557* NOC
Key fingerprint = 532F 84EB 06B4 3806 D5FA 09C6 463E 614E B38D B1BE