[sf-perl] XML Parsing question
Bill Moseley
moseley at hank.org
Fri Mar 4 17:45:02 PST 2011
Ya, best to avoid XML when possible. So I just hacked this in to deal with
the entity encoding:
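{
    no warnings 'redefine';

    # Override HTML::Element's escaper: emit text containing markup
    # characters as a CDATA section instead of entity-encoding it.
    sub HTML::Element::_xml_escape {
        for ( @_ ) {
            # "next", not "return", so the remaining arguments are still processed.
            next unless length && /[<&]/;

            # "]]>" cannot appear inside CDATA, so split it across two sections.
            s{]]>}{]]]]><![CDATA[>}g;
            $_ = "<![CDATA[$_]]>";
        }
    }
}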
Likely only good for this one-off, but I was curious what the "right" way to
handle this would be.
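
For reuse outside the override, the same idea can be sketched as a standalone helper. This is only a minimal sketch: `cdata_wrap` is a hypothetical name, not part of HTML::Element or XML::TreeBuilder, and it assumes the text should always be emitted as CDATA. The one subtle point is that a literal `]]>` in the text has to be split across two adjacent CDATA sections, since it would otherwise close the section early.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): wrap text in a CDATA section.
# A literal "]]>" would terminate the section, so it is split so that "]]"
# ends one section and ">" starts the next.
sub cdata_wrap {
    my ($text) = @_;
    $text =~ s/]]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $text . ']]>';
}

print cdata_wrap('<strong>this is strong</strong>'), "\n";
print cdata_wrap('a ]]> b'), "\n";
```

A parser reading the second line back sees two adjacent CDATA sections whose text concatenates to the original `a ]]> b`.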
On Thu, Mar 3, 2011 at 3:41 PM, Bill Moseley <moseley at hank.org> wrote:
> I have a collection of XML files that have one or more <result> elements in
> each file. The goal is to have a script where I can pass one or more files on
> the command line which will gather up all the <result> elements from each
> file and combine into a single XML output file.
>
> I pulled XML::TreeBuilder out (I find TreeBuilder pretty easy for quick
> scripts) and did the following. I seem to not work with XML that much
> (which I think is a bit lucky), so there may be easier ways to do this.
>
> #!/usr/bin/perl
> use strict;
> use warnings;
> use XML::TreeBuilder;
> use XML::Element;
> use Encode;
>
>
> my $doc = XML::Element->new( 'testResults' );
>
> for my $path ( @ARGV ) {
>     my $tree = XML::TreeBuilder->new;
>     $tree->parse_file( $path );
>
>     $doc->push_content( $tree->look_down( '_tag', 'result' ) );
>
>     $tree->delete;
> }
>
> print join "\n",
>     '<?xml version="1.0" encoding="UTF-8"?>',
>     encode_utf8( $doc->as_XML );
>
>
> That seems to work ok. But, then I ended up with a file that had a CDATA
> section (which happened to hold a snippet of HTML). That's fine, but
> $doc->as_XML then encoded the entities.
>
> With this source file:
>
> $ cat test.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <testResults>
> <result>
> <content><![CDATA[<strong>this&nbsp;is strong</strong>]]></content>
> </result>
> </testResults>
>
> When I run it through the script I get:
>
> $ cat new.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <testResults><result>
> <content>&lt;strong&gt;this&nbsp;is strong&lt;/strong&gt;</content>
> </result></testResults>
>
> And if I then run *that* file back through the script I get:
>
> undefined entity at line 3, column 40, byte 101 at
> /usr/lib/perl5/XML/Parser.pm line 187
>
>
> which is choking at the &nbsp; entity.
>
>
> My questions are:
>
> 1) Is there a better approach to doing this that preserves the CDATA
> sections?
>
> 2) Is there a way to define the &nbsp; entity? I tried adding a DTD to
> define &nbsp;, but wasn't able to make the parser happy in my attempts.
>
>
>
> --
> Bill Moseley
> moseley at hank.org
>
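
On question 2 in the quoted message, one workaround that avoids a DTD is to rewrite HTML-only named entities as numeric character references before parsing, since numeric references are always valid XML. The sketch below assumes the undefined entity is an HTML name such as `&nbsp;`. The helper name `numeric_entities` and the `%numeric_for` table are hypothetical and list only two names. It also rewrites matches inside CDATA sections, where the text is literal, so it only suits input that has already been entity-encoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed mapping: only the names listed here are rewritten; extend as needed.
my %numeric_for = ( nbsp => 160, copy => 169 );

# Hypothetical helper: turn known HTML-only named entities into numeric
# character references. Unknown names, and XML's predefined entities such
# as &lt; and &amp;, are left untouched.
sub numeric_entities {
    my ($xml) = @_;
    $xml =~ s{&(\w+);}{
        exists $numeric_for{$1} ? '&#' . $numeric_for{$1} . ';' : "&$1;"
    }ge;
    return $xml;
}

print numeric_entities('&lt;strong&gt;this&nbsp;is strong&lt;/strong&gt;'), "\n";
```

The rewritten string could then be handed to the parser in place of the file on disk.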
--
Bill Moseley
moseley at hank.org