[Purdue-pm] XML + Perl

Bradley Andersen bradley.d.andersen at gmail.com
Thu Feb 2 13:23:31 PST 2012


The Perl Monks link led me to xmllint:
    xmllint -recover bak.179-TransferredCases.xml --output 179T.xml

-recover tells xmllint to keep whatever it can still parse and throw
away the rest, and, well, --output seems self-explanatory.

But then look at this:
    bradley@pvnp:~/x2s/xml2sql$ wc -l ../179T.xml
    2152938 ../179T.xml

So there are 2152938 valid lines, right?

Not so fast:
    bradley@pvnp:~/x2s/xml2sql$ xml2mssql.pl < ../179T.xml > 179T.sql

    not well-formed (invalid token) at line 2927058, column 59, byte 115008246 at /usr/local/lib/perl5/XML/Parser.pm

WHAT??!!

So xml2mssql found an invalid token on line 2927058 of a 2152938-line file ...
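
One way to dig into the mismatch (a sketch, not something from the
thread): wc -l counts only newline (\n) bytes, while the parser keeps
its own line count, so the two numbers need not agree. The error also
reports a byte offset, 115008246, which does not depend on how lines
are counted. The Python sketch below reads up to a given byte offset,
counts the \n and bare \r bytes before it, and returns the bytes around
it. The function name inspect_offset and the sample data are
hypothetical, and the idea that stray carriage returns explain the
extra lines is only an assumption to test, not a confirmed cause.

```python
import os
import tempfile


def inspect_offset(path, offset, context=40):
    """Count line terminators before `offset` and return the bytes around it.

    `offset` is a byte offset such as the one in the parser's error message.
    """
    with open(path, "rb") as fh:
        head = fh.read(offset)                 # everything before the offset
        fh.seek(max(0, offset - context))
        window = fh.read(2 * context)          # bytes on both sides of it
    lf = head.count(b"\n")                     # newline bytes, as wc -l counts them
    crlf = head.count(b"\r\n")
    bare_cr = head.count(b"\r") - crlf         # CRs not followed by LF in the bytes read
    return {"lf": lf, "bare_cr": bare_cr, "window": window}


if __name__ == "__main__":
    # Hypothetical sample data standing in for the real file.
    fd, sample = tempfile.mkstemp(suffix=".xml")
    os.write(fd, b"<a>\n<b>one\rtwo</b>\r\n<c>\x01</c>\n</a>\n")
    os.close(fd)
    try:
        report = inspect_offset(sample, 24)
        print("newlines before offset:", report["lf"])
        print("bare carriage returns: ", report["bare_cr"])
        print("bytes near offset:     ", report["window"])
    finally:
        os.remove(sample)
```

Against the real file this would be called as
inspect_offset("../179T.xml", 115008246). The returned window should
show the bytes the parser choked on, and comparing the two counts with
the reported line number is a quick check of the carriage-return guess.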

On Thu, Feb 2, 2012 at 12:45 PM, Joe Kline <gizmo at purdue.edu> wrote:
> Well, my first thought was one of the "tree" parsers that walk the doc
> tree so they don't have to parse the whole thing.
>
> Here's a Perl Monks node for dealing with invalid XML characters that
> might point towards some ideas:
>
> http://www.perlmonks.org/?node_id=752527
>
> There's always Stack Exchange to see whether this has been asked
> before, and if not you should get some suggestions rather quickly.
>
> Maybe XML::SAX?
>
> joe

