[Purdue-pm] SOLVED -- Re: XML + Perl

Bradley Andersen bradley.d.andersen at gmail.com
Wed Feb 15 09:17:17 PST 2012


I actually hadn't looked at this again until this morning.

The problem is that there are supposed to be 179 nodes of interest
spread over 16 million+ lines, so I could not easily determine the
structure: editors seemed to die trying to read the file.

I sat down this morning and split the file into 160 parts (files) of
roughly 100,000 lines each, looked at the first and last parts (files
1 and 160), and I think I have the structure now.  I'll check a random
sample of the other chunks to verify.  But I think it can now be
solved easily without XML parsers.
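
For anyone who wants to try the same approach, here is a rough sketch
of the split-and-peek step. It is not the exact commands I ran: the
function name split_and_peek, the chunk prefix, and the peek size are
all made up for illustration, and it only assumes a POSIX split.

```shell
# Sketch only: split_and_peek and its arguments are hypothetical names.
# Splits a big file into fixed-size line chunks, then prints the start of
# the first chunk and the end of the last chunk.
split_and_peek() {
    input=$1     # the large file
    lines=$2     # lines per chunk, e.g. 100000
    prefix=$3    # chunk name prefix, e.g. chunk.
    peek=$4      # how many lines to show from each end

    # -l sets lines per piece; -a 3 gives three-letter suffixes (aaa, aab, ...)
    split -l "$lines" -a 3 "$input" "$prefix"

    # Suffixes sort alphabetically, so first/last by name are first/last by position.
    first=$(ls "$prefix"* | head -n 1)
    last=$(ls "$prefix"* | tail -n 1)

    head -n "$peek" "$first"
    tail -n "$peek" "$last"
}
```

Something like  split_and_peek big.xml 100000 chunk. 40 | less  would
then show both ends of the file without opening it in an editor
(big.xml is a placeholder name).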

Just FYI in case anyone is interested :)




On Thu, Feb 2, 2012 at 4:23 PM, Bradley Andersen
<bradley.d.andersen at gmail.com> wrote:
> The perlmongers link led me to xmllint:
>    xmllint -recover bak.179-TransferredCases.xml --output 179T.xml
>
> -recover tells xmllint to keep whatever it can still parse and throw
> away the rest, and, well, --output seems self-explanatory.
>
> But then look at this:
>   bradley at pvnp:~/x2s/xml2sql$ wc -l ../179T.xml
>   2152938 ../179T.xml
>
> So there's 2152938 valid lines, right?
>
> Not so fast:
>    bradley at pvnp:~/x2s/xml2sql$ xml2mssql.pl < ../179T.xml > 179T.sql
>
>    not well-formed (invalid token) at line 2927058, column 59, byte
> 115008246 at /usr/local/lib/perl5/XML/Parser.pm
>
> WHAT??!!
>
> So xml2mssql found an invalid token on line 2927058 of a 2152938-line file ...
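
One possible reason for a parser line number larger than the wc -l
count (an assumption, not something confirmed in this thread) is that
the two tools count line breaks differently: wc -l counts only LF
bytes, while expat, which XML::Parser wraps, also treats a bare CR as
a line break as far as I know. The sketch below counts both kinds of
byte and dumps the bytes around a reported offset. The function names
count_byte and bytes_around are made up, and head -c is assumed to be
available (it is on GNU and BSD systems).

```shell
# Sketch only: count_byte and bytes_around are hypothetical helper names.

# Count occurrences of one byte in a file.
# $1 = file, $2 = the byte as a tr escape, e.g. '\n' or '\r'
count_byte() {
    # tr -cd deletes every byte except the one named, wc -c counts what is left
    tr -cd "$2" < "$1" | wc -c | tr -d ' '
}

# Print the raw bytes in a window around a byte offset.
# $1 = file, $2 = byte offset, $3 = bytes of context on each side
bytes_around() {
    file=$1
    offset=$2
    ctx=$3
    start=$((offset - ctx))
    if [ "$start" -lt 0 ]; then
        start=0
    fi
    # tail -c +N starts output at the Nth byte (1-based)
    tail -c +"$((start + 1))" "$file" | head -c "$((2 * ctx))"
}
```

For the file above that would be along the lines of
count_byte ../179T.xml '\r'  and
bytes_around ../179T.xml 115008246 100 | od -c , which shows the
offending token and whether any stray CRs are present.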
>
>
>
>
>
>
> On Thu, Feb 2, 2012 at 12:45 PM, Joe Kline <gizmo at purdue.edu> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Well, my first thought was one of the "tree" parsers that walk the doc
>> tree so it doesn't have to parse the whole thing.
>>
>> Here's a Perl Monks node for dealing with invalid XML characters that
>> might point towards some ideas:
>>
>> http://www.perlmonks.org/?node_id=752527
>>
>> There's always stackexchange to see if this has been asked before, and
>> if not you should get some suggestions rather quickly.
>>
>> Maybe XML::SAX?
>>
>> joe
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v2.0.14 (GNU/Linux)
>> Comment: Using GnuPG with Red Hat - http://enigmail.mozdev.org/
>>
>> iEYEARECAAYFAk8qy84ACgkQb0mzA2gRTpk0NQCfaszn3n70v1hhEZXGluhRndCA
>> /VwAnAs7r1p+kcxhCnvmGC1Q69MbM9gA
>> =ZGHO
>> -----END PGP SIGNATURE-----
>> _______________________________________________
>> Purdue-pm mailing list
>> Purdue-pm at pm.org
>> http://mail.pm.org/mailman/listinfo/purdue-pm

