[Purdue-pm] XML + Perl

Bradley Andersen bradley.d.andersen at gmail.com
Thu Feb 2 08:15:46 PST 2012


So I need to convert a 700 MB XML file to MS SQL.

Try 1 - I tried using http://xml2sql.sourceforge.net/, but it calls
XML::Parser, and XML::Parser dies upon finding the first invalid
(strictly, not well-formed) element, as the XML standard requires.
The problem is that this document has, AFAICT, _thousands_ of such
elements: it is 16 _million_ lines long, and first passes indicate
about 5 errors every thousand lines, which would be on the order of
80,000 errors if that rate holds.

Try 2 - I tried using HTML Tidy.  Problem is, as it traverses this
large document, it takes more and more time to reach points of
failure, and when it fails, the process is killed, so I end up having
to trap errors like so:
 ( tidy -mi -xml 179-TransferredCases.xml &> errs.tidy) & sleep 300; kill $!

See that "sleep 300"? It started out at 3, then 6, then 12, and so
on, and each run captures about 6 errors before the process is
killed.  I have sed come in afterwards and clean up the bad lines, at
this point simply by deleting them:
 sed -i.bak -e '$lines' 179-TransferredCases.xml

All done through a script called tidy-sed.pl that I wrote for this.
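
The delete step above can be sketched in plain shell as follows. This
is only an illustration, not the contents of tidy-sed.pl: it assumes
the captured log contains messages of the form "line N column M - ..."
(the format tidy normally prints), and bad_lines and delete_lines are
hypothetical helper names. Deleting every bad line in one sed pass
means the line numbers do not shift between deletions.

```shell
#!/bin/sh
# Hypothetical helpers; not taken from tidy-sed.pl.

# Print the unique line numbers mentioned in a tidy-style log,
# assuming each message starts with "line N column M - ...".
bad_lines() {
    sed -n 's/^line \([0-9][0-9]*\) column [0-9][0-9]* - .*/\1/p' "$1" | sort -n -u
}

# Delete the given line numbers from FILE in a single sed pass,
# keeping the original as FILE.bak.
# Usage: delete_lines FILE N [N...]
delete_lines() {
    file=$1; shift
    [ "$#" -gt 0 ] || return 0          # nothing to delete
    script=''
    for n in "$@"; do
        script="${script}${n}d;"        # builds e.g. "12d;40d;"
    done
    sed -i.bak -e "$script" "$file"
}
```

Usage would look something like:
 delete_lines 179-TransferredCases.xml $(bad_lines errs.tidy)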

But that "sleep 300" only got me to around line 7 million, and now
it takes too long to trap even one error.
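
For reference, the background-and-kill pattern with a doubling time
limit can also be written with GNU timeout. This is a rough sketch
only: it assumes GNU coreutils is installed, run_until_output is a
hypothetical helper name, and it does nothing about the underlying
problem that each run has to start again from the top of the file.

```shell
#!/bin/sh
# Hypothetical helper: run a command under a time limit, doubling the
# limit until the log file is non-empty or the limit exceeds MAX.
# Usage: run_until_output START MAX LOGFILE COMMAND [ARGS...]
run_until_output() {
    limit=$1; max=$2; log=$3; shift 3
    while [ "$limit" -le "$max" ]; do
        # GNU timeout sends TERM to the command after $limit seconds.
        timeout "$limit" "$@" >"$log" 2>&1
        if [ -s "$log" ]; then
            return 0                    # something was captured
        fi
        limit=$((limit * 2))            # 3, 6, 12, ...
    done
    return 1                            # gave up with an empty log
}
```

Usage would look something like:
 run_until_output 3 300 errs.tidy tidy -mi -xml 179-TransferredCases.xml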

The plan was I would eliminate the bad elements that were killing me
in 'Try 1' so I could then use xml2mssql.

Try 3 - Now I started looking at some things I had earlier discounted,
like XML::Twig (to effectively rip the file into smaller pieces for
faster processing).  But every one of them simply dies on first
invalid element, per the XML standard.

What I really want (I think) is to run one of these parsers over the
file, have it _not_ die on hitting _any_ invalid elements, and
instead simply give me a list of all the bad elements so that I can
remove them, achieve a valid XML document, and then process it
with xml2mssql.
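
To make the idea concrete, here is a crude sketch of a single-pass
check that reports problems with line numbers and keeps going. It is
not a real XML parser: it only looks at tag nesting, so it will not
notice things like unescaped "&" characters, and it assumes no tag is
split across lines and no ">" appears inside an attribute value or a
comment. list_tag_errors is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: list tag-nesting problems with line numbers
# without stopping at the first one. Not a real XML parser.
list_tag_errors() {
    awk '
    {
        rest = $0
        # Walk over every <...> token on this line.
        while (match(rest, /<[^<>]*>/)) {
            tag  = substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)

            # Skip declarations, comments, processing instructions
            # and self-closing tags.
            if (tag ~ /^<[?!]/ || tag ~ /\/>$/) continue

            name = tag
            sub(/^<\/?/, "", name)          # drop the leading < or </
            sub(/[ \t>].*$/, "", name)      # drop attributes and >

            if (tag !~ /^<\//) {            # opening tag: push it
                depth++
                stack[depth]  = name
                opened[depth] = NR
                continue
            }

            # Closing tag: look for the nearest matching open tag.
            i = depth
            while (i > 0 && stack[i] != name) i--
            if (i == 0) {
                printf "%d: stray </%s>\n", NR, name
                continue
            }
            # Anything opened after the match was never closed.
            for (j = depth; j > i; j--)
                printf "%d: unclosed <%s>\n", opened[j], stack[j]
            depth = i - 1
        }
    }
    END {
        # Tags still open at end of file.
        for (j = depth; j > 0; j--)
            printf "%d: unclosed <%s>\n", opened[j], stack[j]
    }
    ' "$1"
}
```

Each output line has the form "LINE: message", so the line numbers
can be pulled out with cut -d: -f1 and reviewed before anything is
deleted.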

This is actually my first real try at processing XML, and I would like
to _not_ have to roll something of my own.  What am I missing, please?

Thank you!

