[Purdue-pm] XML + Perl
westerman at purdue.edu
Thu Feb 2 08:30:35 PST 2012
Every time I have tried parsing XML with a Perl-based parser, it has been picky about the format. As you said,
"... dies upon finding the first invalid element (as per the XML standard) ..."
Your idea of making the file well-formed is a good one. However, I suspect it will take some gnarly custom code to do so, at which point you might as well parse the file yourself and write directly to SQL.
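That said, if the bad elements really are confined to single lines, the "gnarly custom code" can stay fairly small. Here is a minimal sketch of the repair loop, in Python for illustration only (the same parse-catch-retry loop can be written around XML::Parser inside an eval). The function name is mine, and note the big caveat: it re-scans the file from the top after every error, so on a 16-million-line file with tens of thousands of errors it hits the same quadratic wall as the tidy loop. It is only practical when errors are few, or when run on smaller chunks.

```python
# Sketch: parse, note the line where the parser died, blank that line,
# and retry until the document is well-formed. Blanking (rather than
# deleting) keeps later line numbers stable between passes.
import xml.parsers.expat

def find_bad_lines(text, max_errors=10000):
    """Return (cleaned_text, bad_line_numbers) for an XML string."""
    lines = text.split("\n")
    bad = []
    while len(bad) < max_errors:
        parser = xml.parsers.expat.ParserCreate()
        try:
            parser.Parse("\n".join(lines), True)
            return "\n".join(lines), bad          # well-formed now
        except xml.parsers.expat.ExpatError as err:
            bad.append(err.lineno)                # 1-based line number
            lines[err.lineno - 1] = ""            # blank the bad line
    # If blanking a line cannot fix an error (e.g. an unclosed tag
    # reported at end of file), the max_errors cap stops the loop.
    return "\n".join(lines), bad
```

The returned line-number list is exactly what you would feed to your sed step, except here the offending lines are already blanked.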
----- Original Message -----
> So I need to convert a 700 MB XML file to MS SQL.
> Try 1 - I tried using http://xml2sql.sourceforge.net/, but it calls
> XML::Parser, and XML::Parser dies upon finding the first invalid
> element (as per the XML standard).
> The problem is, this document has, AFAICT, _thousands_ of invalid XML
> elements, as it is 16 _million_ lines long, and first passes have
> indicated about 5 errors every thousand lines.
> Try 2 - I tried using HTML tidy. Problem is, as it traverses this
> large document, it takes more and more time to reach points of
> failure, and when it fails, the process is killed, so I end up having
> to trap errors like so:
> ( tidy -mi -xml 179-TransferredCases.xml &> errs.tidy) & sleep 300;
> kill $!
> See that "sleep 300"? It started out at 3, then 6, then 12, and so on,
> and tidy captures about 6 errors each run before the kill fires. I then
> have sed come in afterward and clean up the bad lines, at this point
> simply by deleting them:
> sed -i.bak -e '$lines' 179-TransferredCases.xml
> All done through a script called tidy-sed.pl that I wrote for this.
> But that "sleep 300" only got me to around line 7 million, and now
> it takes too long to trap even one error.
> The plan was I would eliminate the bad elements that were killing me
> in 'Try 1' so I could then use xml2mssql.
> Try 3 - Now I started looking at some things I had earlier discounted,
> like XML::Twig (to effectively rip the file into smaller pieces for
> faster processing). But every one of them simply dies on the first
> invalid element, per the XML standard.
> What I really want (I think) is to run one of these parsers over the
> file, have it _not_ die on hitting _any_ invalid elements, and instead
> simply provide me with a list of all the bad elements so that I can
> remove them, achieve a valid XML document, and then process it
> with xml2mssql.
> This is actually my first real try at processing XML, and I would like
> to _not_ have to spin up something of my own. What am I missing, please?
> Thank you!
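One more thought on your Try 3: you can get the "rip into smaller pieces" effect without XML::Twig. Split the file into per-record chunks with a regex, parse each chunk on its own, and collect the failures instead of dying on the first one. A sketch in Python for illustration; the <Case> record name is a guess at your schema, and the regex assumes record elements are never nested (and note that a mangled close tag can make one match swallow its neighbor).

```python
# Sketch: one record element per chunk; keep the good ones as parsed
# elements and the bad ones as raw text for later inspection/repair.
import re
import xml.etree.ElementTree as ET

def salvage_records(text, record_tag="Case"):
    """Return (good_elements, bad_chunks) for <record_tag> chunks."""
    # Non-greedy match from each open tag to the nearest close tag;
    # assumes records are not nested inside each other.
    pattern = re.compile(r"<{0}\b.*?</{0}>".format(record_tag), re.DOTALL)
    good, bad = [], []
    for chunk in pattern.findall(text):
        try:
            good.append(ET.fromstring(chunk))     # well-formed record
        except ET.ParseError:
            bad.append(chunk)                     # keep for the report
    return good, bad
```

Since each chunk is parsed independently, one bad record costs you only that record, and the bad list doubles as the "list of all the bad elements" you asked for.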
westerman at purdue.edu
Bioinformatics specialist at the Genomics Facility.
Phone: (765) 494-0505 FAX: (765) 496-7255
Department of Horticulture and Landscape Architecture
625 Agriculture Mall Drive
West Lafayette, IN 47907-2010
Physically located in room S049, WSLR building