From bradley.d.andersen at gmail.com Thu Feb 2 08:15:46 2012
From: bradley.d.andersen at gmail.com (Bradley Andersen)
Date: Thu, 2 Feb 2012 11:15:46 -0500
Subject: [Purdue-pm] XML + Perl
In-Reply-To:
References:
Message-ID:

So I need to convert a 700 MB XML file to MS SQL.

Try 1 - I tried using http://xml2sql.sourceforge.net/, but it calls
XML::Parser, and XML::Parser dies upon finding the first invalid
element (as per the XML standard).
The problem is, this document has, AFAICT, _thousands_ of invalid XML
elements, as it is 16 _million_ lines long, and first passes have
indicated about 5 errors every thousand lines.

Try 2 - I tried using HTML tidy.  Problem is, as it traverses this
large document, it takes more and more time to reach points of
failure, and when it fails, the process is killed, so I end up having
to trap errors like so:

    ( tidy -mi -xml 179-TransferredCases.xml &> errs.tidy) & sleep 300; kill $!

See that "sleep 300"?  It starts out at 3, then 6, then 12, ... then
... and captures about 6 errors each time before it's killed.  I have
sed come in after and clean up the bad lines, at this point, simply by
deleting them:

    sed -i.bak -e '$lines' 179-TransferredCases.xml

All done through a script called tidy-sed.pl that I wrote for this.

But that "sleep 300" only got me to around line 7 million, and now it
takes too long to trap even one error.

The plan was I would eliminate the bad elements that were killing me
in 'Try 1' so I could then use xml2mssql.

Try 3 - Now I started looking at some things I had earlier discounted,
like XML::Twig (to effectively rip the file into smaller pieces for
faster processing).  But every one of them simply dies on first
invalid element, per the XML standard.

What I really want (I think) is to run one of these parsers over the
file, have it _not_ die on hitting _any_ invalid elements, and instead
provide me with a list of all the bad elements so that I can remove
them, achieve a valid XML document, and then process it with
xml2mssql.

This is actually my first real try at processing XML, and I would like
to _not_ have to spin something on my own.  What am I missing, please?

Thank you!

From jacoby at purdue.edu Thu Feb 2 08:29:24 2012
From: jacoby at purdue.edu (Dave Jacoby)
Date: Thu, 02 Feb 2012 11:29:24 -0500
Subject: [Purdue-pm] XML + Perl
In-Reply-To:
References:
Message-ID: <4F2AB9E4.4000107@purdue.edu>

On 2/2/2012 11:15 AM, Bradley Andersen wrote:
> But every one of them simply dies on first
> invalid element, per the XML standard.

There's a hole in the XML protocol where all the time goes
Jon Postel died for nothing, I suppose.

My first step would be to figure out the nature of these errors, to
see if there are error classes that can be found and mass-fixed.

What generated this huge mass of XML?  I would hope that an XML
generator would not spit out something that an XML parser could not
accept.

--
Dave Jacoby                Address: WSLR S049
Code Maker                 Mail:    jacoby at purdue.edu
Purdue University          Phone:   765.49.67368
  795 days until the end of XP support

From westerman at purdue.edu Thu Feb 2 08:30:35 2012
From: westerman at purdue.edu (Rick Westerman)
Date: Thu, 2 Feb 2012 11:30:35 -0500 (EST)
Subject: [Purdue-pm] XML + Perl
In-Reply-To:
Message-ID: <1002132585.325296.1328200235167.JavaMail.root@mailhub016.itcs.purdue.edu>

All of the times I have tried parsing XML with a Perl-based parser,
they have been picky about the format.  As you said, "... dies upon
finding the first invalid element (as per the XML standard) ..."
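To see just how little it takes to trip one of them, here is a minimal
illustration -- nothing from the file in question, just a throwaway
string with a bare ampersand in it, and the exact wording of the
message will vary with the XML::Parser version:

    perl -MXML::Parser -e 'eval { XML::Parser->new->parse(q{<r>fish & chips</r>}) }; print $@'

With no handlers registered, parse() only checks well-formedness, and
when it dies the message carries the line, column, and byte offset of
the first offending token, which is the part worth harvesting.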
Your idea of making the file compatible is a good one.  However, I
suspect that it will take some gnarly custom code to do so.  At which
point you might as well parse and write directly to SQL.

----- Original Message -----
> So I need to convert a 700 MB XML file to MS SQL.
>
> Try 1 - I tried using http://xml2sql.sourceforge.net/, but it calls
> XML::Parser, and XML::Parser dies upon finding the first invalid
> element (as per the XML standard).
> The problem is, this document has, AFAICT, _thousands_ of invalid XML
> elements, as it is 16 _million_ lines long, and first passes have
> indicated about 5 errors every thousand lines.
>
> Try 2 - I tried using HTML tidy.  Problem is, as it traverses this
> large document, it takes more and more time to reach points of
> failure, and when it fails, the process is killed, so I end up having
> to trap errors like so:
>     ( tidy -mi -xml 179-TransferredCases.xml &> errs.tidy) & sleep 300; kill $!
>
> See that "sleep 300"?  It starts out at 3, then 6, then 12, ... then
> ... and captures about 6 errors each time before it's killed.  I have
> sed come in after and clean up the bad lines, at this point, simply by
> deleting them:
>     sed -i.bak -e '$lines' 179-TransferredCases.xml
>
> All done through a script called tidy-sed.pl that I wrote for this.
>
> But that "sleep 300" only got me to around line 7 million, and now
> it takes too long to trap even one error.
>
> The plan was I would eliminate the bad elements that were killing me
> in 'Try 1' so I could then use xml2mssql.
>
> Try 3 - Now I started looking at some things I had earlier discounted,
> like XML::Twig (to effectively rip the file into smaller pieces for
> faster processing).  But every one of them simply dies on first
> invalid element, per the XML standard.
>
> What I really want (I think) is to run one of these parsers over the
> file, have it _not_ die on hitting _any_ invalid elements, and
> instead provide me with a list of all the bad elements so that I can
> remove them, achieve a valid XML document, and then process it with
> xml2mssql.
>
> This is actually my first real try at processing XML, and I would like
> to _not_ have to spin something on my own.  What am I missing, please?
>
> Thank you!
> _______________________________________________
> Purdue-pm mailing list
> Purdue-pm at pm.org
> http://mail.pm.org/mailman/listinfo/purdue-pm

--
Rick Westerman     westerman at purdue.edu
Bioinformatics specialist at the Genomics Facility.
Phone: (765) 494-0505  FAX: (765) 496-7255
Department of Horticulture and Landscape Architecture
625 Agriculture Mall Drive
West Lafayette, IN  47907-2010
Physically located in room S049, WSLR building

From bradley.d.andersen at gmail.com Thu Feb 2 08:32:56 2012
From: bradley.d.andersen at gmail.com (Bradley Andersen)
Date: Thu, 2 Feb 2012 11:32:56 -0500
Subject: [Purdue-pm] XML + Perl
In-Reply-To: <4F2AB9E4.4000107@purdue.edu>
References: <4F2AB9E4.4000107@purdue.edu>
Message-ID:

Here's the background:

Some guy who used to work where I work executed some queries and for
some reason dumped the results to XML.  Now the DB is not accessible
and Some guy is not accessible, and the task for me is to try and get
the data back from this XML.

It's convoluted and should not have occurred in the first place :(

On Thu, Feb 2, 2012 at 11:29 AM, Dave Jacoby wrote:
> On 2/2/2012 11:15 AM, Bradley Andersen wrote:
>>
>> But every one of them simply dies on first
>> invalid element, per the XML standard.
>
>
> There's a hole in the XML protocol where all the time goes
> Jon Postel died for nothing, I suppose.
>
> My first step would be to figure out the nature of these errors, to
> see if there are error classes that can be found and mass-fixed.
>
> What generated this huge mass of XML?  I would hope that an XML
> generator would not spit out something that an XML parser could not
> accept.
>
> --
> Dave Jacoby                Address: WSLR S049
> Code Maker                 Mail:    jacoby at purdue.edu
> Purdue University          Phone:   765.49.67368
>   795 days until the end of XP support
>
> _______________________________________________
> Purdue-pm mailing list
> Purdue-pm at pm.org
> http://mail.pm.org/mailman/listinfo/purdue-pm

From jacoby at purdue.edu Thu Feb 2 09:21:23 2012
From: jacoby at purdue.edu (Dave Jacoby)
Date: Thu, 02 Feb 2012 12:21:23 -0500
Subject: [Purdue-pm] XML + Perl
In-Reply-To:
References: <4F2AB9E4.4000107@purdue.edu>
Message-ID: <4F2AC613.3040606@purdue.edu>

On 2/2/2012 11:32 AM, Bradley Andersen wrote:
> Here's the background:
>
> Some guy who used to work where I work executed some queries and for
> some reason dumped the results to XML.  Now the DB is not accessible
> and Some guy is not accessible, and the task for me is to try and get
> the data back from this XML.
>
> It's convoluted and should not have occurred in the first place :(

It happens.  We had a power outage yesterday.  The instrument had a
UPS, but the computer controlling it didn't, so, practically, the
instrument did not have a UPS.  And we had one floating around,
unused.  Should not have been like that, but I didn't think.

By spec, XML is supposed to be well-formed, so it is brittle by
design.  I'd think about writing try and catch into something, but
from what little I've done with XML::Parser, I'd think it was an
all-or-nothing thing.

MySQL exports as SQL, so the restore looks like mysql < backup.sql.  I
hope you can set the new backup scheme into something equally sane.

And I'm sure this has been nearly useless to you.  Sorry, and good luck.

--
Dave Jacoby                Address: WSLR S049
Code Maker                 Mail:    jacoby at purdue.edu
Purdue University          Phone:   765.49.67368
  795 days until the end of XP support

From bradley.d.andersen at gmail.com Thu Feb 2 09:26:35 2012
From: bradley.d.andersen at gmail.com (Bradley Andersen)
Date: Thu, 2 Feb 2012 12:26:35 -0500
Subject: [Purdue-pm] XML + Perl
In-Reply-To: <4F2AC613.3040606@purdue.edu>
References: <4F2AB9E4.4000107@purdue.edu> <4F2AC613.3040606@purdue.edu>
Message-ID:

well if i were in charge of this little data xfer, i would certainly
have taken my usual route:

    mysqldump -u -p dbname > dbname.`date +%s`.sql

then a quickie

    mysql -u -p dbname < dbname.datefromabove.sql

works really well.

unfortunately, i am not privy to why they output it this way and it
seems prohibitively difficult (at least) to recreate the queries and
do it the "right" way.

as for the responses from this list, i'm happy anyone responded at
all.  i was thinking maybe i had a gaping hole in my understanding,
but it turns out i do not, which is a nice validation :)

surely a perl module needs to be written that does not keel over and
die just to meet the standard.  there could be a flag that says, "hey
XML::Parser, if you want to die, don't; just output the $@ to a file,
kthxbye!"  i am actually surprised that this does not seem to already
exist.
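fwiw, the poor man's version of that flag is just to trap the die
yourself.  a rough, untested sketch -- it assumes every error message
has "at line N" in it, the way the XML::Parser errors do, and it leans
on the same sed -i.bak trick as my tidy-sed.pl:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Parser;

    my $file = shift or die "usage: $0 file.xml\n";
    my @bad;    # line numbers the parser choked on

    while (1) {
        # no handlers registered, so this is only a well-formedness check
        eval { XML::Parser->new->parsefile($file) };
        last unless $@;                      # parsed clean -- done
        my ($line) = $@ =~ /at line (\d+)/
            or die "no line number in error: $@";
        push @bad, $line;
        warn "bad element near line $line\n";
        # same idea as the tidy-sed loop: drop the offending line, retry
        system('sed', '-i.bak', "${line}d", $file) == 0
            or die "sed failed on line $line";
    }

    print "lines removed: @bad\n";

the obvious weakness is the same one the tidy/sed loop has: it
restarts the parse from the top after every deletion, so on 16 million
lines it will crawl.  it just saves guessing at sleep times.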
On Thu, Feb 2, 2012 at 12:21 PM, Dave Jacoby wrote:
> On 2/2/2012 11:32 AM, Bradley Andersen wrote:
>>
>> Here's the background:
>>
>> Some guy who used to work where I work executed some queries and for
>> some reason dumped the results to XML.  Now the DB is not accessible
>> and Some guy is not accessible, and the task for me is to try and get
>> the data back from this XML.
>>
>> It's convoluted and should not have occurred in the first place :(
>
>
> It happens.  We had a power outage yesterday.  The instrument had a
> UPS, but the computer controlling it didn't, so, practically, the
> instrument did not have a UPS.  And we had one floating around,
> unused.  Should not have been like that, but I didn't think.
>
> By spec, XML is supposed to be well-formed, so it is brittle by
> design.  I'd think about writing try and catch into something, but
> from what little I've done with XML::Parser, I'd think it was an
> all-or-nothing thing.
>
> MySQL exports as SQL, so the restore looks like mysql < backup.sql.  I
> hope you can set the new backup scheme into something equally sane.
>
> And I'm sure this has been nearly useless to you.  Sorry, and good luck.
>
>
> --
> Dave Jacoby                Address: WSLR S049
> Code Maker                 Mail:    jacoby at purdue.edu
> Purdue University          Phone:   765.49.67368
>   795 days until the end of XP support
>

From gizmo at purdue.edu Thu Feb 2 09:45:51 2012
From: gizmo at purdue.edu (Joe Kline)
Date: Thu, 02 Feb 2012 12:45:51 -0500
Subject: [Purdue-pm] XML + Perl
In-Reply-To:
References:
Message-ID: <4F2ACBCF.7040106@purdue.edu>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Well, my first thought was one of the "tree" parsers that walk the doc
tree so it doesn't have to parse the whole thing.

Here's a Perl Monks node for dealing with invalid XML characters that
might point towards some ideas:

http://www.perlmonks.org/?node_id=752527

There's always stackexchange to see if this has been asked before, and
if not you should get some suggestions rather quickly.

Maybe XML::SAX?

joe
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
Comment: Using GnuPG with Red Hat - http://enigmail.mozdev.org/

iEYEARECAAYFAk8qy84ACgkQb0mzA2gRTpk0NQCfaszn3n70v1hhEZXGluhRndCA
/VwAnAs7r1p+kcxhCnvmGC1Q69MbM9gA
=ZGHO
-----END PGP SIGNATURE-----

From bradley.d.andersen at gmail.com Thu Feb 2 13:23:31 2012
From: bradley.d.andersen at gmail.com (Bradley Andersen)
Date: Thu, 2 Feb 2012 16:23:31 -0500
Subject: [Purdue-pm] XML + Perl
In-Reply-To: <4F2ACBCF.7040106@purdue.edu>
References: <4F2ACBCF.7040106@purdue.edu>
Message-ID:

The Perl Monks link led me to xmllint:

    xmllint -recover bak.179-TransferredCases.xml --output 179T.xml

-recover tells xmllint to keep the good and throw away the bad, and,
well, --output seems self-explanatory.

But then look at this:

    bradley at pvnp:~/x2s/xml2sql$ wc -l ../179T.xml
    2152938 ../179T.xml

So there's 2152938 valid lines, right?

Not so fast:

    bradley at pvnp:~/x2s/xml2sql$ xml2mssql.pl < ../179T.xml > 179T.sql

    not well-formed (invalid token) at line 2927058, column 59, byte
    115008246 at /usr/local/lib/perl5/XML/Parser.pm

WHAT??!!

So xml2mssql found an invalid token on line 2927058 of a 2152938-line file ...

On Thu, Feb 2, 2012 at 12:45 PM, Joe Kline wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Well, my first thought was one of the "tree" parsers that walk the doc
> tree so it doesn't have to parse the whole thing.
>
> Here's a Perl Monks node for dealing with invalid XML characters that
> might point towards some ideas:
>
> http://www.perlmonks.org/?node_id=752527
>
> There's always stackexchange to see if this has been asked before, and
> if not you should get some suggestions rather quickly.
>
> Maybe XML::SAX?
>
> joe
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.14 (GNU/Linux)
> Comment: Using GnuPG with Red Hat - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk8qy84ACgkQb0mzA2gRTpk0NQCfaszn3n70v1hhEZXGluhRndCA
> /VwAnAs7r1p+kcxhCnvmGC1Q69MbM9gA
> =ZGHO
> -----END PGP SIGNATURE-----
> _______________________________________________
> Purdue-pm mailing list
> Purdue-pm at pm.org
> http://mail.pm.org/mailman/listinfo/purdue-pm

From jacoby at purdue.edu Wed Feb 8 07:51:15 2012
From: jacoby at purdue.edu (Dave Jacoby)
Date: Wed, 08 Feb 2012 10:51:15 -0500
Subject: [Purdue-pm] Ideas for Next Meeting?
Message-ID: <4F3299F3.4050508@purdue.edu>

I've taken over the PurduePM twitter account, and I'm trying to start
generating some buzz, so having more talks, rather than a general
discussion, would be helpful.  (Not that I don't like the general
discussion.)

Anybody have talks you're willing/able to give for the Feb 21 meeting?

I mostly have one on Dancer, the MVC framework I'm *almost* ready to
deploy.  I could talk a little bit about OAuth and its use in my
command-line twitter client, and I could get somewhere and have
something demonstrable with Twilio, an API to telephony.  But I'd
rather not ramble on for an hour.

--
Dave Jacoby                Address: WSLR S049
Code Maker                 Mail:    jacoby at purdue.edu
Purdue University          Phone:   765.49.67368
  789 days until the end of XP support

From bradley.d.andersen at gmail.com Thu Feb 9 17:19:18 2012
From: bradley.d.andersen at gmail.com (Bradley Andersen)
Date: Thu, 9 Feb 2012 20:19:18 -0500
Subject: [Purdue-pm] Perl Position Opening
Message-ID:

Hi,

My company is looking for a Perl Developer.  The position would be
local to Indianapolis, not telecommute-friendly.

If you or someone you know is interested, please reply to me privately
and I will provide more details.

Thank you,
Brad Andersen

From bradley.d.andersen at gmail.com Wed Feb 15 09:17:17 2012
From: bradley.d.andersen at gmail.com (Bradley Andersen)
Date: Wed, 15 Feb 2012 12:17:17 -0500
Subject: [Purdue-pm] SOLVED -- Re: XML + Perl
Message-ID:

I actually hadn't looked at this again until this morning.

The problem is, there are supposed to be 179 nodes of interest over 16
million+ lines, so I was not able to easily determine the structure,
as editors seemed to die trying to read it.

I sat down this morning and split the file into 160 parts (files) of
size ~ 100,000 lines, looked at the first and last parts (files 1 and
160), and I think I have the structure now.  I'll run some random
sample of the other chunks to verify.  But I think it is easily solved
now without XML parsers.

Just FYI in case anyone is interested :)

On Thu, Feb 2, 2012 at 4:23 PM, Bradley Andersen wrote:
> The Perl Monks link led me to xmllint:
>     xmllint -recover bak.179-TransferredCases.xml --output 179T.xml
>
> -recover tells xmllint to keep the good and throw away the bad, and,
> well, --output seems self-explanatory.
>
> But then look at this:
>     bradley at pvnp:~/x2s/xml2sql$ wc -l ../179T.xml
>     2152938 ../179T.xml
>
> So there's 2152938 valid lines, right?
>
> Not so fast:
>     bradley at pvnp:~/x2s/xml2sql$ xml2mssql.pl < ../179T.xml > 179T.sql
>
>     not well-formed (invalid token) at line 2927058, column 59, byte
>     115008246 at /usr/local/lib/perl5/XML/Parser.pm
>
> WHAT??!!
>
> So xml2mssql found an invalid token on line 2927058 of a 2152938-line file ...
>
>
> On Thu, Feb 2, 2012 at 12:45 PM, Joe Kline wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Well, my first thought was one of the "tree" parsers that walk the doc
>> tree so it doesn't have to parse the whole thing.
>>
>> Here's a Perl Monks node for dealing with invalid XML characters that
>> might point towards some ideas:
>>
>> http://www.perlmonks.org/?node_id=752527
>>
>> There's always stackexchange to see if this has been asked before, and
>> if not you should get some suggestions rather quickly.
>>
>> Maybe XML::SAX?
>>
>> joe
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v2.0.14 (GNU/Linux)
>> Comment: Using GnuPG with Red Hat - http://enigmail.mozdev.org/
>>
>> iEYEARECAAYFAk8qy84ACgkQb0mzA2gRTpk0NQCfaszn3n70v1hhEZXGluhRndCA
>> /VwAnAs7r1p+kcxhCnvmGC1Q69MbM9gA
>> =ZGHO
>> -----END PGP SIGNATURE-----
>> _______________________________________________
>> Purdue-pm mailing list
>> Purdue-pm at pm.org
>> http://mail.pm.org/mailman/listinfo/purdue-pm

From jacoby at purdue.edu Wed Feb 15 09:54:00 2012
From: jacoby at purdue.edu (Dave Jacoby)
Date: Wed, 15 Feb 2012 12:54:00 -0500
Subject: [Purdue-pm] SOLVED -- Re: XML + Perl
In-Reply-To:
References:
Message-ID: <4F3BF138.7020801@purdue.edu>

On 2/15/2012 12:17 PM, Bradley Andersen wrote:
> The problem is, there are supposed to be 179 nodes of interest over 16
> million+ lines, so I was not able to easily determine the structure,
> as editors seemed to die trying to read it.
>
> I sat down this morning and split the file into 160 parts (files) of
> size ~ 100,000 lines, looked at the first and last parts (files 1 and
> 160), and I think I have the structure now.  I'll run some random
> sample of the other chunks to verify.  But I think it is easily solved
> now without XML parsers.
>
> Just FYI in case anyone is interested :)

Good for you.  Divide and conquer is always a good debugging principle.

--
Dave Jacoby                Address: WSLR S049
Code Maker                 Mail:    jacoby at purdue.edu
Purdue University          Phone:   765.49.67368
  782 days until the end of XP support

From mark at purdue.edu Mon Feb 20 04:32:44 2012
From: mark at purdue.edu (Mark Senn)
Date: Mon, 20 Feb 2012 07:32:44 -0500
Subject: [Purdue-pm] _Programming Perl_ fourth edition
Message-ID: <14875.1329741164@pier.ecn.purdue.edu>

_Programming Perl_ (also known as the camel book because there is a
camel on the cover) ebook is out.  Printed version is estimated to be
out later this month.  It is advertised on
http://shop.oreilly.com/product/9780596004927.do
for $19.99 ebook, $54 pre-order physical book, buy 2 get 1 free,
free shipping on orders over $29.99, and 100% guarantee.  It covers
Perl 5.14 with preview of features in the upcoming 5.16.

Would different people like to put their orders together to save money?

I have not checked if the Purdue Perl Mongers as a user group could get
a better deal than this.  I'll do that next.
-mark

From mdw at purdue.edu Mon Feb 20 06:07:03 2012
From: mdw at purdue.edu (Mark Daniel Ward)
Date: Mon, 20 Feb 2012 09:07:03 -0500
Subject: [Purdue-pm] _Programming Perl_ fourth edition
In-Reply-To: <14875.1329741164@pier.ecn.purdue.edu>
References: <14875.1329741164@pier.ecn.purdue.edu>
Message-ID: <4F425387.4050501@purdue.edu>

Dear Mark,

It's only $29.19 on Amazon for the pre-order paperback, and they have a
price guarantee that, if the price drops before it is printed (a bit
unlikely in this case, I guess), you automatically get the lower price.

http://www.amazon.com/dp/0596004923/

Still, this beats the buy 2 get 1 free pricing on O'Reilly itself.

By the way, for the e-book, I'm sure we'll get this in Safari bookshelf
through Purdue libraries if people want to read it electronically, at
least while connected to the web.

Mark

On 2/20/12 7:32 AM, Mark Senn wrote:
> _Programming Perl_ (also known as the camel book because there is a
> camel on the cover) ebook is out.  Printed version is estimated to be
> out later this month.  It is advertised on
> http://shop.oreilly.com/product/9780596004927.do
> for $19.99 ebook, $54 pre-order physical book, buy 2 get 1 free,
> free shipping on orders over $29.99, and 100% guarantee.  It covers
> Perl 5.14 with preview of features in the upcoming 5.16.
>
> Would different people like to put their orders together to save money?
>
> I have not checked if the Purdue Perl Mongers as a user group could get
> a better deal than this.  I'll do that next.
>
> -mark
> _______________________________________________
> Purdue-pm mailing list
> Purdue-pm at pm.org
> http://mail.pm.org/mailman/listinfo/purdue-pm

--
8th International Purdue Symposium on Statistics
"Diversity in the Statistical Sciences for the 21st Century"
http://www.stat.purdue.edu/symp2012/
Purdue University, June 20 - 24, 2012

From jacoby at purdue.edu Mon Feb 20 06:56:23 2012
From: jacoby at purdue.edu (Dave Jacoby)
Date: Mon, 20 Feb 2012 09:56:23 -0500
Subject: [Purdue-pm] _Programming Perl_ fourth edition
In-Reply-To: <4F425387.4050501@purdue.edu>
References: <14875.1329741164@pier.ecn.purdue.edu> <4F425387.4050501@purdue.edu>
Message-ID: <4F425F17.4060200@purdue.edu>

On 2/20/2012 9:07 AM, Mark Daniel Ward wrote:
>
> By the way, for the e-book, I'm sure we'll get this in Safari bookshelf
> through Purdue libraries if people want to read it electronically, at
> least while connected to the web.

Valid and good points, except:

1) there's a finite number of licenses, and those fill up often during
the work day.  If you're a night hacker, or here during summer months,
it's better.

2) I can't connect to Safari when I connect to campus via VPN.  I can
then RDP into my work Win box, or perhaps SSH redirect, but that seems
overboard.  If you buy it, you can have it with you.

3) I have the one O'Reilly ebook on my phone, my laptop and my Nook,
so when I finally get around to writing Android applications, I can
have that info where I'm programming, where I'm testing, or on a third
screen so I can read, then program, then test.

4) There are books that O'Reilly has, that O'Reilly has on Safari,
that you cannot get to via Purdue's Safari.  I've bumped into that.  I
can't say *which* books they are, which doesn't help my case, I know,
but I've had searches, at moments when the licenses are pretty full,
where the book I want is there, but when I log in again and re-search,
it is gone.
--
Dave Jacoby                Address: WSLR S049
Code Maker                 Mail:    jacoby at purdue.edu
Purdue University          Phone:   765.49.67368
  777 days until the end of XP support

From mdw at purdue.edu Mon Feb 20 07:02:53 2012
From: mdw at purdue.edu (Mark Daniel Ward)
Date: Mon, 20 Feb 2012 10:02:53 -0500
Subject: [Purdue-pm] _Programming Perl_ fourth edition
In-Reply-To: <4F425F17.4060200@purdue.edu>
References: <14875.1329741164@pier.ecn.purdue.edu> <4F425387.4050501@purdue.edu> <4F425F17.4060200@purdue.edu>
Message-ID: <4F42609D.9030404@purdue.edu>

Dear Dave,

All excellent points!  Thank you!

To help in your 4th case (just FYI), I found out that there's a limited
number of books that Purdue is licensed to keep on our Safari bookshelf
for the campus.  Fortunately, the set of books that Purdue chooses to
make available is completely dynamic.  I.e., you can ask the staff from
the library to update the list, if you want a book added, or if a book
you use is removed.  My understanding is that they try to only remove
access to books that nobody is using often, to make room for new books.
They have added books very quickly for me, whenever I requested this....

Last time I checked, Charlotte Erdmann ( erdmann at purdue.edu ) was the
person to contact if you want books added to the Safari bookshelf that
aren't there.

I've used access to O'Reilly books through Safari several times while
teaching, to save my students money, and they had a very positive
response to it.  I agree that it can be difficult from off campus.

I encourage you to ask her about it (if Charlotte still handles such
requests).  I got a very positive and quick response, every time I
asked about the Safari bookshelf.

Best wishes,
Mark

On 2/20/12 9:56 AM, Dave Jacoby wrote:
> On 2/20/2012 9:07 AM, Mark Daniel Ward wrote:
>>
>> By the way, for the e-book, I'm sure we'll get this in Safari bookshelf
>> through Purdue libraries if people want to read it electronically, at
>> least while connected to the web.
>
> Valid and good points, except:
>
> 1) there's a finite number of licenses, and those fill up often during
> the work day.  If you're a night hacker, or here during summer months,
> it's better.
>
> 2) I can't connect to Safari when I connect to campus via VPN.  I can
> then RDP into my work Win box, or perhaps SSH redirect, but that seems
> overboard.  If you buy it, you can have it with you.
>
> 3) I have the one O'Reilly ebook on my phone, my laptop and my Nook,
> so when I finally get around to writing Android applications, I can
> have that info where I'm programming, where I'm testing, or on a third
> screen so I can read, then program, then test.
>
> 4) There are books that O'Reilly has, that O'Reilly has on Safari,
> that you cannot get to via Purdue's Safari.  I've bumped into that.  I
> can't say *which* books they are, which doesn't help my case, I know,
> but I've had searches, at moments when the licenses are pretty full,
> where the book I want is there, but when I log in again and re-search,
> it is gone.
>

--
8th International Purdue Symposium on Statistics
"Diversity in the Statistical Sciences for the 21st Century"
http://www.stat.purdue.edu/symp2012/
Purdue University, June 20 - 24, 2012

From jacoby at purdue.edu Mon Feb 20 07:10:37 2012
From: jacoby at purdue.edu (Dave Jacoby)
Date: Mon, 20 Feb 2012 10:10:37 -0500
Subject: [Purdue-pm] Old Geezer Reminiscing
Message-ID: <4F42626D.9090502@purdue.edu>

As we know, Purdue has several ways to address email.
There's the qualified name (david.a.jacoby.1 at purdue.edu), the career
account name (djacoby at purdue.edu) and the alias (jacoby at purdue.edu).
I am subscribed to this list as jacoby at purdue.edu, but my email client
on my phone has me sending and receiving mail as djacoby at purdue.edu.
I sent a reply to Mark and the list this AM, which was bounced because
I sent it from my phone and thus djacoby at purdue and not
jacoby at purdue.

purdue-pm is a mailman list.  I came up on LISTSERV, and I know that
you could set a LISTSERV list so you could be subscribed but not
receive.  If that setting is available to me for purdue-pm, it isn't
clear from the web interface.  It was useful at the time so you didn't
get all the mail when you went on vacation, for example, and it would
be useful to me now to allow me to receive mail only once when I'm
subscribed twice.  As is, djacoby is now set to receive the digest,
while jacoby is set to receive things as they come out.

Things were so much better in the good old days.  Now, get off my lawn!

--
Dave Jacoby                Address: WSLR S049
Code Maker                 Mail:    jacoby at purdue.edu
Purdue University          Phone:   765.49.67368
  777 days until the end of XP support