[Pdx-pm] reading a broken CSV file

John Sechrest sechrest at peak.org
Fri Nov 21 18:00:04 CST 2003


Why not do a text substitution ? Do you have any indicator of what 
an end of field looks like?

Can you say that a record only ends when you have a " on the end of a line?
Or do you have to count the records.

Sounds like a pre-parser to force things into the right form is a good 
place to start.




Eric Shore Baur <ebaur at aracnet.com> writes:

 % 
 % 	I doing an import from a CSV-style text file into a SQL database.
 % The data is set up so that I have one set of text files with a field
 % listing in them (so I know what matches up with what) and then the data
 % files in a parent directory.
 % 	The data format looks something like this:
 % 
 % "title","some text","a date is next",1999/05/10,T,123,F,F,T,"more text"
 % 
 % 	Fine... I can import that.  Unfortunatly, some of the records have
 % embeded newlines in them, so you end up with something like this:
 % 
 % "title","some text","a date is next",1999/05/10,T,123,F,F,T,"more text
 % goes here
 % until
 % the record
 % is done"
 % 
 % 	... or, potentially:
 % 
 % "title","some text goes
 % over
 % multiple
 % lines","a date is next",1999/05/10,T,123,F,F,T,"more text"
 % 
 % 	What I've been doing is simply doing the data import - letting
 % those screwed up lines fail when the SQL inserts run and then going back
 % and hand entering the screwed up data (since I"ll end up with partial
 % records, so I can search for the missing last field).  This is not,
 % however, a very maintainable method.  (I have to re-import things when the
 % data set changes, I get all new files, not just changes.)
 % 	Is there any neat/slick way to get this data in there on the first
 % pass?  I tried using ParseWords, but I'm not sure if I utilized it to its
 % fullest extent.  I briefly played with a CSV driver for DBI, but it
 % couldn't handle things split over the newlines, either.
 % 
 % 	This was awhile ago that I did this in the first place, I'm just
 % picking the project back up off the shelf, so to speak.  Although I had
 % kind of figured I'd have to re-write from scratch, I didn't want to fight
 % the same issues if there was an easy way out of it... any ideas?
 % 
 % Thanks,
 % Eric
 % 
 % _______________________________________________
 % Pdx-pm-list mailing list
 % Pdx-pm-list at mail.pm.org
 % http://mail.pm.org/mailman/listinfo/pdx-pm-list

-----
John Sechrest          .         Helping people use
CTO PEAK -              .           computers and the Internet
Public Electronic         .            more effectively
Access to Knowledge,Inc       .                      
1600 SW Western, Suite 180       .            Internet: sechrest at peak.org
Corvallis Oregon 97333               .                  (541) 754-7325
                                            . http://www.peak.org/~sechrest



More information about the Pdx-pm-list mailing list