[Pdx-pm] reading a broken CSV file

Eric Shore Baur ebaur at aracnet.com
Fri Nov 21 18:35:12 CST 2003


	I'm doing an import from CSV-style text files into a SQL database.
The data is set up so that I have one set of text files with a field
listing in them (so I know which column matches up with which) and then
the data files themselves in a parent directory.
	The data format looks something like this:

"title","some text","a date is next",1999/05/10,T,123,F,F,T,"more text"

	Fine... I can import that.  Unfortunately, some of the records have
embedded newlines in them, so you end up with something like this:

"title","some text","a date is next",1999/05/10,T,123,F,F,T,"more text
goes here
until
the record
is done"

	... or, potentially:

"title","some text goes
over
multiple
lines","a date is next",1999/05/10,T,123,F,F,T,"more text"

	What I've actually been doing, though, is simply running the
import, letting those screwed up lines fail when the SQL inserts run, and
then going back and hand-entering the bad data (since I'll end up with
partial records, I can search for the missing last field).  This is not,
however, a very maintainable method.  (I have to re-import things when the
data set changes; I get all new files, not just the changes.)
	Is there any neat/slick way to get this data in there on the first
pass?  I tried using Text::ParseWords, but I'm not sure I used it to its
fullest extent.  I briefly played with a CSV driver for DBI, but it
couldn't handle records split across newlines, either.
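
	Looking at CPAN again, Text::CSV_XS might be the way out: its
getline() reads from the filehandle itself, and with binary => 1 it is
supposed to allow newlines inside quoted fields.  Again untested against
the real data, and the filename is made up:

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

# binary => 1 lets quoted fields contain embedded newlines; getline()
# keeps reading physical lines until the logical record is complete.
my $csv = Text::CSV_XS->new({ binary => 1 });
open my $fh, '<', 'data.csv' or die "can't open data.csv: $!";
while (my $row = $csv->getline($fh)) {
    my @fields = @$row;    # one complete logical record
    print scalar(@fields), " fields, starting with '$fields[0]'\n";
}
close $fh;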

	It was a while ago that I first did this; I'm just picking the
project back up off the shelf, so to speak.  Although I had kind of
figured I'd have to rewrite it from scratch, I didn't want to fight the
same issues if there's an easy way out of them... any ideas?

Thanks,
Eric



