[Pdx-pm] reading a broken CSV file
Eric Shore Baur
ebaur at aracnet.com
Fri Nov 21 18:35:12 CST 2003
I doing an import from a CSV-style text file into a SQL database.
The data is set up so that I have one set of text files with a field
listing in them (so I know what matches up with what) and then the data
files in a parent directory.
The data format looks something like this:
"title","some text","a date is next",1999/05/10,T,123,F,F,T,"more text"
Fine... I can import that. Unfortunatly, some of the records have
embeded newlines in them, so you end up with something like this:
"title","some text","a date is next",1999/05/10,T,123,F,F,T,"more text
... or, potentially:
"title","some text goes
lines","a date is next",1999/05/10,T,123,F,F,T,"more text"
What I've been doing is simply doing the data import - letting
those screwed up lines fail when the SQL inserts run and then going back
and hand entering the screwed up data (since I"ll end up with partial
records, so I can search for the missing last field). This is not,
however, a very maintainable method. (I have to re-import things when the
data set changes, I get all new files, not just changes.)
Is there any neat/slick way to get this data in there on the first
pass? I tried using ParseWords, but I'm not sure if I utilized it to its
fullest extent. I briefly played with a CSV driver for DBI, but it
couldn't handle things split over the newlines, either.
This was awhile ago that I did this in the first place, I'm just
picking the project back up off the shelf, so to speak. Although I had
kind of figured I'd have to re-write from scratch, I didn't want to fight
the same issues if there was an easy way out of it... any ideas?
More information about the Pdx-pm-list