[Pdx-pm] reading a broken CSV file
sechrest at peak.org
Fri Nov 21 18:00:04 CST 2003
Why not do a text substitution? Do you have any indicator of what
an end of field looks like?
Can you say that a record only ends when you have a " on the end of a line?
Or do you have to count the records?
Sounds like a pre-parser to force things into the right form is a good
place to start.
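A minimal sketch of such a pre-parser, assuming the only damage is newlines embedded inside double-quoted fields. It is written in Python purely for illustration (the same loop ports directly to Perl), and the names join_broken_records and sample are hypothetical, not from the original post. The idea: keep gluing physical lines together until the number of double quotes seen is even, since an odd count means a quoted field is still open. Embedded newlines are replaced with a single space here; that is a choice, not a requirement.

```python
import csv


def join_broken_records(lines):
    """Yield logical records, merging physical lines until quotes balance."""
    pending = ""
    for line in lines:
        line = line.rstrip("\r\n")
        # Glue continuation lines on with a space in place of the newline.
        pending = f"{pending} {line}" if pending else line
        # An odd number of double quotes means a quoted field is still open.
        if pending.count('"') % 2 == 0:
            yield pending
            pending = ""
    if pending:
        # Input ended inside a quoted field; hand it back for inspection.
        yield pending


# Sample data taken from the two broken layouts quoted in the question.
sample = [
    '"title","some text","a date is next",1999/05/10,T,123,F,F,T,"more text\n',
    'goes here\n',
    'the record\n',
    'is done"\n',
    '"title","some text goes\n',
    'lines","a date is next",1999/05/10,T,123,F,F,T,"more text"\n',
]

records = list(join_broken_records(sample))
rows = list(csv.reader(records))

for row in rows:
    print(len(row), row)
```

Once each record is back on one line, any ordinary CSV parser should be able to split it. Counting fields per record afterwards (ten in the sample) is a cheap sanity check before running the SQL inserts. Quote parity would be fooled by a stray unescaped quote inside a field, so this is a starting point rather than a guarantee.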
Eric Shore Baur <ebaur at aracnet.com> writes:
% I'm doing an import from a CSV-style text file into a SQL database.
% The data is set up so that I have one set of text files with a field
% listing in them (so I know what matches up with what) and then the data
% files in a parent directory.
% The data format looks something like this:
% "title","some text","a date is next",1999/05/10,T,123,F,F,T,"more text"
% Fine... I can import that. Unfortunately, some of the records have
% embedded newlines in them, so you end up with something like this:
% "title","some text","a date is next",1999/05/10,T,123,F,F,T,"more text
% goes here
% the record
% is done"
% ... or, potentially:
% "title","some text goes
% lines","a date is next",1999/05/10,T,123,F,F,T,"more text"
% What I've been doing is simply doing the data import - letting
% those screwed up lines fail when the SQL inserts run and then going back
% and hand entering the screwed up data (since I'll end up with partial
% records, so I can search for the missing last field). This is not,
% however, a very maintainable method. (I have to re-import things when the
% data set changes, I get all new files, not just changes.)
% Is there any neat/slick way to get this data in there on the first
% pass? I tried using ParseWords, but I'm not sure if I utilized it to its
% fullest extent. I briefly played with a CSV driver for DBI, but it
% couldn't handle things split over the newlines, either.
% This was a while ago that I did this in the first place, I'm just
% picking the project back up off the shelf, so to speak. Although I had
% kind of figured I'd have to re-write from scratch, I didn't want to fight
% the same issues if there was an easy way out of it... any ideas?
John Sechrest
CTO, PEAK - Public Electronic Access to Knowledge, Inc.
1600 SW Western, Suite 180
Corvallis, Oregon 97333
Internet: sechrest at peak.org
(541) 754-7325
Helping people use computers and the Internet more effectively