[Thousand-Oaks-pm] CSV code example

Chuck Hardin chardin at gmail.com
Sat Sep 15 06:26:18 PDT 2012


The following is a code example from TO-PM member Barry Brevik:

Since we did not meet this month, let me throw some code at 'ya.

I frequently have to make one-off utilities to parse customer CSV files.
I try to avoid using modules for really simple things, so I use the
subroutine shown below. I have received some pretty weird formatting,
and this code handles most of them. Keep in mind that the rows in the
__DATA__ section represent actual formatting of files that I have
received.

P.S. notice that the 4th row fails to parse... I have not dealt with it
yet. Anyone with improvements or nasty comments should go ahead and
post!

#
# parseCSV.pl
#
# This is a test wrapper for the parseCSV() subroutine.
#
use strict;
use warnings;

# Un-buffer STDOUT.
select((select(STDOUT), $| = 1)[0]);

while (<DATA>)
{
  print "csvLine before parseCSV: $_\n\n";
  my @csvArray = parseCSV($_);
  print "[$_]\n" foreach (@csvArray);
  print "\n\n";
}

#----------------------------------------------------------
# CALL with a CSV line.
#
# This routine parses a single CSV line and handles ',' chars embedded
# in fields as well as extraneous spaces in between dbl quoted fields.
# It is also resistant to extra dbl quotes within dbl quoted fields,
# but it will remove them.
#
sub parseCSV
{
 my @columns = ();

 if (my $csvline = shift)
 {
   # If the CSV line has any portion with 2 or more sequential commas
','
   # then replace the commas with pipe '|' characters.
   while ($csvline =~ /(,{2,})/)
   {
     my $commas = $1;
     my $pipes  = '|' x length($1);
     $csvline =~ s/^(.*)$commas(.*)/$1$pipes$2/;
   }

   # If there are any commas embedded in the CSV quoted fields, replace
them
   # with pipe '|' characters.
   $csvline =~ s/("[^",]+?),([^",]+?")/$1|$2/g;
   @columns = split ',', $csvline;   # Split the quoted fields at the
remaining commas.
   s/\|/,/g       foreach @columns;  # Replace pipe characters with
commas.
   s/\x22//g      foreach @columns;  # Remove double quotes from each
column.
   s/^\s+|\s+$//g foreach @columns;  # Remove leading and trailing
white space from each column.
 }

 return @columns;
}

__DATA__
"col 1","col 2","col 3","col 4"
"col 1"",""col,,,,,,2"",""col ,,3","col "4""
"col 1","col  '2'",col  '3'  ,"col, 4"
"col 1,a,1",col 2,"col,3,b",col 4,
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 495 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <http://mail.pm.org/pipermail/thousand-oaks-pm/attachments/20120915/f0620383/attachment.bin>


More information about the Thousand-Oaks-pm mailing list