[Thousand-Oaks-pm] CSV code example

David Oswald daoswald at gmail.com
Sat Sep 15 15:54:38 PDT 2012


On Sat, Sep 15, 2012 at 6:26 AM, Chuck Hardin <chardin at gmail.com> wrote:
> The following is a code example from TO-PM member Barry Brevik:
>
> Since we did not meet this month, let me throw some code at 'ya.
>
> I frequently have to make one-off utilities to parse customer CSV files.
> I try to avoid using modules for really simple things, so I use the
> subroutine shown below. I have received some pretty weird formatting,
> and this code handles most of them. Keep in mind that the rows in the
> __DATA__ section represent actual formatting of files that I have
> received.
>
> P.S. notice that the 4th row fails to parse... I have not dealt with it
> yet. Anyone with improvements or nasty comments should go ahead and
> post!

I tend to use modules to solve things when I have a suspicion that
they're deceptive in their simplicity.  I figure that the module has
withstood the pressure of thousands of users, evolved through bug
reports, unit tests, and so on, and emerged from that firey forge a
better solution than something I might come up with.  Nevertheless,
such solutions are not perfect, and the do come at the cost of
additional baggage in your scripts.  Much of the baggage CPAN modules
carry with them aims to solve problems that your specific situation
doesn't have.  Modules can also suffer from creeping featurism,
growing to satisfy the infrequent needs of a vocal minority.

But I still like the fact that the time I spend learning their API can
save me time in the longrun as I re-use that knowledge from project to
project.

Anyway, I went ahead and plopped Barry's parser alongside Tony's
parser, and added a call to Text::CSV's parser, and then displayed the
results for each parse.  You will see that it fails with example #2,
and disagree's with Tony's result for #4.  The CSV sample #2 really is
broken.  I understand the goal is to get it done.  But it's hard to be
certain what exactly the correct output *should* be.  If you remove a
comment from the object-instantiation line where I configure the CSV
parser, it actually does produce a parse for #2, but it's probably not
what you want it to be.  Here's the code:

use strict;
use warnings;
use Text::CSV;

chomp( my @csv_data = <DATA> );

my $csv_parser = Text::CSV->new(
    {
        binary           => 1,
        allow_whitespace => 1,

        #   allow_loose_escapes => 1,
    }
) or die "Cannot use CSV:" . Text::CSV->error_diag();

foreach my $csv_line (@csv_data) {
    print "csvLine before parseCSV: $csv_line\n";
    local $" = '],[';
    my @barry_parsed = parseCSV_barry($csv_line);
    print "Barry's parser:     [@barry_parsed]\n";
    my @tony_parsed = parseCSV_tony($csv_line);
    print "Tony's Parser:      [@tony_parsed]\n";
    my @tcsv_parsed = parseCSV_tcsv( $csv_parser, $csv_line );
    print "Text::CSV's Parser: [@tcsv_parsed]\n\n";
}

sub parseCSV_tony {
    my @line = split /","|",|,"/, shift;
    my @fields;
    foreach my $l (@line) {
        $l =~ s/"//g;
        $l =~ s/^\s+|\s+$//;
        push @fields, $l // q{};
    }
    return @fields;
}

sub parseCSV_tcsv {
    my ( $parser, $line ) = @_;
    $parser->parse($line) or do {
        warn '*** Malformed CSV: <<'
          . $parser->error_input
          . ">>:\n*** "
          . $parser->error_diag;
        return;
    };
    return $parser->fields;
}

sub parseCSV_barry {
    my @columns = ();
    if ( my $csvline = shift ) {

        # If the CSV line has any portion with 2 or more sequential commas ','
        # then replace the commas with pipe '|' characters.
        while ( $csvline =~ /(,{2,})/ ) {
            my $commas = $1;
            my $pipes  = '|' x length($1);
            $csvline =~ s/^(.*)$commas(.*)/$1$pipes$2/;
        }

       # If there are any commas embedded in the CSV quoted fields, replace them
       # with pipe '|' characters.
        $csvline =~ s/("[^",]+?),([^",]+?")/$1|$2/g;
        @columns = split ',',
          $csvline;    # Split the quoted fields at the remaining commas.
        s/\|/,/g  foreach @columns;    # Replace pipe characters with commas.
        s/\x22//g foreach @columns;    # Remove double quotes from each column.
        s/^\s+|\s+$//g
          foreach
          @columns;  # Remove leading and trailing white space from each column.
    }
    return @columns;
}

__DATA__
"col 1","col 2","col 3","col 4"
"col 1"",""col,,,,,,2"",""col ,,3","col "4""
"col 1","col  '2'",col  '3'  ,"col, 4"
"col 1,a,1",col 2,"col,3,b",col 4,

****** And the output *******

$ ./mytest.pl
csvLine before parseCSV: "col 1","col 2","col 3","col 4"
Barry's parser:     [col 1],[col 2],[col 3],[col 4]
Tony's Parser:      [col 1],[col 2],[col 3],[col 4]
Text::CSV's Parser: [col 1],[col 2],[col 3],[col 4]

csvLine before parseCSV: "col 1"",""col,,,,,,2"",""col ,,3","col "4""
Barry's parser:     [col 1],[col,,,,,,2],[col ,,3],[col 4]
Tony's Parser:      [col 1],[col,,,,,,2],[col ,,3],[col 4]
*** Malformed CSV: <<"col 1"",""col,,,,,,2"",""col ,,3","col "4"">>:
*** EIQ - QUO character not allowed at ./mytest.pl line 43, <DATA> line 4.
Text::CSV's Parser: []

csvLine before parseCSV: "col 1","col  '2'",col  '3'  ,"col, 4"
Barry's parser:     [col 1],[col  '2'],[col  '3'],[col, 4]
Tony's Parser:      [col 1],[col  '2'],[col  '3'],[col, 4]
Text::CSV's Parser: [col 1],[col  '2'],[col  '3'],[col, 4]

csvLine before parseCSV: "col 1,a,1",col 2,"col,3,b",col 4,
Barry's parser:     [col 1],[a],[1],[col 2],[col],[3],[b],[col 4]
Tony's Parser:      [col 1,a,1],[col 2],[col,3,b],[col 4,]
Text::CSV's Parser: [col 1,a,1],[col 2],[col,3,b],[col 4],[]

****** And the output from Devel::TraceUse (to see how expensive it
was to pull in Text::CSV ) ********

$ perl -d:TraceUse ./mytest.pl

......

Modules used from ./mytest.pl:
   1.  strict 1.07, mytest.pl line 3 [main]
   2.  warnings 1.13, mytest.pl line 4 [main]
   3.  Text::CSV 1.21, mytest.pl line 5 [main]
   4.    Carp 1.26, Text/CSV.pm line 5
   5.      Exporter 5.66, Carp.pm line 35
   6.    vars 1.02, Text/CSV.pm line 6
   7.      warnings::register 1.02, vars.pm line 7
   8.    Text::CSV_XS 0.91, Text/CSV.pm line 150 (eval 1)
   9.      DynaLoader 1.14, Text/CSV_XS.pm line 26
  10.        Config, DynaLoader.pm line 22


You can see that, in my case, Text::CSV is using the Text::CSV_XS
back-end.  That only happens if you've explicitly installed
Text::CSV_XS on your system.  Otherwise, it uses its own built-in
Text::CSV_PP back-end.  If we don't count pragmas, and don't count the
XS plugin (which is optional), by using Text::CSV we're essentially
asking Perl to include Text::CSV, Carp (which is CORE), and Exporter
(which is CORE).  So the only non-core dependency for Text::CSV is the
module itself.

The current version of Text::CSV has a 99.7% PASS rate with CPAN
testers (having passed 2445 of 2451 smokers), and the six non-passes
were all "UNKNOWN" (as opposed to FAIL).

One observation I wanted to make: In a situation where the CSV is
broken enough to not parse with Text::CSV, it might be better for the
script to make an easy to find annotation in your output file, and
issue a warning on-screen so that a human can come back later and find
the problem.  It is possible that in tweaking a solution to
automatically fix one broken construct, some other construct will fail
as a result.  At some point calling attention loudly to the problem
might be better than trying to solve it programatically.  That's just
one take on the issue, and obviously I don't know your specific use
and need.

Thanks for posting!!!

Dave
-- 

David Oswald
daoswald at gmail.com


More information about the Thousand-Oaks-pm mailing list