[Chicago-talk] Removing Characters

Brian Katzung briank at kappacs.com
Wed Oct 24 18:33:11 PDT 2007


tiger peng wrote:
> You are right if it is a well-formatted CSV file.
> But I don't know if this is guaranteed. There is no documentation about 
> it. I am just rewriting a very old, uncommented, undocumented working 
> script.
> 
> The segment was rewritten only to make it a little easier to read, by 
> replacing the embedded 'DEL' characters with a variable populated by 
> the chr function.
> 
> Below is the best I can do. It looks better to me and runs about twice 
> as fast as the old one. I still cannot turn it into a one-liner. 
> Can anyone get rid of the first line?
> 
>   # save the leading quote if it is there
>   my $leadingQ=""; $leadingQ='"' if /^"/;
>   # remove all double quotes not next to a comma or at the end of the line
>   s/(?<!,)"(?!(,|$))//g;
>   print OUTF $leadingQ, $_;

OK. Here's my one-liner version (well, two including the print) of the 
above:

s/(?<!,)(?:(^")|")(?!,|$)/defined($1)?$1:''/eg;
print OUTF $_;
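
For anyone who wants to try it without the original data, here is a minimal sketch that wraps the same substitution in a sub and runs it on two made-up lines. The sub name strip_embedded_quotes and the sample lines are mine, purely for illustration; they are not from the original script.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution above; works on a copy.
sub strip_embedded_quotes {
    my ($line) = @_;
    # Keep a quote at the very start of the line (captured in $1);
    # drop any other quote that is neither preceded by a comma nor
    # followed by a comma or the end of the line.
    $line =~ s/(?<!,)(?:(^")|")(?!,|$)/defined($1)?$1:''/eg;
    return $line;
}

# Made-up sample lines, not real data.
for my $line ('"a "b" c",d', 'x"y,"z"') {
    print strip_embedded_quotes($line), "\n";
}
```

On those two lines it should print "a b c",d and xy,"z". The leading quote survives because the first alternative captures it and the replacement puts it back.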

> ----- Original Message ----
> From: Steven Lembark <lembark at wrkhors.com>
> To: Chicago.pm chatter <chicago-talk at pm.org>
> Sent: Wednesday, October 24, 2007 2:14:48 PM
> Subject: Re: [Chicago-talk] Removing Characters
> 
>  > There must be better way for removing the double quote in a CSV file
>  > optionally quoted by double quote.
>  > What I did as below is ugly and not reliable. Could anyone provide one
>  > beautify line?
>  >
>  >  $delimiter=chr(0227);
>  >  s/^"/$delimiter/g;
>  >  s/,"/,$delimiter/g;
>  >  s/"$/$delimiter/g;
>  >  s/",/$delimiter,/g;
>  >  s/"//g;
>  >  s/$delimiter/"/g;
> 
> You don't seem to want all of the quotes removed,
> only the embedded ones. If the data is well-formatted
> then the operation above will leave you with a bunch
> of naked backslashes in the text:
> 
>   "this is a \"double quoted\" text line"
> 
> becomes
> 
>   "this is a \double quoted\ text line"
> 
> and you probably don't want the \d or \ in your
> result.
> 
> If the real problem is that fate has handed you some
> CSV data with embedded, un-escaped quotes then your
> approach makes the most sense, but you'll have to
> remove escaped quotes also:
> 
>   s{ \\" }{}gx;
> 
> will strip the \" char's. You might prefer to replace
> them with non-delimiting quotes, e.g.,
> 
> 
>   s{ \\" }{'}gx;
> 
> All of the CSV parsing modules assume "clean" CSV
> source (oxymoron?) so if you need to clean up botched
> data then some iterative approach is likely to be
> what you need.
> 
> enjoi
> 
> -- 
> Steven Lembark                                        85-09 90th Street

Here's a "smarter" but longer obfuscation:

s{((?:^|,)")((?:[^"]|\\"|"")*?)(")(?=,|$)|([^,]*)}{
  (sub { my @a = @_; $a[1] =~ s/\\"|"//g;
  join('', grep(defined, @a)); })->
  ($1, defined($2)? $2: $4, $3)}eg;
print;

You're certainly at liberty to remove newlines from the code if the line 
count is that important to you. :-)

This one considers a field to be quoted only if it has (anchored) quotes 
at each end and any internal quotes are \-escaped or doubled. Every other 
\" or ", that is, any quote other than the two anchoring a quoted field, 
is removed.
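
A quick way to see that behavior is to put the same substitution in a sub and feed it a couple of lines. This is only a sketch: the sub name clean_fields and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the longer substitution; works on a copy.
sub clean_fields {
    my ($line) = @_;
    $line =~ s{((?:^|,)")((?:[^"]|\\"|"")*?)(")(?=,|$)|([^,]*)}{
      (sub { my @a = @_; $a[1] =~ s/\\"|"//g;   # strip quotes inside the field
      join('', grep(defined, @a)); })->         # reassemble, skipping undef parts
      ($1, defined($2)? $2: $4, $3)}eg;
    return $line;
}

# Made-up sample lines: a quoted field with \" inside, a stray quote
# in an unquoted field, and a properly quoted second field.
for my $line ('"say \"hi\"",b"c', 'a"b,"c"') {
    print clean_fields($line), "\n";
}
```

The first line should come out as "say hi",bc and the second as ab,"c". The anchoring quotes are kept and everything else quote-like is dropped.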

I haven't attempted any timings.

   - Brian

-- 
Brian Katzung, Kappa Computer Solutions, LLC
Leveraging UNIX, Linux, open source, and custom
software solutions for business and beyond
Phone: 877.367.8837 x1  http://www.kappacs.com


