[Chicago-talk] Removing Characters

tiger peng tigerpeng2001 at yahoo.com
Wed Oct 24 20:22:22 PDT 2007

Thanks, everyone. Here is the one-liner I was looking for!
perl -ne 's/(?<!,)(?:(^")|")(?!,|$)/defined($1)?$1:""/ge?print:print' in.csv > out.csv
(-; I have not used my real first name 'Ge' for regex for a while ;-)
I believe the character being removed and the delimiter can each be a set of characters, such as ["'] and [,:|].
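
As a rough sketch of that generalization (not from the thread), the quote and the delimiter can be passed in as sets of characters and turned into character classes. The sub name strip_stray_quotes and its parameters are made up for illustration. The regex follows the same idea as the one-liner above, but uses a second lookbehind instead of the captured leading quote.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every quote character that is NOT
#   - at the start of the line,
#   - right after a delimiter,
#   - right before a delimiter, or
#   - at the end of the line.
sub strip_stray_quotes {
    my ($line, $quotes, $delims) = @_;
    $quotes = q{"} unless defined $quotes;   # set of quote characters
    $delims = q{,} unless defined $delims;   # set of delimiter characters

    # Build character classes; quotemeta keeps characters like | literal.
    my $q = '[' . quotemeta($quotes) . ']';
    my $d = '[' . quotemeta($delims) . ']';

    # The lookbehinds protect a quote at line start or after a delimiter;
    # the lookahead protects a quote before a delimiter or at line end.
    my $re = qr/(?<!^)(?<!${d})${q}(?!${d}|\n|\z)/;

    $line =~ s/$re//g;
    return $line;
}

print strip_stray_quotes(q{"a "b" c","d"}), "\n";               # "a b c","d"
print strip_stray_quotes(q{'x 'y' z'|"w"}, q{"'}, q{,:|}), "\n"; # 'x y z'|"w"
```

With the defaults it should behave like the one-liner for ordinary comma-separated lines; the second call shows both sets widened.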

Another question: how can I use a one-liner to split a web log file?

----- Original Message ----
From: Brian Katzung <briank at kappacs.com>
To: Chicago.pm chatter <chicago-talk at pm.org>
Sent: Wednesday, October 24, 2007 8:33:11 PM
Subject: Re: [Chicago-talk] Removing Characters

tiger peng wrote:
> You are right if it is a well-formatted CSV file.
> But I don't know if this is guaranteed. There is no documentation
> about it. I am just rewriting an old, uncommented, undocumented
> working script.
> The segment was rewritten only to make it a little bit easier to read,
> by replacing the embedded 'DEL' characters with a variable populated
> by the chr function.
> Below is the best I can do. It looks better to me and can run twice
> as fast as the old one. I still cannot make a one-liner
> out of it. Can anyone get rid of the first line?
>   # save the leading quote if it is there
>   my $leadingQ=""; $leadingQ='"' if /^"/;
>   # remove every double quote that is not next to a comma or at the
>   # end of the line
>   s/(?<!,)"(?!(,|$))//g;
>   print OUTF $leadingQ, $_;

OK. Here's my one-liner version (well, two including the print) of the 

print OUTF $_;

> ----- Original Message ----
> From: Steven Lembark <lembark at wrkhors.com>
> To: Chicago.pm chatter <chicago-talk at pm.org>
> Sent: Wednesday, October 24, 2007 2:14:48 PM
> Subject: Re: [Chicago-talk] Removing Characters
>  > There must be a better way to remove the double quotes in a CSV
>  > file whose fields are optionally quoted with double quotes.
>  > What I did below is ugly and not reliable. Could anyone provide
>  > a more beautiful line?
>  >
>  >  $delimiter=chr(0227);
>  >  s/^"/$delimiter/g;
>  >  s/,"/,$delimiter/g;
>  >  s/"$/$delimiter/g;
>  >  s/",/$delimiter,/g;
>  >  s/"//g;
>  >  s/$delimiter/"/g;
> You don't seem to want all of the quotes removed,
> only the embedded ones. If the data is well-formatted
> then the operation above will leave you with a bunch
> of naked backslashes in the text:
>   "this is a \"double quoted\" text line"
> becomes
>   "this is a \double quoted\ text line"
> and you probably don't want the \d or \ in your
> result.
> If the real problem is that fate has handed you some
> CSV data with embedded, un-escaped quotes then your
> approach makes the most sense, but you'll have to
> remove escaped quotes also:
>   s{ \\" }{}gx;
> will strip the \" characters. You might prefer to replace
> them with non-delimiting quotes, e.g.,
>   s{ \\" }{'}gx;
> All of the CSV parsing modules assume "clean" CSV
> source (oxymoron?) so if you need to clean up botched
> data then some iterative approach is likely to be
> what you need.
> enjoi
> -- 
> Steven Lembark                                        85-09 90th
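
To make the point in the quoted message concrete, here is a small sketch that wraps the original placeholder substitutions in a sub (the name placeholder_strip is made up) and runs them on the example line, with and without stripping the escaped quotes first.

```perl
use strict;
use warnings;

# The six substitutions from the original post, wrapped in a
# hypothetical sub so they can be applied to one string.
sub placeholder_strip {
    my ($line) = @_;
    my $delimiter = chr(0227);      # placeholder character, as in the post
    for ($line) {
        s/^"/$delimiter/g;          # protect a leading quote
        s/,"/,$delimiter/g;         # protect a quote after a comma
        s/"$/$delimiter/g;          # protect a trailing quote
        s/",/$delimiter,/g;         # protect a quote before a comma
        s/"//g;                     # drop every remaining quote
        s/$delimiter/"/g;           # restore the protected quotes
    }
    return $line;
}

my $in = q{"this is a \"double quoted\" text line"};

# Escaped quotes lose the quote but keep the backslash.
print placeholder_strip($in), "\n";
# "this is a \double quoted\ text line"

# Stripping the \" pairs first avoids the stray backslashes.
(my $stripped = $in) =~ s{ \\" }{}gx;
print placeholder_strip($stripped), "\n";
# "this is a double quoted text line"
```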

Here's a "smarter" but longer obfuscation:

  (sub { my @a = @_; $a[1] =~ s/\\"|"//g;
  join('', grep(defined, @a)); })->
  ($1, defined($2)? $2: $4, $3)}eg;

You're certainly at liberty to remove newlines from the code if the
count is that important to you. :-)

This one considers a field to be quoted only if it has a " (anchored)
at each end and internal quotes are \-escaped or doubled. All other \" 
and " that are not anchored in a quoted field are removed.
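
The rule just described can be sketched with a \G loop over the fields instead of a single substitution. This is not the code from this message: the sub name clean_fields is made up, the line is assumed to be chomped, and stripping the escaped or doubled quotes inside a recognized quoted field is an assumption about the intended behavior.

```perl
use strict;
use warnings;

# Hypothetical sketch: walk a chomped line field by field.
# A field counts as quoted only if it starts with " right after the
# previous comma (or line start), ends with " right before the next
# comma (or line end), and every internal quote is \-escaped or doubled.
sub clean_fields {
    my ($line) = @_;
    my @out;
    pos($line) = 0;
    while (1) {
        if ($line =~ /\G"((?:\\.|""|[^"\\])*)"(?=,|\z)/gc) {
            # Properly quoted field: keep the anchoring quotes,
            # drop the escaped and doubled quotes inside (assumption).
            (my $inner = $1) =~ s/\\"|"//g;
            push @out, qq{"$inner"};
        }
        else {
            # Anything else: take text up to the next comma and
            # remove every \" and " from it.
            $line =~ /\G([^,]*)/gc or last;
            (my $field = $1) =~ s/\\"|"//g;
            push @out, $field;
        }
        last unless $line =~ /\G,/gc;   # consume the separator or stop
    }
    return join ',', @out;
}

print clean_fields(q{"ok ""x"" \"y\"",a"b,"c"d}), "\n";   # "ok x y",ab,cd
```

The /c flag keeps the match position after a failed attempt, which is what lets the else branch retry from the same spot.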

I haven't attempted any timings.

   - Brian

Brian Katzung, Kappa Computer Solutions, LLC
Leveraging UNIX, Linux, open source, and custom
software solutions for business and beyond
Phone: 877.367.8837 x1  http://www.kappacs.com
