[Chicago-talk] Removing Characters
tiger peng
tigerpeng2001 at yahoo.com
Wed Oct 24 20:22:22 PDT 2007
Thanks everyone. Here is the one-liner I am looking for!
perl -ne 's/(?<!,)(?:(^")|")(?!,|$)/defined($1)?$1:""/ge?print:print' in.csv > out.csv
(-; I have not used my real first name 'Ge' in a regex for a while ;-)
I believe the character to remove and the delimiter can each be generalized to a character class, such as ["'] and [,:|].
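A minimal sketch of that generalization, assuming the quote and delimiter sets are passed in as plain strings. The helper name strip_embedded_quotes and its defaults are hypothetical, not something from this thread; quotemeta is used so that characters such as | stay literal inside the character classes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove stray quote characters
# from one line. A quote is kept when it starts the line, directly follows
# a delimiter, directly precedes a delimiter, or ends the line.
sub strip_embedded_quotes {
    my ($line, $quotes, $delims) = @_;
    $quotes = q{"'}  unless defined $quotes;   # assumed default quote set
    $delims = q{,:|} unless defined $delims;   # assumed default delimiter set

    # Build one-character classes; quotemeta keeps every character literal.
    my $q = '[' . quotemeta($quotes) . ']';
    my $d = '[' . quotemeta($delims) . ']';

    # Same shape as the one-liner: group 1 captures a quote at the start
    # of the line so the replacement expression can put it back.
    $line =~ s/(?<!$d)(?:(^$q)|$q)(?!$d|$)/defined($1) ? $1 : ''/ge;
    return $line;
}

print strip_embedded_quotes(qq{"a"b",c\n});
print strip_embedded_quotes(qq{'it's'|x\n}, q{'}, q{|});
```

As in the one-liner, quotes are not paired up, so a field opened with ' and closed with " is not detected as a mismatch.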
Another question: how can I use a one-liner to split a web log file?
----- Original Message ----
From: Brian Katzung <briank at kappacs.com>
To: Chicago.pm chatter <chicago-talk at pm.org>
Sent: Wednesday, October 24, 2007 8:33:11 PM
Subject: Re: [Chicago-talk] Removing Characters
tiger peng wrote:
> You are right if it is a well-formatted CSV file.
> But I don't know if this is guaranteed. There is no documentation
> about it. I am just rewriting an old, uncommented, undocumented
> working script.
>
> The segment was rewritten only to make it a little easier to read, by
> replacing the embedded 'DEL' characters with a variable populated by
> the chr function.
>
> Below is the best I can do. It looks better to me and runs about
> twice as fast as the old one. I still cannot make a one-liner out of
> it. Can anyone get rid of the first line?
>
> # save the leading quote if it is there
> my $leadingQ=""; $leadingQ='"' if /^"/;
> # remove every double quote that is not preceded by a comma and not
> # followed by a comma or the end of the line
> s/(?<!,)"(?!(,|$))//g;
> print OUTF $leadingQ, $_;
OK. Here's my one-liner version (well, two including the print) of the
above:
s/(?<!,)(?:(^")|")(?!,|$)/defined($1)?$1:''/eg;
print OUTF $_;
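A small sketch comparing the two versions on a couple of lines. The sample data and the sub names two_step and one_step are invented for illustration, and output goes to STDOUT rather than OUTF. It shows how the (^") capture takes over the job of the $leadingQ variable.

```perl
use strict;
use warnings;

# Two-statement version quoted above: save the leading quote, strip the
# stray quotes (which also strips the leading one), then re-attach it.
sub two_step {
    my ($line) = @_;
    my $leadingQ = $line =~ /^"/ ? '"' : '';
    $line =~ s/(?<!,)"(?!(,|$))//g;
    return $leadingQ . $line;
}

# Single-substitution version: group 1 is defined only when the matched
# quote sits at the start of the line, and then it is put back unchanged.
sub one_step {
    my ($line) = @_;
    $line =~ s/(?<!,)(?:(^")|")(?!,|$)/defined($1)?$1:''/eg;
    return $line;
}

# Sample lines invented for illustration.
my @samples = (
    qq{"abc","de"f",ghi\n},
    qq{plain,"quo"ted",7\n},
);
for my $line (@samples) {
    my ($old, $new) = (two_step($line), one_step($line));
    print $new;
    print "versions differ on: $line" if $old ne $new;
}
```

On these samples both subs return the same text; the stray quote inside the second field is dropped and the field's outer quotes are kept.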
> ----- Original Message ----
> From: Steven Lembark <lembark at wrkhors.com>
> To: Chicago.pm chatter <chicago-talk at pm.org>
> Sent: Wednesday, October 24, 2007 2:14:48 PM
> Subject: Re: [Chicago-talk] Removing Characters
>
> > There must be a better way to remove double quotes from a CSV file
> > whose fields are optionally quoted with double quotes.
> > What I did below is ugly and not reliable. Could anyone provide one
> > beautiful line?
> >
> > $delimiter=chr(0227);
> > s/^"/$delimiter/g;
> > s/,"/,$delimiter/g;
> > s/"$/$delimiter/g;
> > s/",/$delimiter,/g;
> > s/"//g;
> > s/$delimiter/"/g;
>
> You don't seem to want all of the quotes removed,
> only the embedded ones. If the data is well-formatted
> then the operation above will leave you with a bunch
> of naked backslashes in the text:
>
> "this is a \"double quoted\" text line"
>
> becomes
>
> "this is a \double quoted\ text line"
>
> and you probably don't want the \d or \ in your
> result.
>
> If the real problem is that fate has handed you some
> CSV data with embedded, un-escaped quotes then your
> approach makes the most sense, but you'll have to
> remove escaped quotes also:
>
> s{ \\" }{}gx;
>
> will strip the \" char's. You might prefer to replace
> them with non-delimiting quotes, e.g.,
>
>
> s{ \\" }{'}gx;
>
> All of the CSV parsing modules assume "clean" CSV
> source (oxymoron?) so if you need to clean up botched
> data then some iterative approach is likely to be
> what you need.
>
> enjoi
>
> --
> Steven Lembark 85-09 90th Street
Here's a "smarter" but longer obfuscation:
s{((?:^|,)")((?:[^"]|\\"|"")*?)(")(?=,|$)|([^,]*)}{
(sub { my @a = @_; $a[1] =~ s/\\"|"//g;
join('', grep(defined, @a)); })->
($1, defined($2)? $2: $4, $3)}eg;
print;
You're certainly at liberty to remove newlines from the code if the
line count is that important to you. :-)
This one considers a field to be quoted only if it has (anchored)
quotes at each end and internal quotes are \-escaped or doubled. All
other \" and " that are not anchored in a quoted field are removed.
I haven't attempted any timings.
- Brian
--
Brian Katzung, Kappa Computer Solutions, LLC
Leveraging UNIX, Linux, open source, and custom
software solutions for business and beyond
Phone: 877.367.8837 x1 http://www.kappacs.com