<html><head><style type="text/css"><!-- DIV {margin:0px;} --></style></head><body><div style="font-family:arial,helvetica,sans-serif;font-size:10pt"><div style="font-family: arial,helvetica,sans-serif; font-size: 10pt;">Thanks, everyone. Here is the one-liner I was looking for!<br><span style="font-family: courier,monaco,monospace,sans-serif;">perl -ne 's/(?&lt;!,)(?:(^")|")(?!,|$)/defined($1)?$1:""/ge?print:print' in.csv > out.csv</span><br>(-; I have not used my real first name 'Ge' for regex for a while ;-)<br>I believe the character to replace and the delimiter can each be a set of characters, such as<span style="font-family: courier,monaco,monospace,sans-serif;"> ["'] [,:|].<br><br>Another question: how can I use a one-liner to split a web log file?<br></span><br><div style="font-family: times new roman,new york,times,serif; font-size: 12pt;">----- Original Message ----<br>From: Brian Katzung &lt;briank@kappacs.com&gt;<br>To: Chicago.pm chatter
&lt;chicago-talk@pm.org&gt;<br>Sent: Wednesday, October 24, 2007 8:33:11 PM<br>Subject: Re: [Chicago-talk] Removing Characters<br><br>tiger peng wrote:<br>> You are right if it is a well-formatted CSV file.<br>> But I don't know if this is guaranteed. There is no documentation about
it. I am <br>> just rewriting an old, uncommented, undocumented working script.<br>> <br>> The segment was just rewritten to make it a little bit easier to read by
<br>> replacing the embedded 'DEL' characters with a variable populated by the <br>> chr function.<br>> <br>> Below is the best I can do. It looks better to me and can run twice <br>> as fast as the old one. I still cannot make a one-liner
<br>> for it. Can anyone get rid of the first line?<br>> <br>> my $leadingQ=""; $leadingQ='"' if /^"/; # save the leading quote if it is there<br>> s/(?&lt;!,)"(?!(,|$))//g; # remove every double quote that is neither preceded by a comma nor followed by a comma or the end of the line<br>> print OUTF $leadingQ, $_;<br><br>OK. Here's my one-liner version (well, two lines including the print) of the <br>above:<br><br>s/(?&lt;!,)(?:(^")|")(?!,|$)/defined($1)?$1:''/eg;<br>print OUTF $_;<br><br>> ----- Original Message ----<br>> From: Steven Lembark &lt;<a ymailto="mailto:lembark@wrkhors.com" href="mailto:lembark@wrkhors.com">lembark@wrkhors.com</a>&gt;<br>> To: Chicago.pm chatter &lt;<a ymailto="mailto:chicago-talk@pm.org" href="mailto:chicago-talk@pm.org">chicago-talk@pm.org</a>&gt;<br>> Sent: Wednesday, October 24, 2007 2:14:48 PM<br>> Subject: Re: [Chicago-talk] Removing Characters<br>> <br>> > There must be a better way to remove the double quotes in a CSV
file<br>> > whose fields are optionally enclosed in double quotes.<br>> > What I did below is ugly and unreliable. Could anyone provide
one<br>> > beautiful line?<br>> ><br>> > $delimiter=chr(0227);<br>> > s/^"/$delimiter/g;<br>> > s/,"/,$delimiter/g;<br>> > s/"$/$delimiter/g;<br>> > s/",/$delimiter,/g;<br>> > s/"//g;<br>> > s/$delimiter/"/g;<br>> <br>> You don't seem to want all of the quotes removed,<br>> only the embedded ones. If the data is well-formatted<br>> then the operation above will leave you with a bunch<br>> of naked backslashes in the text:<br>> <br>> "this is a \"double quoted\" text line"<br>> <br>> becomes<br>> <br>> "this is a \double quoted\ text line"<br>> <br>> and you probably don't want the \d or \ in your<br>> result.<br>> <br>> If the real problem is that fate has handed you some<br>> CSV data with embedded, un-escaped quotes then your<br>>
approach makes the most sense, but you'll have to<br>> remove escaped quotes also:<br>> <br>> s{ \\" }{}gx;<br>> <br>> will strip the \" characters. You might prefer to replace<br>> them with non-delimiting quotes, e.g.,<br>> <br>> s{ \\" }{'}gx;<br>> <br>> All of the CSV parsing modules assume "clean" CSV<br>> source (oxymoron?) so if you need to clean up botched<br>> data then some iterative approach is likely to be<br>> what you need.<br>> <br>> enjoi<br>> <br>> -- <br>> Steven Lembark 85-09 90th
Street<br><br>Here's a "smarter" but longer obfuscation:<br><br>s{((?:^|,)")((?:[^"]|\\"|"")*?)(")(?=,|$)|([^,]*)}{<br> (sub { my @a = @_; $a[1] =~ s/\\"|"//g;<br> join('', grep(defined, @a)); })-><br> ($1, defined($2)? $2: $4, $3)}eg;<br>print;<br><br>You're certainly at liberty to remove newlines from the code if the
line <br>count is that important to you. :-)<br><br>This one considers a field to be quoted only if it has (anchored)
quotes <br>at each end and internal quotes are \-escaped or doubled. All other \" <br>and " that are not anchored in a quoted field are removed.<br><br>I haven't attempted any timings.<br><br> - Brian<br><br>-- <br>Brian Katzung, Kappa Computer Solutions, LLC<br>Leveraging UNIX, Linux, open source, and custom<br>software solutions for business and beyond<br>Phone: 877.367.8837 x1 <a href="http://www.kappacs.com" target="_blank">http://www.kappacs.com</a><br></div><br></div></div></body></html>