Handling commas in string values in a CSV file [duplicate]
This question already has an answer here:
Is there a robust command line tool for processing CSV files? (17 answers)

I have a comma-separated file with numeric and string columns. The string columns are quoted and may contain commas between the quotes. How do I identify the columns with FS=","?
Sample record:
"prabhat,kumar",19,2000,"bangalore,India"
In awk the fields should be:
$1 = "prabhat,kumar"
$2 = 19
$3 = 2000
$4 = "bangalore,India"

Setting FS="," is what creates the problem.
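To see the breakage concretely, here is a minimal demo (any POSIX awk will do): with FS="," the quoted fields are split at their embedded commas, so the sample record yields six fields instead of four.

```shell
# Naive FS="," splitting breaks quoted fields: the sample record has
# 4 logical fields, but awk sees 6 because the commas inside the
# quotes also act as separators.
printf '%s\n' '"prabhat,kumar",19,2000,"bangalore,India"' |
awk -F, '{print NF; print $1; print $2}'
# prints:
# 6
# "prabhat
# kumar"
```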
Input is:
"prabhat,kumar",19,2000,"bangalore,India","ABC,DEF","GHI",123,"KLM","NOP,QRS"
"prabhat,kumar",19,2000,"bangalore,India","ABC,DEF","GHI",123,"KLM","NOP,QRS"
Output should be:
"prabhat,kumar"|19|2000|"bangalore,India"|"ABC,DEF"|"GHI"|123|"KLM"|"NOP,QRS"
"prabhat,kumar"|19|2000|"bangalore,India"|"ABC,DEF"|"GHI"|123|"KLM"|"NOP,QRS"
Code I am trying:
awk -F"," '{for(i=1;i<=NF;i++){if(i%NF==0){ORS="\n"} {if($i ~ /^"/ || $i ~ /"$/) {a=a OFS $i;j++;{if(j%2==0){sub(/^,/,X,a); print a;j=0;a=""}}} else {print $i}}} {ORS="|"}}' ORS="|" OFS=, p.txt
text-processing awk csv
marked as duplicate by Gilles, Michael Homer, jasonwryan, Ramesh, cuonglm Aug 12 '14 at 3:29
This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.
And that is why you should use a proper library that parses CSV (and XML and HTML etc.) rather than trying to write your own. Take a look at CSVfix.
– HBruijn, Aug 11 '14 at 12:04
asked Aug 11 '14 at 11:53 by prabhat diwaker; edited Aug 17 '14 at 12:18 by Peter Mortensen
6 Answers
First of all, you should use a proper CSV parser. For example, in Perl, you can use Text::CSV:

Install cpanm (if you use Perl, you'll thank me later):

$ sudo apt-get install cpanminus

If you're not on a Debian-based system, you should be able to install it using your distribution's package manager.

Install the Text::CSV module:

$ sudo cpanm Text::CSV
Parse your file:

$ perl -MText::CSV -le '
$csv = Text::CSV->new({binary=>1});
while ($row = $csv->getline(STDIN)){
print "1:$row->[0], 2:$row->[1], 3:$row->[2], 4:$row->[3]"}' < file.csv
1:prabhat,kumar, 2:19, 3:2000, 4:bangalore,India

As you can see above, the first field is $row->[0], the second is $row->[1], and so on.
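The same module can also produce the pipe-delimited output the question actually asks for; a sketch (it assumes Text::CSV is installed, and note that the quotes are gone from the output because the parser strips them while reading):

```shell
# Sketch: parse each line with Text::CSV, then re-join the fields
# with "|". Quoting is removed by the parser, so the output fields
# come out unquoted.
printf '%s\n' '"prabhat,kumar",19,2000,"bangalore,India"' |
perl -MText::CSV -lne '
  BEGIN { $csv = Text::CSV->new({ binary => 1 }) }
  print join "|", $csv->fields if $csv->parse($_)'
# prints: prabhat,kumar|19|2000|bangalore,India
```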
That was the correct way. A simpler, but dirty, hack is to replace any quoted comma with another character, run awk normally, and finally switch the substitutes back to commas. I am using ### here, but you can use anything you are sure will never appear in one of your fields.

$ sed -r 's/("[^",]+),([^",]+")/\1###\2/g' file.csv |
awk -F, '{print $1,$3}' | sed 's/###/,/g'
"prabhat,kumar" 2000
answered Aug 11 '14 at 13:09, edited Nov 10 '14 at 14:49 – terdon♦

Your script has a problem if the quoted field doesn't contain a comma: echo '"prabhat kumar",19,2000,"bangalore,India"' | sed -r 's/("[^"]+),([^"]+")/\1###\2/g' prints "prabhat kumar",19###2000,"bangalore,India"
– Doug O'Neal, Aug 11 '14 at 16:56

First field in the above should be "prabhat kumar" (sorry).
– Doug O'Neal, Aug 11 '14 at 17:03

I cut the wrong line: echo '"prabhat kumar",19,2000,"bangalore,India"' | sed -r 's/("[^"]+),([^"]+")/\1###\2/g' prints "prabhat kumar",19###2000,"bangalore,India"
– Doug O'Neal, Aug 11 '14 at 17:04

@DougO'Neal Yeah, I see the problem, thanks. I don't have time to fix it right now, but as I said, that's just a dirty hack; a proper parser should be used instead.
– terdon♦, Aug 11 '14 at 17:07

OP asked for awk.
– Andrew, Aug 15 '14 at 19:37
If you have GNU awk:

$ awk -v FPAT='[^,]*|"[^"]*"' '{ gsub(/^"|"$/, "", $1); gsub(/^"|"$/, "", $4); print $1 $4 }'
prabhat,kumarbangalore,India

The output format is a little ugly as I've only printed $1 and $4 next to each other; I'm sure you can alter it to your taste. If you need to retain the double quotes around the fields, remove both gsub() calls.
Explanation:

Normally, awk separates the fields in a record using the content of the FS (Field Separator) variable, which defaults to any whitespace (tab, space and newline). The separator tells awk where a field ends. In a CSV file a field ends at a comma (pass -v FS=, to awk), but of course, in examples like yours this is too simple and it breaks.

Alternatively, GNU awk's FPAT (Field Pattern) variable defines what a field is. Instead of telling awk where a field ends, you give a pattern that matches a whole field. For the CSV in your example this is [^,]*|"[^"]*", which breaks down as follows:
- Characters that are not commas ([^,]), as many times as possible (*): everything between two commas is a field.
- Or (|):
- A double quote (") followed by non-quote characters ([^"]) as many times as possible (*), followed by a closing double quote ("): everything within double quotes, including commas, counts as one field.
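Putting FPAT to work on the full input from the question gives exactly the pipe-separated output asked for; a sketch (requires GNU awk; the assignment $1 = $1 simply forces awk to rebuild the record using OFS):

```shell
# Requires GNU awk (gawk) for FPAT. Each field is either a run of
# non-commas or a quoted string; $1 = $1 rebuilds the record with OFS="|".
printf '%s\n' '"prabhat,kumar",19,2000,"bangalore,India","ABC,DEF","GHI",123,"KLM","NOP,QRS"' |
gawk -v FPAT='[^,]*|"[^"]*"' -v OFS='|' '{ $1 = $1; print }'
# prints: "prabhat,kumar"|19|2000|"bangalore,India"|"ABC,DEF"|"GHI"|123|"KLM"|"NOP,QRS"
```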
answered Aug 11 '14 at 13:11, edited Apr 29 '18 at 22:32 – garethTheRed
Ruby is handy for CSV parsing:
ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|")' file
prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS
prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS
Note that there are no quotes in the output. This is because none of the fields contain the field separator. If you need quotes, you can forcibly quote all fields (even integers):
ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|",:force_quotes=>true)' file
"prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"
"prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"
answered Aug 11 '14 at 18:27 – glenn jackman

OP asked for awk.
– Andrew, Aug 15 '14 at 19:38
You can use perl instead:

$ echo '"prabhat,kumar",19,2000,"bangalore,India"' |
perl -F',(?![[:alpha:]])' -anle 'print "$F[0] $F[3]"'
"prabhat,kumar" "bangalore,India"

answered Aug 11 '14 at 12:07 – cuonglm
Clever! However, it will fail for cases where there is a non-letter after the comma, for example prabhat, (kumar), or a letter after a non-quoted comma: "prabhat,kumar",A 19,
– terdon♦, Aug 11 '14 at 12:53

@terdon: Yes, of course; that is by design. Your answer using Text::CSV is better for this job.
– cuonglm, Aug 11 '14 at 14:06
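terdon's failure case is easy to reproduce; a sketch (assumes perl is available): an unquoted separator comma that happens to be followed by a letter is not split, so the first two fields get merged.

```shell
# The lookahead ',(?![[:alpha:]])' only splits on commas NOT followed by
# a letter, so the unquoted comma before "A" below is wrongly kept
# inside $F[0].
printf '%s\n' '"prabhat,kumar",A 19,2000' |
perl -F',(?![[:alpha:]])' -anle 'print "0:$F[0] 1:$F[1]"'
# prints: 0:"prabhat,kumar",A 19 1:2000
```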
This worked for me:

$ echo '"prabhat,kumar",19,2000,"bangalore,India"' |
awk -F, '{print $1,$2,$3,$4,$5,$6}' |
awk -F'"' '{print $2,$3,$4}' | awk '{print $1","$2,$3,$4,$5","$6}'
please add an explanation to your answer
– Networker, Aug 11 '14 at 13:54

This will only work if you know the format beforehand. What if it is the 3rd field that's quoted and not the 1st?
– terdon♦, Aug 11 '14 at 14:08

OP gave a specific format; there is no need nor expectation for the format to change: "Last,First",Age,Year,"City,Country". Bangalore is a city in Karnataka, so state is undefined in the data. The first and fourth fields are text, so expect quotes. The second and third fields are integers and would/should never be quoted. Understand the data and avoid overcomplicating the problem.
– Andrew, Aug 15 '14 at 19:34
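Andrew's reading of the fixed format can be condensed into one POSIX awk invocation: split on the double quote instead, which makes $2 the name, $3 the unquoted middle (",19,2000,") and $4 the city. A sketch, assuming exactly the four-field layout from the question:

```shell
# With FS='"', the fixed layout splits as:
#   $2 = prabhat,kumar   $3 = ,19,2000,   $4 = bangalore,India
# Trim the stray commas off $3, then print everything pipe-separated.
printf '%s\n' '"prabhat,kumar",19,2000,"bangalore,India"' |
awk -F'"' '{ gsub(/^,|,$/, "", $3); split($3, n, ","); print $2 "|" n[1] "|" n[2] "|" $4 }'
# prints: prabhat,kumar|19|2000|bangalore,India
```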
Probably qsv (quote-separated values) would be more your speed? This sed wraps the line in commas, then repeatedly folds each comma into an adjacent quote and quotes the unquoted fields, so every field ends up delimited by a double quote:

sed 's/.*/,&,/;:t
s/,"/"/;s/",/"/;s/,\([^"]*\),/"\1/;tt
' <<DATA
"prabhat,kumar",19,2000,"bangalore,India"
DATA

OUTPUT:

"prabhat,kumar"19"2000"bangalore,India"
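Once the record is quote-separated like this, plain awk can address the fields with -F'"'; a sketch combining the two steps:

```shell
# Convert comma-separated-with-quotes into quote-separated, then let any
# POSIX awk split on the double quote: $2 and $5 are the two text fields.
printf '%s\n' '"prabhat,kumar",19,2000,"bangalore,India"' |
sed 's/.*/,&,/;:t
s/,"/"/;s/",/"/;s/,\([^"]*\),/"\1/;tt' |
awk -F'"' '{print $2, $5}'
# prints: prabhat,kumar bangalore,India
```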
add a comment |
6 Answers
6
active
oldest
votes
6 Answers
6
active
oldest
votes
active
oldest
votes
active
oldest
votes
First of all, you should use a proper CSV parser. For example, in Perl, you can use Text::CSV
:
Install
cpanm
(if you use Perl, you'll thank me later)
$ sudo apt-get install cpanminus
If you're not on a Debian based system, you should be able to install it using your distribution's package manager.
Install the
Text::CSV
module
$ sudo cpanm Text::CSV
Parse your file
$ perl -MText::CSV -le '
$csv = Text::CSV->new({binary=>1});
while ($row = $csv->getline(STDIN)){
print "1:$row->[0], 2:$row->[1], 3:$row->[2], 4:$row->[3]"}' < file.csv
1:prabhat,kumar, 2:19, 3:2000, 4:bangalore,India
As you can see above, the 1st field is
$row->[0]
, the second$row->[1]
etc.
That was the correct way. A simpler, but dirty, hack would be to replace any quoted commas with another character. Then, use awk
normally and finally, switch them back to commas again. I am using ###
here but you can use anything you are sure will never be in one of your fields.
$ sed -r 's/("[^",]+),([^",]+")/1###2/g' file.csv |
awk -F, '{print $1,$3}' | sed 's/###/,/g'
"prabhat,kumar" 2000
Your script has a problem if the quoted field doesn't contain a comma.echo '"prabhat kumar",19,2000,"bangalore,India"'|sed -r 's/("[^"]+),([^"]+")/1###2/g'
prints"prabhat kumar",19###2000,"bangalore,India"
– Doug O'Neal
Aug 11 '14 at 16:56
First field in the above should be"prabhat kumar"
- sorry
– Doug O'Neal
Aug 11 '14 at 17:03
I cut the wrong line:echo '"prabhat kumar",19,2000,"bangalore,India"'| sed -r 's/("[^"]+),([^"]+")/1###2/g'
prints"prabhat kumar",19###2000,"bangalore,India"
– Doug O'Neal
Aug 11 '14 at 17:04
@DougO'Neal yeah, I see the problem, thanks. I don't have time to fix it right now but anyway, as I said, that's just a dirty hack and a proper parser should be used instead.
– terdon♦
Aug 11 '14 at 17:07
OP asked for awk
– Andrew
Aug 15 '14 at 19:37
|
show 1 more comment
First of all, you should use a proper CSV parser. For example, in Perl, you can use Text::CSV
:
Install
cpanm
(if you use Perl, you'll thank me later)
$ sudo apt-get install cpanminus
If you're not on a Debian based system, you should be able to install it using your distribution's package manager.
Install the
Text::CSV
module
$ sudo cpanm Text::CSV
Parse your file
$ perl -MText::CSV -le '
$csv = Text::CSV->new({binary=>1});
while ($row = $csv->getline(STDIN)){
print "1:$row->[0], 2:$row->[1], 3:$row->[2], 4:$row->[3]"}' < file.csv
1:prabhat,kumar, 2:19, 3:2000, 4:bangalore,India
As you can see above, the 1st field is
$row->[0]
, the second$row->[1]
etc.
That was the correct way. A simpler, but dirty, hack would be to replace any quoted commas with another character. Then, use awk
normally and finally, switch them back to commas again. I am using ###
here but you can use anything you are sure will never be in one of your fields.
$ sed -r 's/("[^",]+),([^",]+")/1###2/g' file.csv |
awk -F, '{print $1,$3}' | sed 's/###/,/g'
"prabhat,kumar" 2000
Your script has a problem if the quoted field doesn't contain a comma.echo '"prabhat kumar",19,2000,"bangalore,India"'|sed -r 's/("[^"]+),([^"]+")/1###2/g'
prints"prabhat kumar",19###2000,"bangalore,India"
– Doug O'Neal
Aug 11 '14 at 16:56
First field in the above should be"prabhat kumar"
- sorry
– Doug O'Neal
Aug 11 '14 at 17:03
I cut the wrong line:echo '"prabhat kumar",19,2000,"bangalore,India"'| sed -r 's/("[^"]+),([^"]+")/1###2/g'
prints"prabhat kumar",19###2000,"bangalore,India"
– Doug O'Neal
Aug 11 '14 at 17:04
@DougO'Neal yeah, I see the problem, thanks. I don't have time to fix it right now but anyway, as I said, that's just a dirty hack and a proper parser should be used instead.
– terdon♦
Aug 11 '14 at 17:07
OP asked for awk
– Andrew
Aug 15 '14 at 19:37
|
show 1 more comment
First of all, you should use a proper CSV parser. For example, in Perl, you can use Text::CSV
:
Install
cpanm
(if you use Perl, you'll thank me later)
$ sudo apt-get install cpanminus
If you're not on a Debian based system, you should be able to install it using your distribution's package manager.
Install the
Text::CSV
module
$ sudo cpanm Text::CSV
Parse your file
$ perl -MText::CSV -le '
$csv = Text::CSV->new({binary=>1});
while ($row = $csv->getline(STDIN)){
print "1:$row->[0], 2:$row->[1], 3:$row->[2], 4:$row->[3]"}' < file.csv
1:prabhat,kumar, 2:19, 3:2000, 4:bangalore,India
As you can see above, the 1st field is
$row->[0]
, the second$row->[1]
etc.
That was the correct way. A simpler, but dirty, hack would be to replace any quoted commas with another character. Then, use awk
normally and finally, switch them back to commas again. I am using ###
here but you can use anything you are sure will never be in one of your fields.
$ sed -r 's/("[^",]+),([^",]+")/1###2/g' file.csv |
awk -F, '{print $1,$3}' | sed 's/###/,/g'
"prabhat,kumar" 2000
First of all, you should use a proper CSV parser. For example, in Perl, you can use Text::CSV
:
Install
cpanm
(if you use Perl, you'll thank me later)
$ sudo apt-get install cpanminus
If you're not on a Debian based system, you should be able to install it using your distribution's package manager.
Install the
Text::CSV
module
$ sudo cpanm Text::CSV
Parse your file
$ perl -MText::CSV -le '
$csv = Text::CSV->new({binary=>1});
while ($row = $csv->getline(STDIN)){
print "1:$row->[0], 2:$row->[1], 3:$row->[2], 4:$row->[3]"}' < file.csv
1:prabhat,kumar, 2:19, 3:2000, 4:bangalore,India
As you can see above, the 1st field is
$row->[0]
, the second$row->[1]
etc.
That was the correct way. A simpler, but dirty, hack would be to replace any quoted commas with another character. Then, use awk
normally and finally, switch them back to commas again. I am using ###
here but you can use anything you are sure will never be in one of your fields.
$ sed -r 's/("[^",]+),([^",]+")/1###2/g' file.csv |
awk -F, '{print $1,$3}' | sed 's/###/,/g'
"prabhat,kumar" 2000
edited Nov 10 '14 at 14:49
answered Aug 11 '14 at 13:09
terdon♦terdon
129k32253428
129k32253428
Your script has a problem if the quoted field doesn't contain a comma.echo '"prabhat kumar",19,2000,"bangalore,India"'|sed -r 's/("[^"]+),([^"]+")/1###2/g'
prints"prabhat kumar",19###2000,"bangalore,India"
– Doug O'Neal
Aug 11 '14 at 16:56
First field in the above should be"prabhat kumar"
- sorry
– Doug O'Neal
Aug 11 '14 at 17:03
I cut the wrong line:echo '"prabhat kumar",19,2000,"bangalore,India"'| sed -r 's/("[^"]+),([^"]+")/1###2/g'
prints"prabhat kumar",19###2000,"bangalore,India"
– Doug O'Neal
Aug 11 '14 at 17:04
@DougO'Neal yeah, I see the problem, thanks. I don't have time to fix it right now but anyway, as I said, that's just a dirty hack and a proper parser should be used instead.
– terdon♦
Aug 11 '14 at 17:07
OP asked for awk
– Andrew
Aug 15 '14 at 19:37
|
show 1 more comment
Your script has a problem if the quoted field doesn't contain a comma.echo '"prabhat kumar",19,2000,"bangalore,India"'|sed -r 's/("[^"]+),([^"]+")/1###2/g'
prints"prabhat kumar",19###2000,"bangalore,India"
– Doug O'Neal
Aug 11 '14 at 16:56
First field in the above should be"prabhat kumar"
- sorry
– Doug O'Neal
Aug 11 '14 at 17:03
I cut the wrong line:echo '"prabhat kumar",19,2000,"bangalore,India"'| sed -r 's/("[^"]+),([^"]+")/1###2/g'
prints"prabhat kumar",19###2000,"bangalore,India"
– Doug O'Neal
Aug 11 '14 at 17:04
@DougO'Neal yeah, I see the problem, thanks. I don't have time to fix it right now but anyway, as I said, that's just a dirty hack and a proper parser should be used instead.
– terdon♦
Aug 11 '14 at 17:07
OP asked for awk
– Andrew
Aug 15 '14 at 19:37
Your script has a problem if the quoted field doesn't contain a comma.
echo '"prabhat kumar",19,2000,"bangalore,India"'|sed -r 's/("[^"]+),([^"]+")/1###2/g'
prints "prabhat kumar",19###2000,"bangalore,India"
– Doug O'Neal
Aug 11 '14 at 16:56
Your script has a problem if the quoted field doesn't contain a comma.
echo '"prabhat kumar",19,2000,"bangalore,India"'|sed -r 's/("[^"]+),([^"]+")/1###2/g'
prints "prabhat kumar",19###2000,"bangalore,India"
– Doug O'Neal
Aug 11 '14 at 16:56
First field in the above should be
"prabhat kumar"
- sorry– Doug O'Neal
Aug 11 '14 at 17:03
First field in the above should be
"prabhat kumar"
- sorry– Doug O'Neal
Aug 11 '14 at 17:03
I cut the wrong line:
echo '"prabhat kumar",19,2000,"bangalore,India"'| sed -r 's/("[^"]+),([^"]+")/1###2/g'
prints "prabhat kumar",19###2000,"bangalore,India"
– Doug O'Neal
Aug 11 '14 at 17:04
I cut the wrong line:
echo '"prabhat kumar",19,2000,"bangalore,India"'| sed -r 's/("[^"]+),([^"]+")/1###2/g'
prints "prabhat kumar",19###2000,"bangalore,India"
– Doug O'Neal
Aug 11 '14 at 17:04
@DougO'Neal yeah, I see the problem, thanks. I don't have time to fix it right now but anyway, as I said, that's just a dirty hack and a proper parser should be used instead.
– terdon♦
Aug 11 '14 at 17:07
@DougO'Neal yeah, I see the problem, thanks. I don't have time to fix it right now but anyway, as I said, that's just a dirty hack and a proper parser should be used instead.
– terdon♦
Aug 11 '14 at 17:07
OP asked for awk
– Andrew
Aug 15 '14 at 19:37
OP asked for awk
– Andrew
Aug 15 '14 at 19:37
|
show 1 more comment
If you have GNU awk
:
$ awk -vFPAT='[^,]*|"[^"]*"' '{ gsub("^"|"$","",$1); gsub("^"|"$","",$4); print $1 $4} '
prabhat,kumarbangalore,India
The output format is a little ugly as I've only printed $1
and $4
next to each other - I'm sure you can alter it to your taste.
If you need to retain the double quotes around the fields, remove both gsub();
functions.
Explanation:
Normally, awk
separates fields in a record by the content of the FS
(Field Separator) variable, which defaults to any whitespace (tab, space and newline). The separator tells awk
where a record ends. In a csv
file, a record ends at a comma (passed to awk
as -vFS=,
), but of course, in examples similar to yours, this is too simple and it breaks.
Alternatively, the FPAT
(Field Pattern) defines the record within awk
. Instead of telling awk
where the record ends, you create a definition that encompasses a whole record. In the complex csv
of your example, this is [^,]*|"[^"]*"
This breaks down as follows:-
- Characters that are not commas (
[^,]
) as many times as possible (*
). Everything between two commas is a field. - Or (
|
) - A single double-quote (
"
) followed by no double-quotes ([^"]
) as many times as possible (*
) followed by a single double-quote ("
). Everything within double-quotes (including a comma) counts as one field.
add a comment |
If you have GNU awk
:
$ awk -vFPAT='[^,]*|"[^"]*"' '{ gsub("^"|"$","",$1); gsub("^"|"$","",$4); print $1 $4} '
prabhat,kumarbangalore,India
The output format is a little ugly as I've only printed $1
and $4
next to each other - I'm sure you can alter it to your taste.
If you need to retain the double quotes around the fields, remove both gsub();
functions.
Explanation:
Normally, awk
separates fields in a record by the content of the FS
(Field Separator) variable, which defaults to any whitespace (tab, space and newline). The separator tells awk
where a record ends. In a csv
file, a record ends at a comma (passed to awk
as -vFS=,
), but of course, in examples similar to yours, this is too simple and it breaks.
Alternatively, the FPAT
(Field Pattern) defines the record within awk
. Instead of telling awk
where the record ends, you create a definition that encompasses a whole record. In the complex csv
of your example, this is [^,]*|"[^"]*"
This breaks down as follows:-
- Characters that are not commas (
[^,]
) as many times as possible (*
). Everything between two commas is a field. - Or (
|
) - A single double-quote (
"
) followed by no double-quotes ([^"]
) as many times as possible (*
) followed by a single double-quote ("
). Everything within double-quotes (including a comma) counts as one field.
add a comment |
If you have GNU awk
:
$ awk -vFPAT='[^,]*|"[^"]*"' '{ gsub("^"|"$","",$1); gsub("^"|"$","",$4); print $1 $4} '
prabhat,kumarbangalore,India
The output format is a little ugly as I've only printed $1
and $4
next to each other - I'm sure you can alter it to your taste.
If you need to retain the double quotes around the fields, remove both gsub();
functions.
Explanation:
Normally, awk
separates fields in a record by the content of the FS
(Field Separator) variable, which defaults to any whitespace (tab, space and newline). The separator tells awk
where a record ends. In a csv
file, a record ends at a comma (passed to awk
as -vFS=,
), but of course, in examples similar to yours, this is too simple and it breaks.
Alternatively, the FPAT
(Field Pattern) defines the record within awk
. Instead of telling awk
where the record ends, you create a definition that encompasses a whole record. In the complex csv
of your example, this is [^,]*|"[^"]*"
This breaks down as follows:-
- Characters that are not commas (
[^,]
) as many times as possible (*
). Everything between two commas is a field. - Or (
|
) - A single double-quote (
"
) followed by no double-quotes ([^"]
) as many times as possible (*
) followed by a single double-quote ("
). Everything within double-quotes (including a comma) counts as one field.
If you have GNU awk
:
$ awk -vFPAT='[^,]*|"[^"]*"' '{ gsub("^"|"$","",$1); gsub("^"|"$","",$4); print $1 $4} '
prabhat,kumarbangalore,India
The output format is a little ugly as I've only printed $1
and $4
next to each other - I'm sure you can alter it to your taste.
If you need to retain the double quotes around the fields, remove both gsub();
functions.
Explanation:
Normally, awk
separates fields in a record by the content of the FS
(Field Separator) variable, which defaults to any whitespace (tab, space and newline). The separator tells awk
where a record ends. In a csv
file, a record ends at a comma (passed to awk
as -vFS=,
), but of course, in examples similar to yours, this is too simple and it breaks.
Alternatively, the FPAT
(Field Pattern) defines the record within awk
. Instead of telling awk
where the record ends, you create a definition that encompasses a whole record. In the complex csv
of your example, this is [^,]*|"[^"]*"
This breaks down as follows:-
- Characters that are not commas (
[^,]
) as many times as possible (*
). Everything between two commas is a field. - Or (
|
) - A single double-quote (
"
) followed by no double-quotes ([^"]
) as many times as possible (*
) followed by a single double-quote ("
). Everything within double-quotes (including a comma) counts as one field.
edited Apr 29 '18 at 22:32
Jeff Schaller
39.2k1054125
39.2k1054125
answered Aug 11 '14 at 13:11
garethTheRedgarethTheRed
24.1k36180
24.1k36180
add a comment |
add a comment |
Ruby is handy for CSV parsing:
ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|")' file
prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS
prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS
Note that there are no quotes in the outputs. This is because none of the fields contain the field separator. If you need quotes, you can forcibly quote all fields (even integers):
ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|",:force_quotes=>true)' file
"prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"
"prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"
OP asked for awk
– Andrew
Aug 15 '14 at 19:38
add a comment |
Ruby is handy for CSV parsing:
ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|")' file
prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS
prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS
Note that there are no quotes in the outputs. This is because none of the fields contain the field separator. If you need quotes, you can forcibly quote all fields (even integers):
ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|",:force_quotes=>true)' file
"prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"
"prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"
OP asked for awk
– Andrew
Aug 15 '14 at 19:38
add a comment |
Ruby is handy for CSV parsing:
ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|")' file
prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS
prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS
Note that there are no quotes in the outputs. This is because none of the fields contain the field separator. If you need quotes, you can forcibly quote all fields (even integers):
ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|",:force_quotes=>true)' file
"prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"
"prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"
Ruby is handy for CSV parsing:
ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|")' file
prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS
prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS
Note that there are no quotes in the outputs. This is because none of the fields contain the field separator. If you need quotes, you can forcibly quote all fields (even integers):
ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|",:force_quotes=>true)' file
"prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"
"prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"
answered Aug 11 '14 at 18:27
glenn jackmanglenn jackman
50.5k570108
50.5k570108
OP asked for awk
– Andrew
Aug 15 '14 at 19:38
add a comment |
OP asked for awk
– Andrew
Aug 15 '14 at 19:38
OP asked for awk
– Andrew
Aug 15 '14 at 19:38
OP asked for awk
– Andrew
Aug 15 '14 at 19:38
add a comment |
You can use perl
instead:
$ echo '"prabhat,kumar",19,2000,"bangalore,India"' |
perl -F',(?![[:alpha:]])' -anle 'print "$F[0] $F[3]"'
"prabhat,kumar" "bangalore,India"
1
Clever! However, it will fail for cases where there is a non letter after the comma. For exampleprabhat, (kumar)
or a letter after a non-quoted comma:"prabhat,kumar",A 19,
– terdon♦
Aug 11 '14 at 12:53
@terdon: Yes, of course, as it's desired. Your answer usingText:CSV
is better for this job.
– cuonglm
Aug 11 '14 at 14:06
add a comment |
You can use perl
instead:
$ echo '"prabhat,kumar",19,2000,"bangalore,India"' |
perl -F',(?![[:alpha:]])' -anle 'print "$F[0] $F[3]"'
"prabhat,kumar" "bangalore,India"
1
Clever! However, it will fail for cases where there is a non letter after the comma. For exampleprabhat, (kumar)
or a letter after a non-quoted comma:"prabhat,kumar",A 19,
– terdon♦
Aug 11 '14 at 12:53
@terdon: Yes, of course, as it's desired. Your answer usingText:CSV
is better for this job.
– cuonglm
Aug 11 '14 at 14:06
add a comment |
You can use perl
instead:
$ echo '"prabhat,kumar",19,2000,"bangalore,India"' |
perl -F',(?![[:alpha:]])' -anle 'print "$F[0] $F[3]"'
"prabhat,kumar" "bangalore,India"
You can use perl
instead:
$ echo '"prabhat,kumar",19,2000,"bangalore,India"' |
perl -F',(?![[:alpha:]])' -anle 'print "$F[0] $F[3]"'
"prabhat,kumar" "bangalore,India"
answered Aug 11 '14 at 12:07
cuonglmcuonglm
102k23202302
102k23202302
1
Clever! However, it will fail for cases where there is a non letter after the comma. For exampleprabhat, (kumar)
or a letter after a non-quoted comma:"prabhat,kumar",A 19,
– terdon♦
Aug 11 '14 at 12:53
@terdon: Yes, of course, as it's desired. Your answer usingText:CSV
is better for this job.
– cuonglm
Aug 11 '14 at 14:06
add a comment |
1
Clever! However, it will fail for cases where there is a non letter after the comma. For exampleprabhat, (kumar)
or a letter after a non-quoted comma:"prabhat,kumar",A 19,
– terdon♦
Aug 11 '14 at 12:53
@terdon: Yes, of course, as it's desired. Your answer usingText:CSV
is better for this job.
– cuonglm
Aug 11 '14 at 14:06
1
1
Clever! However, it will fail for cases where there is a non-letter after the comma, for example prabhat, (kumar), or a letter after a non-quoted comma: "prabhat,kumar",A 19,
– terdon♦
Aug 11 '14 at 12:53
@terdon: Yes, of course; that is as desired. Your answer using Text::CSV is better for this job.
– cuonglm
Aug 11 '14 at 14:06
This worked for me:
$ echo '"prabhat,kumar",19,2000,"bangalore,India"' |
awk -F, '{print $1,$2,$3,$4,$5,$6}' |
awk -F'"' '{print $2,$3,$4}' |
awk '{print $1","$2,$3,$4,$5","$6}'
prabhat,kumar 19 2000 bangalore,India
please add an explanation to your answer
– Networker
Aug 11 '14 at 13:54
This will only work if you know the format beforehand. What if it is the 3rd field that's quoted and not the 1st?
– terdon♦
Aug 11 '14 at 14:08
The OP gave a specific format, with no need or expectation for the format to change: "Last,First",Age,Year,"City,Country". Bangalore is a city in Karnataka, so state is undefined in the data. The first and fourth fields are text, so expect quotes; the second and third fields are integers and would/should never be quoted. Understand the data and avoid over-complicating the problem.
– Andrew
Aug 15 '14 at 19:34
edited Aug 11 '14 at 14:07 by terdon♦; answered Aug 11 '14 at 13:26 by Andrew
Probably a qsv would be more your speed?
sed 's/.*/,&,/;:t
s/,"/"/;s/",/"/;s/,\([^"]*\),/"\1/;tt
' <<DATA
"prabhat,kumar",19,2000,"bangalore,India"
DATA
OUTPUT:
"prabhat,kumar"19"2000"bangalore,India"
edited Aug 11 '14 at 22:17; answered Aug 11 '14 at 21:46 by mikeserv
And that is why you should use a proper library that parses CSV (and XML, HTML, etc.) rather than trying to write your own. Take a look at CSVfix.
– HBruijn
Aug 11 '14 at 12:04
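HBruijn's advice can be followed without leaving the shell; a minimal sketch using Python's standard-library csv module (assuming python3 is available — the re-quoting rule "quote anything non-numeric" is my assumption, chosen to match the OP's desired output):

```shell
echo '"prabhat,kumar",19,2000,"bangalore,India","ABC,DEF","GHI",123,"KLM","NOP,QRS"' |
python3 -c '
import csv, sys
# csv.reader understands quoted fields, so embedded commas survive the split
for row in csv.reader(sys.stdin):
    # re-quote the non-numeric fields and join with "|", as in the desired output
    print("|".join(f if f.isdigit() else "\"" + f + "\"" for f in row))
'
# -> "prabhat,kumar"|19|2000|"bangalore,India"|"ABC,DEF"|"GHI"|123|"KLM"|"NOP,QRS"
```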