Handling comma in string values in a CSV file [duplicate]
This question already has an answer here:




  • Is there a robust command line tool for processing csv files?

    17 answers




I have a comma-separated file with numeric and string columns. The string columns are quoted and can contain commas inside the quotes. How do I split the fields correctly when FS="," also matches the commas inside the quotes?



Sample record:



"prabhat,kumar",19,2000,"bangalore,India"


In awk, the fields should come out as:



$1 = "prabhat,kumar"
$2 = 19
$3 = 2000
$4 = "bangalore,India"


Setting FS="," splits inside the quoted fields as well, which is the problem.



Input is:



"prabhat,kumar",19,2000,"bangalore,India","ABC,DEF","GHI",123,"KLM","NOP,QRS"
"prabhat,kumar",19,2000,"bangalore,India","ABC,DEF","GHI",123,"KLM","NOP,QRS"


Output should be:



"prabhat,kumar"|19|2000|"bangalore,India"|"ABC,DEF"|"GHI"|123|"KLM"|"NOP,QRS"
"prabhat,kumar"|19|2000|"bangalore,India"|"ABC,DEF"|"GHI"|123|"KLM"|"NOP,QRS"


Code I am trying:



awk -F"," '{for(i=1;i<=NF;i++){if(i%NF==0){ORS="\n"} {if($i ~ /^"/ || $i ~ /"$/) {a=a OFS $i;j++;{if(j%2==0){sub(/^,/,X,a); print a;j=0;a=""}}} else {print $i}}} {ORS="|"}}' ORS="|" OFS=, p.txt
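For reference, the requested transformation is straightforward with any CSV-aware parser. A minimal Python sketch using the standard csv module (the int() re-typing is my assumption, so that the numeric columns come out unquoted while QUOTE_NONNUMERIC re-quotes the string columns):

```python
import csv
import io

# the two sample input lines from the question
data = '''"prabhat,kumar",19,2000,"bangalore,India","ABC,DEF","GHI",123,"KLM","NOP,QRS"
"prabhat,kumar",19,2000,"bangalore,India","ABC,DEF","GHI",123,"KLM","NOP,QRS"'''

out = io.StringIO()
# QUOTE_NONNUMERIC quotes every string field on output, so the quoted
# columns come back quoted while int fields stay bare
writer = csv.writer(out, delimiter='|', quoting=csv.QUOTE_NONNUMERIC,
                    lineterminator='\n')
for row in csv.reader(io.StringIO(data)):
    writer.writerow([int(f) if f.isdigit() else f for f in row])

# each line: "prabhat,kumar"|19|2000|"bangalore,India"|"ABC,DEF"|"GHI"|123|"KLM"|"NOP,QRS"
print(out.getvalue(), end='')
```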
marked as duplicate by Gilles, Michael Homer, jasonwryan, Ramesh, cuonglm Aug 12 '14 at 3:29


This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.











  • And that is why you should use a proper library that parses CSV (and XML and HTML etc.) rather than trying to write your own. Take a look at CSVfix
    – HBruijn
    Aug 11 '14 at 12:04
text-processing awk csv
edited Aug 17 '14 at 12:18 by Peter Mortensen
asked Aug 11 '14 at 11:53 by prabhat diwaker
6 Answers
First of all, you should use a proper CSV parser. For example, in Perl, you can use Text::CSV:





  1. Install cpanm (if you use Perl, you'll thank me later)



    $ sudo apt-get install cpanminus


    If you're not on a Debian-based system, you should be able to install it with your distribution's package manager.




  2. Install the Text::CSV module



    $ sudo cpanm Text::CSV



  3. Parse your file



    $ perl -MText::CSV -le '
    $csv = Text::CSV->new({binary=>1});
    while ($row = $csv->getline(STDIN)){
    print "1:$row->[0], 2:$row->[1], 3:$row->[2], 4:$row->[3]"}' < file.csv
    1:prabhat,kumar, 2:19, 3:2000, 4:bangalore,India


    As you can see above, the 1st field is $row->[0], the second $row->[1] etc.






That was the correct way. A simpler but dirty hack would be to replace the quoted commas with another character, run awk normally, and finally switch them back to commas. I am using ### here, but you can use anything you are sure will never appear in one of your fields.



$ sed -r 's/("[^",]+),([^",]+")/\1###\2/g' file.csv |
awk -F, '{print $1,$3}' | sed 's/###/,/g'
"prabhat,kumar" 2000
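The same protect/split/restore idea can be sketched in Python without requiring the quoted field to contain exactly one comma (the failure mode pointed out in the comments below); the ### sentinel carries the same assumption that it never occurs in the data:

```python
import re

# Doug O'Neal's example: the first quoted field has no comma at all
line = '"prabhat kumar",19,2000,"bangalore,India"'
SENTINEL = '###'  # assumed never to appear inside a field

# hide every comma inside a quoted field, split on the remaining commas,
# then put the hidden commas back
protected = re.sub(r'"[^"]*"',
                   lambda m: m.group(0).replace(',', SENTINEL), line)
fields = [f.replace(SENTINEL, ',') for f in protected.split(',')]
print(fields)  # ['"prabhat kumar"', '19', '2000', '"bangalore,India"']
```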

answered Aug 11 '14 at 13:09 by terdon, edited Nov 10 '14 at 14:49

  • Your script has a problem if the quoted field doesn't contain a comma. echo '"prabhat kumar",19,2000,"bangalore,India"' | sed -r 's/("[^"]+),([^"]+")/\1###\2/g' prints "prabhat kumar",19###2000,"bangalore,India"
    – Doug O'Neal
    Aug 11 '14 at 16:56












  • First field in the above should be "prabhat kumar" - sorry
    – Doug O'Neal
    Aug 11 '14 at 17:03










  • I cut the wrong line: echo '"prabhat kumar",19,2000,"bangalore,India"' | sed -r 's/("[^"]+),([^"]+")/\1###\2/g' prints "prabhat kumar",19###2000,"bangalore,India"
    – Doug O'Neal
    Aug 11 '14 at 17:04












  • @DougO'Neal yeah, I see the problem, thanks. I don't have time to fix it right now but anyway, as I said, that's just a dirty hack and a proper parser should be used instead.
    – terdon
    Aug 11 '14 at 17:07










  • OP asked for awk
    – Andrew
    Aug 15 '14 at 19:37
If you have GNU awk:



$ awk -v FPAT='[^,]*|"[^"]*"' '{ gsub(/^"|"$/,"",$1); gsub(/^"|"$/,"",$4); print $1 $4 }'
prabhat,kumarbangalore,India


The output format is a little ugly as I've only printed $1 and $4 next to each other - I'm sure you can alter it to your taste.



If you need to retain the double quotes around the fields, remove both gsub() calls.



Explanation:



Normally, awk separates the fields in a record by the content of the FS (Field Separator) variable, which defaults to any whitespace (tab, space and newline). The separator tells awk where a field ends. In a CSV file, a field ends at a comma (passed to awk as -v FS=,), but of course, in examples similar to yours this is too simple and it breaks.



Alternatively, the FPAT (Field Pattern) variable defines the field itself rather than the separator. Instead of telling awk where a field ends, you write a pattern that matches a whole field. For the complex CSV of your example, this is [^,]*|"[^"]*".



This breaks down as follows:-




  • Characters that are not commas ([^,]) as many times as possible (*). Everything between two commas is a field.

  • Or (|)

  • A single double-quote (") followed by no double-quotes ([^"]) as many times as possible (*) followed by a single double-quote ("). Everything within double-quotes (including a comma) counts as one field.
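The same field pattern can be emulated outside gawk with any regex engine. A Python sketch of the idea (the two alternatives are swapped because Python's re picks the first matching alternative, while FPAT matching is leftmost-longest; the empty matches produced at each comma must be filtered out, which would also drop genuinely empty fields):

```python
import re

# quoted alternative first: Python's re is leftmost-first, not leftmost-longest
FPAT = r'"[^"]*"|[^,]*'
line = '"prabhat,kumar",19,2000,"bangalore,India"'

# findall also yields zero-width matches at each comma and at end of line;
# drop those empty strings to keep only the real fields
fields = [f for f in re.findall(FPAT, line) if f]
print(fields)  # ['"prabhat,kumar"', '19', '2000', '"bangalore,India"']
```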

answered Aug 11 '14 at 13:11 by garethTheRed, edited Apr 29 '18 at 22:32 by Jeff Schaller

Ruby is handy for CSV parsing:

ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|")' file

prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS
prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS

Note that there are no quotes in the output. This is because none of the fields contains the new field separator (|). If you need quotes, you can forcibly quote all fields (even integers):

ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|",:force_quotes=>true)' file

"prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"
"prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"
    • OP asked for awk
      – Andrew
      Aug 15 '14 at 19:38
You can use perl instead:

$ echo '"prabhat,kumar",19,2000,"bangalore,India"' |
perl -F',(?![[:alpha:]])' -anle 'print "$F[0] $F[3]"'
"prabhat,kumar" "bangalore,India"
  • Clever! However, it will fail for cases where there is a non-letter after the comma, for example prabhat, (kumar), or a letter after a non-quoted comma: "prabhat,kumar",A 19,
    – terdon
    Aug 11 '14 at 12:53










  • @terdon: Yes, of course, that's as intended. Your answer using Text::CSV is better for this job.
    – cuonglm
    Aug 11 '14 at 14:06
This worked for me:

$ echo '"prabhat,kumar",19,2000,"bangalore,India"' |
awk -F, '{print $1,$2,$3,$4,$5,$6}' |
awk -F'"' '{print $2,$3,$4}' |
awk '{print $1","$2,$3,$4,$5","$6}'
  • please add an explanation to your answer
    – Networker
    Aug 11 '14 at 13:54

  • This will only work if you know the format beforehand. What if it is the 3rd field that's quoted and not the 1st?
    – terdon
    Aug 11 '14 at 14:08

  • OP gave a specific format, no need nor expectation for the format to change. "Last,First",Age,Year,"City,Country". Bangalore is a city in Karnataka, so state is undefined in the data. The first and fourth fields are text, so expect quotes. The second and third fields are integers and would/should never be quoted. Understand the data and avoid overcomplicating the problem.
    – Andrew
    Aug 15 '14 at 19:34
Probably a qsv would be more your speed?

sed 's/.*/,&,/;:t
s/,"/"/;s/",/"/;s/,\([^"]*\),/"\1/;tt
' <<DATA
"prabhat,kumar",19,2000,"bangalore,India"
DATA

OUTPUT

"prabhat,kumar"19"2000"bangalore,India"





    share|improve this answer






























      6 Answers
      6






      active

      oldest

      votes








      6 Answers
      6






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      5














      First of all, you should use a proper CSV parser. For example, in Perl, you can use Text::CSV:





      1. Install cpanm (if you use Perl, you'll thank me later)



        $ sudo apt-get install cpanminus


        If you're not on a Debian based system, you should be able to install it using your distribution's package manager.




      2. Install the Text::CSV module



        $ sudo cpanm Text::CSV



      3. Parse your file



        $ perl -MText::CSV -le '
        $csv = Text::CSV->new({binary=>1});
        while ($row = $csv->getline(STDIN)){
        print "1:$row->[0], 2:$row->[1], 3:$row->[2], 4:$row->[3]"}' < file.csv
        1:prabhat,kumar, 2:19, 3:2000, 4:bangalore,India


        As you can see above, the 1st field is $row->[0], the second $row->[1] etc.






      That was the correct way. A simpler, but dirty, hack would be to replace any quoted commas with another character. Then, use awk normally and finally, switch them back to commas again. I am using ### here but you can use anything you are sure will never be in one of your fields.



      $ sed -r 's/("[^",]+),([^",]+")/1###2/g' file.csv | 
      awk -F, '{print $1,$3}' | sed 's/###/,/g'
      "prabhat,kumar" 2000





      share|improve this answer























      • Your script has a problem if the quoted field doesn't contain a comma.echo '"prabhat kumar",19,2000,"bangalore,India"'|sed -r 's/("[^"]+),([^"]+")/1###2/g' prints "prabhat kumar",19###2000,"bangalore,India"
        – Doug O'Neal
        Aug 11 '14 at 16:56












      • First field in the above should be "prabhat kumar" - sorry
        – Doug O'Neal
        Aug 11 '14 at 17:03










      • I cut the wrong line: echo '"prabhat kumar",19,2000,"bangalore,India"'| sed -r 's/("[^"]+),([^"]+")/1###2/g' prints "prabhat kumar",19###2000,"bangalore,India"
        – Doug O'Neal
        Aug 11 '14 at 17:04












      • @DougO'Neal yeah, I see the problem, thanks. I don't have time to fix it right now but anyway, as I said, that's just a dirty hack and a proper parser should be used instead.
        – terdon
        Aug 11 '14 at 17:07










      • OP asked for awk
        – Andrew
        Aug 15 '14 at 19:37
















      5














      First of all, you should use a proper CSV parser. For example, in Perl, you can use Text::CSV:





      1. Install cpanm (if you use Perl, you'll thank me later)



        $ sudo apt-get install cpanminus


        If you're not on a Debian based system, you should be able to install it using your distribution's package manager.




      2. Install the Text::CSV module



        $ sudo cpanm Text::CSV



      3. Parse your file



        $ perl -MText::CSV -le '
        $csv = Text::CSV->new({binary=>1});
        while ($row = $csv->getline(STDIN)){
        print "1:$row->[0], 2:$row->[1], 3:$row->[2], 4:$row->[3]"}' < file.csv
        1:prabhat,kumar, 2:19, 3:2000, 4:bangalore,India


        As you can see above, the 1st field is $row->[0], the second $row->[1] etc.






      That was the correct way. A simpler, but dirty, hack would be to replace any quoted commas with another character. Then, use awk normally and finally, switch them back to commas again. I am using ### here but you can use anything you are sure will never be in one of your fields.



      $ sed -r 's/("[^",]+),([^",]+")/1###2/g' file.csv | 
      awk -F, '{print $1,$3}' | sed 's/###/,/g'
      "prabhat,kumar" 2000





      share|improve this answer























      • Your script has a problem if the quoted field doesn't contain a comma.echo '"prabhat kumar",19,2000,"bangalore,India"'|sed -r 's/("[^"]+),([^"]+")/1###2/g' prints "prabhat kumar",19###2000,"bangalore,India"
        – Doug O'Neal
        Aug 11 '14 at 16:56












      • First field in the above should be "prabhat kumar" - sorry
        – Doug O'Neal
        Aug 11 '14 at 17:03










      • I cut the wrong line: echo '"prabhat kumar",19,2000,"bangalore,India"'| sed -r 's/("[^"]+),([^"]+")/1###2/g' prints "prabhat kumar",19###2000,"bangalore,India"
        – Doug O'Neal
        Aug 11 '14 at 17:04












      • @DougO'Neal yeah, I see the problem, thanks. I don't have time to fix it right now but anyway, as I said, that's just a dirty hack and a proper parser should be used instead.
        – terdon
        Aug 11 '14 at 17:07










      • OP asked for awk
        – Andrew
        Aug 15 '14 at 19:37














      5












      5








      5






      First of all, you should use a proper CSV parser. For example, in Perl, you can use Text::CSV:





      1. Install cpanm (if you use Perl, you'll thank me later)



        $ sudo apt-get install cpanminus


        If you're not on a Debian based system, you should be able to install it using your distribution's package manager.




      2. Install the Text::CSV module



        $ sudo cpanm Text::CSV



      3. Parse your file



        $ perl -MText::CSV -le '
        $csv = Text::CSV->new({binary=>1});
        while ($row = $csv->getline(STDIN)){
        print "1:$row->[0], 2:$row->[1], 3:$row->[2], 4:$row->[3]"}' < file.csv
        1:prabhat,kumar, 2:19, 3:2000, 4:bangalore,India


        As you can see above, the 1st field is $row->[0], the second $row->[1] etc.






      That was the correct way. A simpler, but dirty, hack would be to replace any quoted commas with another character. Then, use awk normally and finally, switch them back to commas again. I am using ### here but you can use anything you are sure will never be in one of your fields.



      $ sed -r 's/("[^",]+),([^",]+")/1###2/g' file.csv | 
      awk -F, '{print $1,$3}' | sed 's/###/,/g'
      "prabhat,kumar" 2000





      share|improve this answer














      First of all, you should use a proper CSV parser. For example, in Perl, you can use Text::CSV:





      1. Install cpanm (if you use Perl, you'll thank me later)



        $ sudo apt-get install cpanminus


        If you're not on a Debian based system, you should be able to install it using your distribution's package manager.




      2. Install the Text::CSV module



        $ sudo cpanm Text::CSV



      3. Parse your file



        $ perl -MText::CSV -le '
        $csv = Text::CSV->new({binary=>1});
        while ($row = $csv->getline(STDIN)){
        print "1:$row->[0], 2:$row->[1], 3:$row->[2], 4:$row->[3]"}' < file.csv
        1:prabhat,kumar, 2:19, 3:2000, 4:bangalore,India


        As you can see above, the 1st field is $row->[0], the second $row->[1] etc.






      That was the correct way. A simpler, but dirty, hack would be to replace any quoted commas with another character. Then, use awk normally and finally, switch them back to commas again. I am using ### here but you can use anything you are sure will never be in one of your fields.



      $ sed -r 's/("[^",]+),([^",]+")/1###2/g' file.csv | 
      awk -F, '{print $1,$3}' | sed 's/###/,/g'
      "prabhat,kumar" 2000






      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Nov 10 '14 at 14:49

























      answered Aug 11 '14 at 13:09









      terdonterdon

      129k32253428




      129k32253428












      • Your script has a problem if the quoted field doesn't contain a comma.echo '"prabhat kumar",19,2000,"bangalore,India"'|sed -r 's/("[^"]+),([^"]+")/1###2/g' prints "prabhat kumar",19###2000,"bangalore,India"
        – Doug O'Neal
        Aug 11 '14 at 16:56












      • First field in the above should be "prabhat kumar" - sorry
        – Doug O'Neal
        Aug 11 '14 at 17:03










      • I cut the wrong line: echo '"prabhat kumar",19,2000,"bangalore,India"'| sed -r 's/("[^"]+),([^"]+")/1###2/g' prints "prabhat kumar",19###2000,"bangalore,India"
        – Doug O'Neal
        Aug 11 '14 at 17:04












      • @DougO'Neal yeah, I see the problem, thanks. I don't have time to fix it right now but anyway, as I said, that's just a dirty hack and a proper parser should be used instead.
        – terdon
        Aug 11 '14 at 17:07










      • OP asked for awk
        – Andrew
        Aug 15 '14 at 19:37


















      • Your script has a problem if the quoted field doesn't contain a comma.echo '"prabhat kumar",19,2000,"bangalore,India"'|sed -r 's/("[^"]+),([^"]+")/1###2/g' prints "prabhat kumar",19###2000,"bangalore,India"
        – Doug O'Neal
        Aug 11 '14 at 16:56












      • First field in the above should be "prabhat kumar" - sorry
        – Doug O'Neal
        Aug 11 '14 at 17:03










      • I cut the wrong line: echo '"prabhat kumar",19,2000,"bangalore,India"'| sed -r 's/("[^"]+),([^"]+")/1###2/g' prints "prabhat kumar",19###2000,"bangalore,India"
        – Doug O'Neal
        Aug 11 '14 at 17:04












      • @DougO'Neal yeah, I see the problem, thanks. I don't have time to fix it right now but anyway, as I said, that's just a dirty hack and a proper parser should be used instead.
        – terdon
        Aug 11 '14 at 17:07










      • OP asked for awk
        – Andrew
        Aug 15 '14 at 19:37
















      Your script has a problem if the quoted field doesn't contain a comma.echo '"prabhat kumar",19,2000,"bangalore,India"'|sed -r 's/("[^"]+),([^"]+")/1###2/g' prints "prabhat kumar",19###2000,"bangalore,India"
      – Doug O'Neal
      Aug 11 '14 at 16:56






      Your script has a problem if the quoted field doesn't contain a comma.echo '"prabhat kumar",19,2000,"bangalore,India"'|sed -r 's/("[^"]+),([^"]+")/1###2/g' prints "prabhat kumar",19###2000,"bangalore,India"
      – Doug O'Neal
      Aug 11 '14 at 16:56














      First field in the above should be "prabhat kumar" - sorry
      – Doug O'Neal
      Aug 11 '14 at 17:03




      First field in the above should be "prabhat kumar" - sorry
      – Doug O'Neal
      Aug 11 '14 at 17:03












      I cut the wrong line: echo '"prabhat kumar",19,2000,"bangalore,India"'| sed -r 's/("[^"]+),([^"]+")/1###2/g' prints "prabhat kumar",19###2000,"bangalore,India"
      – Doug O'Neal
      Aug 11 '14 at 17:04






      I cut the wrong line: echo '"prabhat kumar",19,2000,"bangalore,India"'| sed -r 's/("[^"]+),([^"]+")/1###2/g' prints "prabhat kumar",19###2000,"bangalore,India"
      – Doug O'Neal
      Aug 11 '14 at 17:04














      @DougO'Neal yeah, I see the problem, thanks. I don't have time to fix it right now but anyway, as I said, that's just a dirty hack and a proper parser should be used instead.
      – terdon
      Aug 11 '14 at 17:07




      @DougO'Neal yeah, I see the problem, thanks. I don't have time to fix it right now but anyway, as I said, that's just a dirty hack and a proper parser should be used instead.
      – terdon
      Aug 11 '14 at 17:07












      OP asked for awk
      – Andrew
      Aug 15 '14 at 19:37




      OP asked for awk
      – Andrew
      Aug 15 '14 at 19:37













      3














      If you have GNU awk:



      $ awk -vFPAT='[^,]*|"[^"]*"' '{ gsub("^"|"$","",$1); gsub("^"|"$","",$4); print $1 $4} '
      prabhat,kumarbangalore,India


      The output format is a little ugly as I've only printed $1 and $4 next to each other - I'm sure you can alter it to your taste.



      If you need to retain the double quotes around the fields, remove both gsub(); functions.



      Explanation:



      Normally, awk separates fields in a record by the content of the FS (Field Separator) variable, which defaults to any whitespace (tab, space and newline). The separator tells awk where a record ends. In a csv file, a record ends at a comma (passed to awk as -vFS=,), but of course, in examples similar to yours, this is too simple and it breaks.



      Alternatively, the FPAT (Field Pattern) defines the record within awk. Instead of telling awk where the record ends, you create a definition that encompasses a whole record. In the complex csv of your example, this is [^,]*|"[^"]*"



      This breaks down as follows:-




      • Characters that are not commas ([^,]) as many times as possible (*). Everything between two commas is a field.

      • Or (|)

      • A single double-quote (") followed by no double-quotes ([^"]) as many times as possible (*) followed by a single double-quote ("). Everything within double-quotes (including a comma) counts as one field.






      share|improve this answer




























        3














        If you have GNU awk:



        $ awk -vFPAT='[^,]*|"[^"]*"' '{ gsub("^"|"$","",$1); gsub("^"|"$","",$4); print $1 $4} '
        prabhat,kumarbangalore,India


        The output format is a little ugly as I've only printed $1 and $4 next to each other - I'm sure you can alter it to your taste.



        If you need to retain the double quotes around the fields, remove both gsub(); functions.



        Explanation:



        Normally, awk separates fields in a record by the content of the FS (Field Separator) variable, which defaults to any whitespace (tab, space and newline). The separator tells awk where a record ends. In a csv file, a record ends at a comma (passed to awk as -vFS=,), but of course, in examples similar to yours, this is too simple and it breaks.



        Alternatively, the FPAT (Field Pattern) defines the record within awk. Instead of telling awk where the record ends, you create a definition that encompasses a whole record. In the complex csv of your example, this is [^,]*|"[^"]*"



        This breaks down as follows:-




        • Characters that are not commas ([^,]) as many times as possible (*). Everything between two commas is a field.

        • Or (|)

        • A single double-quote (") followed by no double-quotes ([^"]) as many times as possible (*) followed by a single double-quote ("). Everything within double-quotes (including a comma) counts as one field.






        share|improve this answer


























          3












          3








          3






          If you have GNU awk:



          $ awk -vFPAT='[^,]*|"[^"]*"' '{ gsub("^"|"$","",$1); gsub("^"|"$","",$4); print $1 $4} '
          prabhat,kumarbangalore,India


          The output format is a little ugly as I've only printed $1 and $4 next to each other - I'm sure you can alter it to your taste.



          If you need to retain the double quotes around the fields, remove both gsub(); functions.



          Explanation:



          Normally, awk separates fields in a record by the content of the FS (Field Separator) variable, which defaults to any whitespace (tab, space and newline). The separator tells awk where a record ends. In a csv file, a record ends at a comma (passed to awk as -vFS=,), but of course, in examples similar to yours, this is too simple and it breaks.



          Alternatively, the FPAT (Field Pattern) defines the record within awk. Instead of telling awk where the record ends, you create a definition that encompasses a whole record. In the complex csv of your example, this is [^,]*|"[^"]*"



          This breaks down as follows:-




          • Characters that are not commas ([^,]) as many times as possible (*). Everything between two commas is a field.

          • Or (|)

          • A single double-quote (") followed by no double-quotes ([^"]) as many times as possible (*) followed by a single double-quote ("). Everything within double-quotes (including a comma) counts as one field.






          share|improve this answer














          If you have GNU awk:



          $ awk -vFPAT='[^,]*|"[^"]*"' '{ gsub("^"|"$","",$1); gsub("^"|"$","",$4); print $1 $4} '
          prabhat,kumarbangalore,India


          The output format is a little ugly as I've only printed $1 and $4 next to each other - I'm sure you can alter it to your taste.



          If you need to retain the double quotes around the fields, remove both gsub(); functions.



          Explanation:



          Normally, awk separates fields in a record by the content of the FS (Field Separator) variable, which defaults to any whitespace (tab, space and newline). The separator tells awk where a record ends. In a csv file, a record ends at a comma (passed to awk as -vFS=,), but of course, in examples similar to yours, this is too simple and it breaks.



          Alternatively, the FPAT (Field Pattern) defines the record within awk. Instead of telling awk where the record ends, you create a definition that encompasses a whole record. In the complex csv of your example, this is [^,]*|"[^"]*"



          This breaks down as follows:-




          • Characters that are not commas ([^,]) as many times as possible (*). Everything between two commas is a field.

          • Or (|)

          • A single double-quote (") followed by no double-quotes ([^"]) as many times as possible (*) followed by a single double-quote ("). Everything within double-quotes (including a comma) counts as one field.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Apr 29 '18 at 22:32









          Jeff Schaller

          39.2k1054125




          39.2k1054125










          answered Aug 11 '14 at 13:11









          garethTheRedgarethTheRed

          24.1k36180




          24.1k36180























              1














              Ruby is handy for CSV parsing:



              ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|")' file




              prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS
              prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS


              Note that there are no quotes in the outputs. This is because none of the fields contain the field separator. If you need quotes, you can forcibly quote all fields (even integers):



              ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|",:force_quotes=>true)' file




              "prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"
              "prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"





              answered Aug 11 '14 at 18:27 by glenn jackman

              • OP asked for awk
                – Andrew
                Aug 15 '14 at 19:38
              0














              You can use perl instead:



              $ echo '"prabhat,kumar",19,2000,"bangalore,India"' |
              perl -F',(?![[:alpha:]])' -anle 'print "$F[0] $F[3]"'
              "prabhat,kumar" "bangalore,India"





              answered Aug 11 '14 at 12:07 by cuonglm

              • 1
                Clever! However, it will fail for cases where there is a non-letter after the comma, for example prabhat, (kumar), or a letter after a non-quoted comma: "prabhat,kumar",A 19,
                – terdon
                Aug 11 '14 at 12:53

              • @terdon: Yes, of course, as it's desired. Your answer using Text::CSV is better for this job.
                – cuonglm
                Aug 11 '14 at 14:06
              0














              This worked for me (note that it strips the quotes and assumes the exact four-field layout):



              $ echo '"prabhat,kumar",19,2000,"bangalore,India"' |
              awk -F, '{print $1,$2,$3,$4,$5,$6}' |
              awk -F'"' '{print $2,$3,$4}' |
              awk '{print $1","$2,$3,$4,$5","$6}'





              edited Aug 11 '14 at 14:07 by terdon
              answered Aug 11 '14 at 13:26 by Andrew

              • 1
                please add an explanation to your answer
                – Networker
                Aug 11 '14 at 13:54

              • 1
                This will only work if you know the format beforehand. What if it is the 3rd field that's quoted and not the 1st?
                – terdon
                Aug 11 '14 at 14:08

              • OP gave a specific format, no need nor expectation for the format to change. "Last,First",Age,Year,"City,Country". Bangalore is a city in Karnataka, so state is undefined in the data. First and fourth fields are text, so expect quotes. Second and third fields are integers and would never be quoted. Understand the data and avoid overcomplicating the problem.
                – Andrew
                Aug 15 '14 at 19:34
              0














              Probably a sed solution would be more your speed?



              sed 's/.*/,&,/;:t
              s/,"/"/;s/",/"/;s/,\([^"]*\),/"\1/;tt
              ' <<DATA
              "prabhat,kumar",19,2000,"bangalore,India"
              DATA


              OUTPUT



              "prabhat,kumar"19"2000"bangalore,India"





                  edited Aug 11 '14 at 22:17
                  answered Aug 11 '14 at 21:46 by mikeserv