combine two files to single file with combined columns























I need to combine two files into a single file with all columns from both files.



I am providing my example files.
File 1



chr loc T1  C1
chr1 100 2 3
chr1 200 3 4
chr2 100 1 4
chr2 400 3 1


File 2



chr loc T2  C2
chr1 100 1 2
chr1 300 4 1
chr2 100 7 5
chr2 500 1 9


The output file should look like this: rows are matched on chr and loc, and values missing from one file are filled with 0.



chr loc T1  C1  T2  C2
chr1 100 2 3 1 2
chr1 200 3 4 0 0
chr1 300 0 0 4 1
chr2 100 1 4 7 5
chr2 400 3 1 0 0
chr2 500 0 0 1 9















      shell text-processing join














      asked Jul 16 '15 at 20:52 by Naresh DJ
      edited Nov 24 at 20:12 by Rui F Ribeiro






















          1 Answer (accepted, 5 votes)










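          # sed -E: fuse the first two fields into one key (chr1_100);
          # LC_ALL=C on the last sort keeps the header line first.
          join -a1 -a2 -e 0 -o 0,1.2,1.3,2.2,2.3 \
              <(sed -E 's/ +/_/' file1 | sort) \
              <(sed -E 's/ +/_/' file2 | sort) |
          sed 's/_/ /' |
          column -t |
          LC_ALL=C sort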




          chr   loc  T1  C1  T2  C2
          chr1  100  2   3   1   2
          chr1  200  3   4   0   0
          chr1  300  0   0   4   1
          chr2  100  1   4   7   5
          chr2  400  3   1   0   0
          chr2  500  0   0   1   9


          The trickiest part is the reason for sed: join can only join on a single field, but here the join key is the first two fields. So we have to combine those fields into a single word: I replace the first run of spaces with an underscore so join sees chr1_100, chr1_200, etc.
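To see that substitution in isolation, here is a small sketch. The function name combine_key is made up for illustration, and it assumes a sed that accepts -E (GNU and BSD sed both do), so that + means "one or more".

```shell
# Hypothetical helper: replace only the first run of spaces on each line,
# so the first two fields fuse into a single join key.
combine_key() {
    sed -E 's/ +/_/'
}

printf '%s\n' 'chr loc T1  C1' 'chr1 100 2 3' | combine_key
# prints:
# chr_loc T1  C1
# chr1_100 2 3
```

Only the first match on each line is replaced because the s command has no g flag, so the remaining fields keep their separators.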



          join requires its input files to be sorted.



          I use process substitution so that join can read the sed | sort pipelines as if they were files.



          Then another sed call to undo the combined field, and then column to make it pretty.



          By default, join uses the first field of each file as the key field.



          By default, join does an inner join: only keys present in both files are printed. The -a1 and -a2 options enable the full outer join we want. The -e option supplies the replacement value for missing fields, and the -o option lists the fields to output (0 is the join field); -e applies only to fields selected with -o.
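A minimal sketch of the difference between the inner and the full outer join, using two tiny made-up files with a single key column. The helper name outer_join and the file names left and right are only for illustration.

```shell
# Hypothetical helper: full outer join of two sorted files on field 1,
# printing the key plus field 2 of each file, with 0 for missing values.
outer_join() {
    join -a1 -a2 -e 0 -o 0,1.2,2.2 "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' 'a 1' 'b 2' > "$tmp/left"
printf '%s\n' 'b 20' 'c 30' > "$tmp/right"

join "$tmp/left" "$tmp/right"          # inner join, prints: b 2 20
outer_join "$tmp/left" "$tmp/right"    # prints: a 1 0, b 2 20, c 0 30 (one per line)

rm -r "$tmp"
```

Key a exists only in the left file and key c only in the right one, so they appear only in the outer join, padded with the -e value.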





          You can also use awk:



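          awk '
              # Build the two-column key (chr, loc) for every line of both files.
              {key = $1 OFS $2}
              # First file: remember its T1 and C1 values by key.
              NR == FNR {f1[key] = $3; f2[key] = $4; next}
              # Second file, key absent from file1: pad T1 and C1 with zeros.
              !(key in f1) {print $1, $2, 0, 0, $3, $4; next}
              # Key in both files: print the merged row and mark it as done.
              {print key, f1[key], f2[key], $3, $4; delete f1[key]}
              # Keys only in file1: pad T2 and C2 with zeros.
              END {for (key in f1) print key, f1[key], f2[key], 0, 0}
          ' file1 file2 | LC_ALL=C sort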




          chr loc T1 C1 T2 C2
          chr1 100 2 3 1 2
          chr1 200 3 4 0 0
          chr1 300 0 0 4 1
          chr2 100 1 4 7 5
          chr2 400 3 1 0 0
          chr2 500 0 0 1 9
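If you would rather not rely on sort order to keep the header on top, one possible variant (a sketch, not part of the original answer) handles the header lines inside awk and sorts only the data rows, comparing loc numerically. The function name merge_columns and the temporary demo files are made up for illustration, and it assumes both files are non-empty and start with a header line.

```shell
# Hypothetical helper: merge two "chr loc X Y" files on (chr, loc).
# Assumes each file is non-empty and starts with a header line.
merge_columns() {
    awk '
        # Header of the first file: remember its column names.
        FNR == 1 && NR == 1 {hdr = $1 OFS $2 OFS $3 OFS $4; next}
        # Header of the second file: print the merged header.
        FNR == 1 {print hdr, $3, $4; next}
        {key = $1 OFS $2}
        # First file: store both value columns by key.
        NR == FNR {t[key] = $3; c[key] = $4; next}
        # Second file: pad with zeros when the key is new, else merge.
        !(key in t) {print $1, $2, 0, 0, $3, $4; next}
        {print key, t[key], c[key], $3, $4; delete t[key]}
        # Keys seen only in the first file.
        END {for (key in t) print key, t[key], c[key], 0, 0}
    ' "$1" "$2" |
    {
        # Pass the header through untouched, then sort only the data rows.
        IFS= read -r header
        printf '%s\n' "$header"
        LC_ALL=C sort -k1,1 -k2,2n
    }
}

# Demo with the sample data from the question, in a throwaway directory.
tmp=$(mktemp -d)
printf '%s\n' 'chr loc T1  C1' 'chr1 100 2 3' 'chr1 200 3 4' \
    'chr2 100 1 4' 'chr2 400 3 1' > "$tmp/file1"
printf '%s\n' 'chr loc T2  C2' 'chr1 100 1 2' 'chr1 300 4 1' \
    'chr2 100 7 5' 'chr2 500 1 9' > "$tmp/file2"
merge_columns "$tmp/file1" "$tmp/file2"
rm -r "$tmp"
```

The numeric sort on the second field matters once loc values have different lengths: a plain sort would place 1000 before 200.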




























          • join is still greek to me. I've never yet climbed that mountain.
            – mikeserv
            Jul 16 '15 at 21:15










          • Hi @glenn jackmann, it's not giving the output you mentioned. Instead it gives output like this: chr1 100 2 100 1 chr1 100 2 300 4 chr1 200 3 100 1 chr1 200 3 300 4 chr2 100 1 100 7 chr2 100 1 500 1 chr2 400 3 100 7 chr2 400 3 500 1 chr loc T1 loc T2
            – Naresh DJ
            Jul 16 '15 at 21:40












          • @NareshDJ - you're getting the header on the last line because of your locale; add LANG=C or LC_ALL=C before the last sort e.g. ... | LANG=C sort
            – don_crissti
            Jul 17 '15 at 15:36





















          answered Jul 16 '15 at 21:13 by glenn jackman (edited Jul 17 '15 at 14:01)











