Combine two files into a single file with combined columns
I need to combine two files into a single file with all columns from both files.
I am providing my example files.
File 1
chr loc T1 C1
chr1 100 2 3
chr1 200 3 4
chr2 100 1 4
chr2 400 3 1
File 2
chr loc T2 C2
chr1 100 1 2
chr1 300 4 1
chr2 100 7 5
chr2 500 1 9
The output file should look like this:
chr loc T1 C1 T2 C2
chr1 100 2 3 1 2
chr1 200 3 4 0 0
chr1 300 0 0 4 1
chr2 100 1 4 7 5
chr2 400 3 1 0 0
chr2 500 0 0 1 9
shell text-processing join
edited Nov 24 at 20:12 by Rui F Ribeiro
asked Jul 16 '15 at 20:52 by Naresh DJ
1 Answer
join -a1 -a2 -e 0 -o 0,1.2,1.3,2.2,2.3 \
    <(sed -E 's/[[:space:]]+/_/' file1 | sort) \
    <(sed -E 's/[[:space:]]+/_/' file2 | sort) |
sed 's/_/ /' |
column -t |
sort
chr loc T1 C1 T2 C2
chr1 100 2 3 1 2
chr1 200 3 4 0 0
chr1 300 0 0 4 1
chr2 100 1 4 7 5
chr2 400 3 1 0 0
chr2 500 0 0 1 9
The trickiest part here is the reason for sed: join will only join on a single field, and here the join key is the first two fields. So we have to combine those fields into a single word: I replace the first sequence of whitespace with an underscore, so join will see chr1_100, chr1_200, etc.
- join requires its input files to be sorted.
- I use process substitution so that join can work with the sed | sort pipelines as if they were files.
- Then another sed call undoes the combined field, and column makes the output pretty.
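To see the key-merging step in isolation, here is a minimal sketch on three made-up rows shaped like file1. The variable names `rows`, `keyed` and `restored` are only for illustration, and the sketch assumes the first field never contains an underscore itself, because the restoring sed turns the first underscore on each line back into a space.

```shell
# Sample rows in the same layout as file1 (header plus two data rows).
rows='chr loc T1 C1
chr1 100 2 3
chr2 400 3 1'

# Merge the first two fields into one key by replacing only the first
# run of whitespace on each line, then sort as join expects.
keyed=$(printf '%s\n' "$rows" | sed -E 's/[[:space:]]+/_/' | LC_ALL=C sort)
printf '%s\n' "$keyed"

# Undo the merge: turn the first underscore back into a space.
restored=$(printf '%s\n' "$keyed" | sed 's/_/ /')
printf '%s\n' "$restored"
```

The sort here runs under LC_ALL=C only to make the order reproducible; what matters for join is that both inputs are sorted the same way.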
By default, join uses the first field of each file as the key field.
By default, join does an inner join: only keys present in both files are printed. The -a1 and -a2 options enable the full outer join we want. The -e option provides the default value for null fields, and we need the -o option to specify that we want all the fields.
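The effect of those options can be sketched on two tiny inputs that are already keyed and sorted. The files a.txt and b.txt and the temporary directory are hypothetical stand-ins, used instead of process substitution so the sketch also runs in a plain POSIX shell.

```shell
# Work in a throwaway directory; a.txt and b.txt are hypothetical stand-ins
# for the keyed and sorted versions of file1 and file2.
tmp=$(mktemp -d)
printf '%s\n' 'chr1_100 2 3' 'chr1_200 3 4' > "$tmp/a.txt"
printf '%s\n' 'chr1_100 1 2' 'chr1_300 4 1' > "$tmp/b.txt"

# Default behaviour is an inner join: only chr1_100 appears in both files.
inner=$(join "$tmp/a.txt" "$tmp/b.txt")
printf 'inner:\n%s\n' "$inner"

# -a1 -a2 adds the unpaired lines from both files, -o lists the output
# fields (0 is the join key), and -e 0 fills the fields that are missing.
outer=$(join -a1 -a2 -e 0 -o 0,1.2,1.3,2.2,2.3 "$tmp/a.txt" "$tmp/b.txt")
printf 'outer:\n%s\n' "$outer"

rm -r "$tmp"
```

The inner join should keep only the chr1_100 line, while the outer join should also list chr1_200 and chr1_300 with zeros in the columns that the other file does not provide.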
You can also use awk:
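awk '
    {key = $1 OFS $2}                                        # composite key: chr and loc
    NR == FNR {f1[key] = $3; f2[key] = $4; next}             # first file: remember T1 and C1
    !(key in f1) {print $1, $2, 0, 0, $3, $4; next}          # key only in file2: pad T1, C1 with 0
    {print key, f1[key], f2[key], $3, $4; delete f1[key]}    # key in both files: print all columns
    END {for (key in f1) print key, f1[key], f2[key], 0, 0}  # keys only in file1: pad T2, C2 with 0
' file1 file2 | sort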
chr loc T1 C1 T2 C2
chr1 100 2 3 1 2
chr1 200 3 4 0 0
chr1 300 0 0 4 1
chr2 100 1 4 7 5
chr2 400 3 1 0 0
chr2 500 0 0 1 9
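Both pipelines sort the header line together with the data, so where the header ends up depends on the locale's collation order, and a plain sort compares loc as text (1000 would sort before 200). One possible variation, sketched here under the assumption that each file has exactly one header line, builds the header separately and sorts only the data rows. The temporary directory and the `header`, `body` and `result` names are illustrative, not part of the original commands.

```shell
# Recreate the sample inputs from the question in a throwaway directory.
tmp=$(mktemp -d)
printf '%s\n' 'chr loc T1 C1' 'chr1 100 2 3' 'chr1 200 3 4' \
    'chr2 100 1 4' 'chr2 400 3 1' > "$tmp/file1"
printf '%s\n' 'chr loc T2 C2' 'chr1 100 1 2' 'chr1 300 4 1' \
    'chr2 100 7 5' 'chr2 500 1 9' > "$tmp/file2"

# Header: all of line 1 of file1 plus the two value columns of file2.
header="$(head -n 1 "$tmp/file1") $(awk 'NR == 1 {print $3, $4; exit}' "$tmp/file2")"

# Data rows: same merge logic as the awk command above, with header lines skipped.
body=$(awk '
    FNR == 1 {next}                                  # skip each header line
    {key = $1 OFS $2}
    NR == FNR {f1[key] = $3; f2[key] = $4; next}
    !(key in f1) {print $1, $2, 0, 0, $3, $4; next}
    {print key, f1[key], f2[key], $3, $4; delete f1[key]}
    END {for (key in f1) print key, f1[key], f2[key], 0, 0}
' "$tmp/file1" "$tmp/file2" | LC_ALL=C sort -k1,1 -k2,2n)

# Header first, then the rows ordered by chr (bytewise) and loc (numeric).
result=$(printf '%s\n%s\n' "$header" "$body")
printf '%s\n' "$result"

rm -r "$tmp"
```

For the sample files this should reproduce the requested output with the header on the first line regardless of the caller's locale.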
edited Jul 17 '15 at 14:01
answered Jul 16 '15 at 21:13 by glenn jackman
join is still Greek to me. I've never yet climbed that mountain. – mikeserv, Jul 16 '15 at 21:15
Hi @glenn jackman, it's not giving the output you mentioned. Instead it gives output like this: chr1 100 2 100 1 chr1 100 2 300 4 chr1 200 3 100 1 chr1 200 3 300 4 chr2 100 1 100 7 chr2 100 1 500 1 chr2 400 3 100 7 chr2 400 3 500 1 chr loc T1 loc T2 – Naresh DJ, Jul 16 '15 at 21:40
@NareshDJ - you're getting the header on the last line because of your locale; add LANG=C or LC_ALL=C before the last sort, e.g. ... | LANG=C sort – don_crissti, Jul 17 '15 at 15:36