Unix script to join data from different rows into one single row
We have a large tsv file where the data for a single row are splitted into different rows with new line delimiter.
We need to join them together based on the tab counts.
For eg: If suppose the total tab count for a single row is 995 , the data is split in between as follows,
Row Number Tab Count
Row 1 660
Row 2 0
Row 3 300
Row 5 20
Row 6 15
Total 995
N.B The above row split is not consitent and varies.
I want to add the tab counts and once we get 995 as total , need to join the data from different rows into one single row.
We have the below command to join the lines based on new line delimiter.
paste -sd 'n' inputfile > output file
I want to know,
- If we can get the tab counts for different rows
- Add the tab counts to get sum of 995
- Once the sum is achieved , whichever tab counts were added from those rows , need to be joined into one single row.
Please let me know if this can be achieved using shell script.
Thanks.!
shell-script text-processing
add a comment |
We have a large tsv file where the data for a single row are splitted into different rows with new line delimiter.
We need to join them together based on the tab counts.
For eg: If suppose the total tab count for a single row is 995 , the data is split in between as follows,
Row Number Tab Count
Row 1 660
Row 2 0
Row 3 300
Row 5 20
Row 6 15
Total 995
N.B The above row split is not consitent and varies.
I want to add the tab counts and once we get 995 as total , need to join the data from different rows into one single row.
We have the below command to join the lines based on new line delimiter.
paste -sd 'n' inputfile > output file
I want to know,
- If we can get the tab counts for different rows
- Add the tab counts to get sum of 995
- Once the sum is achieved , whichever tab counts were added from those rows , need to be joined into one single row.
Please let me know if this can be achieved using shell script.
Thanks.!
shell-script text-processing
It sounds like you are trying to repair a file with newlines in their data fields so that each line has a predefined number of columns. Is the data not properly quoted? If it is, it may still be a perfectly valid CSV file with tabs as field delimiters. CSV is allowed to contain newlines within quoted fields and a CSV parser will be able to handle the data.
– Kusalananda
Dec 11 at 6:43
This is a TSV file we are working on and the data is not quoted and only separated by tabs and terminating with new line
– v.rajan
Dec 11 at 6:46
Sorry, your description doesn't make any sense to me. Have you combined the raw data and the desired result into on file above? And what is the single row then supposed to look like?
– tink
Dec 11 at 7:39
add a comment |
We have a large tsv file where the data for a single row are splitted into different rows with new line delimiter.
We need to join them together based on the tab counts.
For eg: If suppose the total tab count for a single row is 995 , the data is split in between as follows,
Row Number Tab Count
Row 1 660
Row 2 0
Row 3 300
Row 5 20
Row 6 15
Total 995
N.B The above row split is not consitent and varies.
I want to add the tab counts and once we get 995 as total , need to join the data from different rows into one single row.
We have the below command to join the lines based on new line delimiter.
paste -sd 'n' inputfile > output file
I want to know,
- If we can get the tab counts for different rows
- Add the tab counts to get sum of 995
- Once the sum is achieved , whichever tab counts were added from those rows , need to be joined into one single row.
Please let me know if this can be achieved using shell script.
Thanks.!
shell-script text-processing
We have a large tsv file where the data for a single row are splitted into different rows with new line delimiter.
We need to join them together based on the tab counts.
For eg: If suppose the total tab count for a single row is 995 , the data is split in between as follows,
Row Number Tab Count
Row 1 660
Row 2 0
Row 3 300
Row 5 20
Row 6 15
Total 995
N.B The above row split is not consitent and varies.
I want to add the tab counts and once we get 995 as total , need to join the data from different rows into one single row.
We have the below command to join the lines based on new line delimiter.
paste -sd 'n' inputfile > output file
I want to know,
- If we can get the tab counts for different rows
- Add the tab counts to get sum of 995
- Once the sum is achieved , whichever tab counts were added from those rows , need to be joined into one single row.
Please let me know if this can be achieved using shell script.
Thanks.!
shell-script text-processing
shell-script text-processing
edited Dec 11 at 10:47
Jeff Schaller
38.5k1053125
38.5k1053125
asked Dec 11 at 6:31
v.rajan
93
93
It sounds like you are trying to repair a file with newlines in their data fields so that each line has a predefined number of columns. Is the data not properly quoted? If it is, it may still be a perfectly valid CSV file with tabs as field delimiters. CSV is allowed to contain newlines within quoted fields and a CSV parser will be able to handle the data.
– Kusalananda
Dec 11 at 6:43
This is a TSV file we are working on and the data is not quoted and only separated by tabs and terminating with new line
– v.rajan
Dec 11 at 6:46
Sorry, your description doesn't make any sense to me. Have you combined the raw data and the desired result into on file above? And what is the single row then supposed to look like?
– tink
Dec 11 at 7:39
add a comment |
It sounds like you are trying to repair a file with newlines in their data fields so that each line has a predefined number of columns. Is the data not properly quoted? If it is, it may still be a perfectly valid CSV file with tabs as field delimiters. CSV is allowed to contain newlines within quoted fields and a CSV parser will be able to handle the data.
– Kusalananda
Dec 11 at 6:43
This is a TSV file we are working on and the data is not quoted and only separated by tabs and terminating with new line
– v.rajan
Dec 11 at 6:46
Sorry, your description doesn't make any sense to me. Have you combined the raw data and the desired result into on file above? And what is the single row then supposed to look like?
– tink
Dec 11 at 7:39
It sounds like you are trying to repair a file with newlines in their data fields so that each line has a predefined number of columns. Is the data not properly quoted? If it is, it may still be a perfectly valid CSV file with tabs as field delimiters. CSV is allowed to contain newlines within quoted fields and a CSV parser will be able to handle the data.
– Kusalananda
Dec 11 at 6:43
It sounds like you are trying to repair a file with newlines in their data fields so that each line has a predefined number of columns. Is the data not properly quoted? If it is, it may still be a perfectly valid CSV file with tabs as field delimiters. CSV is allowed to contain newlines within quoted fields and a CSV parser will be able to handle the data.
– Kusalananda
Dec 11 at 6:43
This is a TSV file we are working on and the data is not quoted and only separated by tabs and terminating with new line
– v.rajan
Dec 11 at 6:46
This is a TSV file we are working on and the data is not quoted and only separated by tabs and terminating with new line
– v.rajan
Dec 11 at 6:46
Sorry, your description doesn't make any sense to me. Have you combined the raw data and the desired result into on file above? And what is the single row then supposed to look like?
– tink
Dec 11 at 7:39
Sorry, your description doesn't make any sense to me. Have you combined the raw data and the desired result into on file above? And what is the single row then supposed to look like?
– tink
Dec 11 at 7:39
add a comment |
2 Answers
2
active
oldest
votes
As always with these type of questions, it would be better to correct the process that creates the data in the first place rather than appending a post-processing stage to the process. Having said that, here's what you can do.
$ cat file
1 2
3
1 2 3
1
2
3
$ awk -v w=3 -f script.awk file
1 2 3
1 2 3
1 2 3
This awk script will collect the tab-delimited fields from the input until a preset number of fields have been collected. It will then output these collected fields as its own line before continuing reading from the input.
The number of fields in the output is given by the value of w, which is passed on the command line as shown above. Note that this is the number of fields, not the number of tab characters.
BEGIN { OFS = FS = "t" }
function output_line () {
# a function that outputs the nf elements in the array a
# separated by OFS (tab) and terminated by ORS (newline)
for (j = 1; j < nf; ++j)
printf("%s%s", a[j], OFS)
printf("%s%s", a[nf], ORS)
}
{
# a: an array of fields that we want to output together
# nf: the length of that array
# just add each field to the a array
for (i = 1; i <= NF; ++i) {
a[++nf] = $i
# if enough has been read, output the collected data
if (nf == w) {
output_line()
nf = 0
}
}
}
END {
# output any data remaining in a
if (nf > 0)
output_line()
}
This is the same as
tr 't' 'n' <file | paste - - -
for my small example. In your case, you could use the awk script above with -v w=996, or you could type the tr+paste command with 996 dashes.
add a comment |
Would continued reading lines until reaching field count help? From another post:
awk -F't' '
{while (NF<996) {getline X
$0 = $0 FS X
}
}
1
' file
The issue with this may be that you overshoot. It depends on whether the input data always contains a newline at the true record boundaries or not.
– Kusalananda
Dec 11 at 10:26
Absolutely. I take from the question that we have always an exact count match. Additional care needs to be taken if not.
– RudiC
Dec 11 at 10:33
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "106"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f487272%2funix-script-to-join-data-from-different-rows-into-one-single-row%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
As always with these type of questions, it would be better to correct the process that creates the data in the first place rather than appending a post-processing stage to the process. Having said that, here's what you can do.
$ cat file
1 2
3
1 2 3
1
2
3
$ awk -v w=3 -f script.awk file
1 2 3
1 2 3
1 2 3
This awk script will collect the tab-delimited fields from the input until a preset number of fields have been collected. It will then output these collected fields as its own line before continuing reading from the input.
The number of fields in the output is given by the value of w, which is passed on the command line as shown above. Note that this is the number of fields, not the number of tab characters.
BEGIN { OFS = FS = "t" }
function output_line () {
# a function that outputs the nf elements in the array a
# separated by OFS (tab) and terminated by ORS (newline)
for (j = 1; j < nf; ++j)
printf("%s%s", a[j], OFS)
printf("%s%s", a[nf], ORS)
}
{
# a: an array of fields that we want to output together
# nf: the length of that array
# just add each field to the a array
for (i = 1; i <= NF; ++i) {
a[++nf] = $i
# if enough has been read, output the collected data
if (nf == w) {
output_line()
nf = 0
}
}
}
END {
# output any data remaining in a
if (nf > 0)
output_line()
}
This is the same as
tr 't' 'n' <file | paste - - -
for my small example. In your case, you could use the awk script above with -v w=996, or you could type the tr+paste command with 996 dashes.
add a comment |
As always with these type of questions, it would be better to correct the process that creates the data in the first place rather than appending a post-processing stage to the process. Having said that, here's what you can do.
$ cat file
1 2
3
1 2 3
1
2
3
$ awk -v w=3 -f script.awk file
1 2 3
1 2 3
1 2 3
This awk script will collect the tab-delimited fields from the input until a preset number of fields have been collected. It will then output these collected fields as its own line before continuing reading from the input.
The number of fields in the output is given by the value of w, which is passed on the command line as shown above. Note that this is the number of fields, not the number of tab characters.
BEGIN { OFS = FS = "t" }
function output_line () {
# a function that outputs the nf elements in the array a
# separated by OFS (tab) and terminated by ORS (newline)
for (j = 1; j < nf; ++j)
printf("%s%s", a[j], OFS)
printf("%s%s", a[nf], ORS)
}
{
# a: an array of fields that we want to output together
# nf: the length of that array
# just add each field to the a array
for (i = 1; i <= NF; ++i) {
a[++nf] = $i
# if enough has been read, output the collected data
if (nf == w) {
output_line()
nf = 0
}
}
}
END {
# output any data remaining in a
if (nf > 0)
output_line()
}
This is the same as
tr 't' 'n' <file | paste - - -
for my small example. In your case, you could use the awk script above with -v w=996, or you could type the tr+paste command with 996 dashes.
add a comment |
As always with these type of questions, it would be better to correct the process that creates the data in the first place rather than appending a post-processing stage to the process. Having said that, here's what you can do.
$ cat file
1 2
3
1 2 3
1
2
3
$ awk -v w=3 -f script.awk file
1 2 3
1 2 3
1 2 3
This awk script will collect the tab-delimited fields from the input until a preset number of fields have been collected. It will then output these collected fields as its own line before continuing reading from the input.
The number of fields in the output is given by the value of w, which is passed on the command line as shown above. Note that this is the number of fields, not the number of tab characters.
BEGIN { OFS = FS = "t" }
function output_line () {
# a function that outputs the nf elements in the array a
# separated by OFS (tab) and terminated by ORS (newline)
for (j = 1; j < nf; ++j)
printf("%s%s", a[j], OFS)
printf("%s%s", a[nf], ORS)
}
{
# a: an array of fields that we want to output together
# nf: the length of that array
# just add each field to the a array
for (i = 1; i <= NF; ++i) {
a[++nf] = $i
# if enough has been read, output the collected data
if (nf == w) {
output_line()
nf = 0
}
}
}
END {
# output any data remaining in a
if (nf > 0)
output_line()
}
This is the same as
tr 't' 'n' <file | paste - - -
for my small example. In your case, you could use the awk script above with -v w=996, or you could type the tr+paste command with 996 dashes.
As always with these type of questions, it would be better to correct the process that creates the data in the first place rather than appending a post-processing stage to the process. Having said that, here's what you can do.
$ cat file
1 2
3
1 2 3
1
2
3
$ awk -v w=3 -f script.awk file
1 2 3
1 2 3
1 2 3
This awk script will collect the tab-delimited fields from the input until a preset number of fields have been collected. It will then output these collected fields as its own line before continuing reading from the input.
The number of fields in the output is given by the value of w, which is passed on the command line as shown above. Note that this is the number of fields, not the number of tab characters.
BEGIN { OFS = FS = "t" }
function output_line () {
# a function that outputs the nf elements in the array a
# separated by OFS (tab) and terminated by ORS (newline)
for (j = 1; j < nf; ++j)
printf("%s%s", a[j], OFS)
printf("%s%s", a[nf], ORS)
}
{
# a: an array of fields that we want to output together
# nf: the length of that array
# just add each field to the a array
for (i = 1; i <= NF; ++i) {
a[++nf] = $i
# if enough has been read, output the collected data
if (nf == w) {
output_line()
nf = 0
}
}
}
END {
# output any data remaining in a
if (nf > 0)
output_line()
}
This is the same as
tr 't' 'n' <file | paste - - -
for my small example. In your case, you could use the awk script above with -v w=996, or you could type the tr+paste command with 996 dashes.
edited Dec 11 at 8:11
answered Dec 11 at 7:38
Kusalananda
121k16228372
121k16228372
add a comment |
add a comment |
Would continued reading lines until reaching field count help? From another post:
awk -F't' '
{while (NF<996) {getline X
$0 = $0 FS X
}
}
1
' file
The issue with this may be that you overshoot. It depends on whether the input data always contains a newline at the true record boundaries or not.
– Kusalananda
Dec 11 at 10:26
Absolutely. I take from the question that we have always an exact count match. Additional care needs to be taken if not.
– RudiC
Dec 11 at 10:33
add a comment |
Would continued reading lines until reaching field count help? From another post:
awk -F't' '
{while (NF<996) {getline X
$0 = $0 FS X
}
}
1
' file
The issue with this may be that you overshoot. It depends on whether the input data always contains a newline at the true record boundaries or not.
– Kusalananda
Dec 11 at 10:26
Absolutely. I take from the question that we have always an exact count match. Additional care needs to be taken if not.
– RudiC
Dec 11 at 10:33
add a comment |
Would continued reading lines until reaching field count help? From another post:
awk -F't' '
{while (NF<996) {getline X
$0 = $0 FS X
}
}
1
' file
Would continued reading lines until reaching field count help? From another post:
awk -F't' '
{while (NF<996) {getline X
$0 = $0 FS X
}
}
1
' file
answered Dec 11 at 10:20
RudiC
3,9641312
3,9641312
The issue with this may be that you overshoot. It depends on whether the input data always contains a newline at the true record boundaries or not.
– Kusalananda
Dec 11 at 10:26
Absolutely. I take from the question that we have always an exact count match. Additional care needs to be taken if not.
– RudiC
Dec 11 at 10:33
add a comment |
The issue with this may be that you overshoot. It depends on whether the input data always contains a newline at the true record boundaries or not.
– Kusalananda
Dec 11 at 10:26
Absolutely. I take from the question that we have always an exact count match. Additional care needs to be taken if not.
– RudiC
Dec 11 at 10:33
The issue with this may be that you overshoot. It depends on whether the input data always contains a newline at the true record boundaries or not.
– Kusalananda
Dec 11 at 10:26
The issue with this may be that you overshoot. It depends on whether the input data always contains a newline at the true record boundaries or not.
– Kusalananda
Dec 11 at 10:26
Absolutely. I take from the question that we have always an exact count match. Additional care needs to be taken if not.
– RudiC
Dec 11 at 10:33
Absolutely. I take from the question that we have always an exact count match. Additional care needs to be taken if not.
– RudiC
Dec 11 at 10:33
add a comment |
Thanks for contributing an answer to Unix & Linux Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f487272%2funix-script-to-join-data-from-different-rows-into-one-single-row%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
It sounds like you are trying to repair a file with newlines in their data fields so that each line has a predefined number of columns. Is the data not properly quoted? If it is, it may still be a perfectly valid CSV file with tabs as field delimiters. CSV is allowed to contain newlines within quoted fields and a CSV parser will be able to handle the data.
– Kusalananda
Dec 11 at 6:43
This is a TSV file we are working on and the data is not quoted and only separated by tabs and terminating with new line
– v.rajan
Dec 11 at 6:46
Sorry, your description doesn't make any sense to me. Have you combined the raw data and the desired result into on file above? And what is the single row then supposed to look like?
– tink
Dec 11 at 7:39