Unix script to join data from different rows into one single row












0














We have a large tsv file where the data for a single row are splitted into different rows with new line delimiter.



We need to join them together based on the tab counts.



For eg: If suppose the total tab count for a single row is 995 , the data is split in between as follows,



Row Number  Tab Count
Row 1 660
Row 2 0
Row 3 300
Row 5 20
Row 6 15
Total 995


N.B The above row split is not consitent and varies.



I want to add the tab counts and once we get 995 as total , need to join the data from different rows into one single row.



We have the below command to join the lines based on new line delimiter.



paste -sd 'n' inputfile > output file


I want to know,




  1. If we can get the tab counts for different rows

  2. Add the tab counts to get sum of 995

  3. Once the sum is achieved , whichever tab counts were added from those rows , need to be joined into one single row.


Please let me know if this can be achieved using shell script.



Thanks.!










share|improve this question
























  • It sounds like you are trying to repair a file with newlines in their data fields so that each line has a predefined number of columns. Is the data not properly quoted? If it is, it may still be a perfectly valid CSV file with tabs as field delimiters. CSV is allowed to contain newlines within quoted fields and a CSV parser will be able to handle the data.
    – Kusalananda
    Dec 11 at 6:43










  • This is a TSV file we are working on and the data is not quoted and only separated by tabs and terminating with new line
    – v.rajan
    Dec 11 at 6:46












  • Sorry, your description doesn't make any sense to me. Have you combined the raw data and the desired result into on file above? And what is the single row then supposed to look like?
    – tink
    Dec 11 at 7:39
















0














We have a large tsv file where the data for a single row are splitted into different rows with new line delimiter.



We need to join them together based on the tab counts.



For eg: If suppose the total tab count for a single row is 995 , the data is split in between as follows,



Row Number  Tab Count
Row 1 660
Row 2 0
Row 3 300
Row 5 20
Row 6 15
Total 995


N.B The above row split is not consitent and varies.



I want to add the tab counts and once we get 995 as total , need to join the data from different rows into one single row.



We have the below command to join the lines based on new line delimiter.



paste -sd 'n' inputfile > output file


I want to know,




  1. If we can get the tab counts for different rows

  2. Add the tab counts to get sum of 995

  3. Once the sum is achieved , whichever tab counts were added from those rows , need to be joined into one single row.


Please let me know if this can be achieved using shell script.



Thanks.!










share|improve this question
























  • It sounds like you are trying to repair a file with newlines in their data fields so that each line has a predefined number of columns. Is the data not properly quoted? If it is, it may still be a perfectly valid CSV file with tabs as field delimiters. CSV is allowed to contain newlines within quoted fields and a CSV parser will be able to handle the data.
    – Kusalananda
    Dec 11 at 6:43










  • This is a TSV file we are working on and the data is not quoted and only separated by tabs and terminating with new line
    – v.rajan
    Dec 11 at 6:46












  • Sorry, your description doesn't make any sense to me. Have you combined the raw data and the desired result into on file above? And what is the single row then supposed to look like?
    – tink
    Dec 11 at 7:39














0












0








0







We have a large tsv file where the data for a single row are splitted into different rows with new line delimiter.



We need to join them together based on the tab counts.



For eg: If suppose the total tab count for a single row is 995 , the data is split in between as follows,



Row Number  Tab Count
Row 1 660
Row 2 0
Row 3 300
Row 5 20
Row 6 15
Total 995


N.B The above row split is not consitent and varies.



I want to add the tab counts and once we get 995 as total , need to join the data from different rows into one single row.



We have the below command to join the lines based on new line delimiter.



paste -sd 'n' inputfile > output file


I want to know,




  1. If we can get the tab counts for different rows

  2. Add the tab counts to get sum of 995

  3. Once the sum is achieved , whichever tab counts were added from those rows , need to be joined into one single row.


Please let me know if this can be achieved using shell script.



Thanks.!










share|improve this question















We have a large tsv file where the data for a single row are splitted into different rows with new line delimiter.



We need to join them together based on the tab counts.



For eg: If suppose the total tab count for a single row is 995 , the data is split in between as follows,



Row Number  Tab Count
Row 1 660
Row 2 0
Row 3 300
Row 5 20
Row 6 15
Total 995


N.B The above row split is not consitent and varies.



I want to add the tab counts and once we get 995 as total , need to join the data from different rows into one single row.



We have the below command to join the lines based on new line delimiter.



paste -sd 'n' inputfile > output file


I want to know,




  1. If we can get the tab counts for different rows

  2. Add the tab counts to get sum of 995

  3. Once the sum is achieved , whichever tab counts were added from those rows , need to be joined into one single row.


Please let me know if this can be achieved using shell script.



Thanks.!







shell-script text-processing






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Dec 11 at 10:47









Jeff Schaller

38.5k1053125




38.5k1053125










asked Dec 11 at 6:31









v.rajan

93




93












  • It sounds like you are trying to repair a file with newlines in their data fields so that each line has a predefined number of columns. Is the data not properly quoted? If it is, it may still be a perfectly valid CSV file with tabs as field delimiters. CSV is allowed to contain newlines within quoted fields and a CSV parser will be able to handle the data.
    – Kusalananda
    Dec 11 at 6:43










  • This is a TSV file we are working on and the data is not quoted and only separated by tabs and terminating with new line
    – v.rajan
    Dec 11 at 6:46












  • Sorry, your description doesn't make any sense to me. Have you combined the raw data and the desired result into on file above? And what is the single row then supposed to look like?
    – tink
    Dec 11 at 7:39


















  • It sounds like you are trying to repair a file with newlines in their data fields so that each line has a predefined number of columns. Is the data not properly quoted? If it is, it may still be a perfectly valid CSV file with tabs as field delimiters. CSV is allowed to contain newlines within quoted fields and a CSV parser will be able to handle the data.
    – Kusalananda
    Dec 11 at 6:43










  • This is a TSV file we are working on and the data is not quoted and only separated by tabs and terminating with new line
    – v.rajan
    Dec 11 at 6:46












  • Sorry, your description doesn't make any sense to me. Have you combined the raw data and the desired result into on file above? And what is the single row then supposed to look like?
    – tink
    Dec 11 at 7:39
















It sounds like you are trying to repair a file with newlines in their data fields so that each line has a predefined number of columns. Is the data not properly quoted? If it is, it may still be a perfectly valid CSV file with tabs as field delimiters. CSV is allowed to contain newlines within quoted fields and a CSV parser will be able to handle the data.
– Kusalananda
Dec 11 at 6:43




It sounds like you are trying to repair a file with newlines in their data fields so that each line has a predefined number of columns. Is the data not properly quoted? If it is, it may still be a perfectly valid CSV file with tabs as field delimiters. CSV is allowed to contain newlines within quoted fields and a CSV parser will be able to handle the data.
– Kusalananda
Dec 11 at 6:43












This is a TSV file we are working on and the data is not quoted and only separated by tabs and terminating with new line
– v.rajan
Dec 11 at 6:46






This is a TSV file we are working on and the data is not quoted and only separated by tabs and terminating with new line
– v.rajan
Dec 11 at 6:46














Sorry, your description doesn't make any sense to me. Have you combined the raw data and the desired result into on file above? And what is the single row then supposed to look like?
– tink
Dec 11 at 7:39




Sorry, your description doesn't make any sense to me. Have you combined the raw data and the desired result into on file above? And what is the single row then supposed to look like?
– tink
Dec 11 at 7:39










2 Answers
2






active

oldest

votes


















1














As always with these type of questions, it would be better to correct the process that creates the data in the first place rather than appending a post-processing stage to the process. Having said that, here's what you can do.



$ cat file
1 2
3
1 2 3
1
2
3




$ awk -v w=3 -f script.awk file
1 2 3
1 2 3
1 2 3


This awk script will collect the tab-delimited fields from the input until a preset number of fields have been collected. It will then output these collected fields as its own line before continuing reading from the input.



The number of fields in the output is given by the value of w, which is passed on the command line as shown above. Note that this is the number of fields, not the number of tab characters.



BEGIN { OFS = FS = "t" }

function output_line () {
# a function that outputs the nf elements in the array a
# separated by OFS (tab) and terminated by ORS (newline)

for (j = 1; j < nf; ++j)
printf("%s%s", a[j], OFS)

printf("%s%s", a[nf], ORS)
}

{
# a: an array of fields that we want to output together
# nf: the length of that array

# just add each field to the a array
for (i = 1; i <= NF; ++i) {
a[++nf] = $i

# if enough has been read, output the collected data
if (nf == w) {
output_line()
nf = 0
}
}
}

END {
# output any data remaining in a
if (nf > 0)
output_line()
}


This is the same as



tr 't' 'n' <file | paste - - -


for my small example. In your case, you could use the awk script above with -v w=996, or you could type the tr+paste command with 996 dashes.






share|improve this answer































    0














    Would continued reading lines until reaching field count help? From another post:



    awk -F't' '
    {while (NF<996) {getline X
    $0 = $0 FS X
    }
    }
    1
    ' file





    share|improve this answer





















    • The issue with this may be that you overshoot. It depends on whether the input data always contains a newline at the true record boundaries or not.
      – Kusalananda
      Dec 11 at 10:26










    • Absolutely. I take from the question that we have always an exact count match. Additional care needs to be taken if not.
      – RudiC
      Dec 11 at 10:33











    Your Answer








    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "106"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f487272%2funix-script-to-join-data-from-different-rows-into-one-single-row%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    1














    As always with these type of questions, it would be better to correct the process that creates the data in the first place rather than appending a post-processing stage to the process. Having said that, here's what you can do.



    $ cat file
    1 2
    3
    1 2 3
    1
    2
    3




    $ awk -v w=3 -f script.awk file
    1 2 3
    1 2 3
    1 2 3


    This awk script will collect the tab-delimited fields from the input until a preset number of fields have been collected. It will then output these collected fields as its own line before continuing reading from the input.



    The number of fields in the output is given by the value of w, which is passed on the command line as shown above. Note that this is the number of fields, not the number of tab characters.



    BEGIN { OFS = FS = "t" }

    function output_line () {
    # a function that outputs the nf elements in the array a
    # separated by OFS (tab) and terminated by ORS (newline)

    for (j = 1; j < nf; ++j)
    printf("%s%s", a[j], OFS)

    printf("%s%s", a[nf], ORS)
    }

    {
    # a: an array of fields that we want to output together
    # nf: the length of that array

    # just add each field to the a array
    for (i = 1; i <= NF; ++i) {
    a[++nf] = $i

    # if enough has been read, output the collected data
    if (nf == w) {
    output_line()
    nf = 0
    }
    }
    }

    END {
    # output any data remaining in a
    if (nf > 0)
    output_line()
    }


    This is the same as



    tr 't' 'n' <file | paste - - -


    for my small example. In your case, you could use the awk script above with -v w=996, or you could type the tr+paste command with 996 dashes.






    share|improve this answer




























      1














      As always with these type of questions, it would be better to correct the process that creates the data in the first place rather than appending a post-processing stage to the process. Having said that, here's what you can do.



      $ cat file
      1 2
      3
      1 2 3
      1
      2
      3




      $ awk -v w=3 -f script.awk file
      1 2 3
      1 2 3
      1 2 3


      This awk script will collect the tab-delimited fields from the input until a preset number of fields have been collected. It will then output these collected fields as its own line before continuing reading from the input.



      The number of fields in the output is given by the value of w, which is passed on the command line as shown above. Note that this is the number of fields, not the number of tab characters.



      BEGIN { OFS = FS = "t" }

      function output_line () {
      # a function that outputs the nf elements in the array a
      # separated by OFS (tab) and terminated by ORS (newline)

      for (j = 1; j < nf; ++j)
      printf("%s%s", a[j], OFS)

      printf("%s%s", a[nf], ORS)
      }

      {
      # a: an array of fields that we want to output together
      # nf: the length of that array

      # just add each field to the a array
      for (i = 1; i <= NF; ++i) {
      a[++nf] = $i

      # if enough has been read, output the collected data
      if (nf == w) {
      output_line()
      nf = 0
      }
      }
      }

      END {
      # output any data remaining in a
      if (nf > 0)
      output_line()
      }


      This is the same as



      tr 't' 'n' <file | paste - - -


      for my small example. In your case, you could use the awk script above with -v w=996, or you could type the tr+paste command with 996 dashes.






      share|improve this answer


























        1












        1








        1






        As always with these type of questions, it would be better to correct the process that creates the data in the first place rather than appending a post-processing stage to the process. Having said that, here's what you can do.



        $ cat file
        1 2
        3
        1 2 3
        1
        2
        3




        $ awk -v w=3 -f script.awk file
        1 2 3
        1 2 3
        1 2 3


        This awk script will collect the tab-delimited fields from the input until a preset number of fields have been collected. It will then output these collected fields as its own line before continuing reading from the input.



        The number of fields in the output is given by the value of w, which is passed on the command line as shown above. Note that this is the number of fields, not the number of tab characters.



        BEGIN { OFS = FS = "t" }

        function output_line () {
        # a function that outputs the nf elements in the array a
        # separated by OFS (tab) and terminated by ORS (newline)

        for (j = 1; j < nf; ++j)
        printf("%s%s", a[j], OFS)

        printf("%s%s", a[nf], ORS)
        }

        {
        # a: an array of fields that we want to output together
        # nf: the length of that array

        # just add each field to the a array
        for (i = 1; i <= NF; ++i) {
        a[++nf] = $i

        # if enough has been read, output the collected data
        if (nf == w) {
        output_line()
        nf = 0
        }
        }
        }

        END {
        # output any data remaining in a
        if (nf > 0)
        output_line()
        }


        This is the same as



        tr 't' 'n' <file | paste - - -


        for my small example. In your case, you could use the awk script above with -v w=996, or you could type the tr+paste command with 996 dashes.






        share|improve this answer














        As always with these type of questions, it would be better to correct the process that creates the data in the first place rather than appending a post-processing stage to the process. Having said that, here's what you can do.



        $ cat file
        1 2
        3
        1 2 3
        1
        2
        3




        $ awk -v w=3 -f script.awk file
        1 2 3
        1 2 3
        1 2 3


        This awk script will collect the tab-delimited fields from the input until a preset number of fields have been collected. It will then output these collected fields as its own line before continuing reading from the input.



        The number of fields in the output is given by the value of w, which is passed on the command line as shown above. Note that this is the number of fields, not the number of tab characters.



        BEGIN { OFS = FS = "t" }

        function output_line () {
        # a function that outputs the nf elements in the array a
        # separated by OFS (tab) and terminated by ORS (newline)

        for (j = 1; j < nf; ++j)
        printf("%s%s", a[j], OFS)

        printf("%s%s", a[nf], ORS)
        }

        {
        # a: an array of fields that we want to output together
        # nf: the length of that array

        # just add each field to the a array
        for (i = 1; i <= NF; ++i) {
        a[++nf] = $i

        # if enough has been read, output the collected data
        if (nf == w) {
        output_line()
        nf = 0
        }
        }
        }

        END {
        # output any data remaining in a
        if (nf > 0)
        output_line()
        }


        This is the same as



        tr 't' 'n' <file | paste - - -


        for my small example. In your case, you could use the awk script above with -v w=996, or you could type the tr+paste command with 996 dashes.







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Dec 11 at 8:11

























        answered Dec 11 at 7:38









        Kusalananda

        121k16228372




        121k16228372

























            0














            Would continued reading lines until reaching field count help? From another post:



            awk -F't' '
            {while (NF<996) {getline X
            $0 = $0 FS X
            }
            }
            1
            ' file





            share|improve this answer





















            • The issue with this may be that you overshoot. It depends on whether the input data always contains a newline at the true record boundaries or not.
              – Kusalananda
              Dec 11 at 10:26










            • Absolutely. I take from the question that we have always an exact count match. Additional care needs to be taken if not.
              – RudiC
              Dec 11 at 10:33
















            0














            Would continued reading lines until reaching field count help? From another post:



            awk -F't' '
            {while (NF<996) {getline X
            $0 = $0 FS X
            }
            }
            1
            ' file





            share|improve this answer





















            • The issue with this may be that you overshoot. It depends on whether the input data always contains a newline at the true record boundaries or not.
              – Kusalananda
              Dec 11 at 10:26










            • Absolutely. I take from the question that we have always an exact count match. Additional care needs to be taken if not.
              – RudiC
              Dec 11 at 10:33














            0












            0








            0






            Would continued reading lines until reaching field count help? From another post:



            awk -F't' '
            {while (NF<996) {getline X
            $0 = $0 FS X
            }
            }
            1
            ' file





            share|improve this answer












            Would continued reading lines until reaching field count help? From another post:



            awk -F't' '
            {while (NF<996) {getline X
            $0 = $0 FS X
            }
            }
            1
            ' file






            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered Dec 11 at 10:20









            RudiC

            3,9641312




            3,9641312












            • The issue with this may be that you overshoot. It depends on whether the input data always contains a newline at the true record boundaries or not.
              – Kusalananda
              Dec 11 at 10:26










            • Absolutely. I take from the question that we have always an exact count match. Additional care needs to be taken if not.
              – RudiC
              Dec 11 at 10:33


















            • The issue with this may be that you overshoot. It depends on whether the input data always contains a newline at the true record boundaries or not.
              – Kusalananda
              Dec 11 at 10:26










            • Absolutely. I take from the question that we have always an exact count match. Additional care needs to be taken if not.
              – RudiC
              Dec 11 at 10:33
















            The issue with this may be that you overshoot. It depends on whether the input data always contains a newline at the true record boundaries or not.
            – Kusalananda
            Dec 11 at 10:26




            The issue with this may be that you overshoot. It depends on whether the input data always contains a newline at the true record boundaries or not.
            – Kusalananda
            Dec 11 at 10:26












            Absolutely. I take from the question that we have always an exact count match. Additional care needs to be taken if not.
            – RudiC
            Dec 11 at 10:33




            Absolutely. I take from the question that we have always an exact count match. Additional care needs to be taken if not.
            – RudiC
            Dec 11 at 10:33


















            draft saved

            draft discarded




















































            Thanks for contributing an answer to Unix & Linux Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.





            Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


            Please pay close attention to the following guidance:


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f487272%2funix-script-to-join-data-from-different-rows-into-one-single-row%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            List directoties down one level, excluding some named directories and files

            list processes belonging to a network namespace

            list systemd RuntimeDirectory mounts