wget: Retrieving a list of URLs when modifying input data file on the fly

This issue is currently driving me up the wall.

It just does not work as it should.

I have a file inp with audio samples to download; next to each URL I kept the internal ID number (recovered by parsing another part of the HTML source) so that I can replace the internal (hex) filename. It looks like this:



http://whatever.site/data/samples/hexfilename1.mp3 12345.mp3
http://whatever.site/data/samples/hexfilename2.mp3 12346.mp3
http://whatever.site/data/samples/hexfilename3.mp3 12347.mp3
http://whatever.site/data/samples/hexfilename4.mp3 12348.mp3
http://whatever.site/data/samples/hexfilename5.mp3 12349.mp3


As I only need the first field on each line, I've tried awk (or alternatively cut) to strip the rest, but on the fly:



$ wget -nc -i $(cut -f1 '-d ' inp)


or alternatively



$ wget -nc -i $(awk 'print $1' inp)


But either way it downloads all the mp3 files, grinds for a short while, and then something very strange happens:



--2014-09-01 14:27:25--  http://whatever.site/data/samples/ID3%04


Ugh. It is exactly what you think it is: the first bytes of a binary mp3 file, which wget tries to download after it has finished with the regular ones (when it was supposed to terminate). But why does this happen?
If I go the clumsy way and create a temporary file inp2 for wget, then pass it with the -i parameter, it works:



$ cat inp | awk '{print $1}' > inp2


Why does it make such a difference when inp gets modified on the fly and passed directly to wget?
The most interesting thing is that the on-the-fly variant won't work with either awk or cut, so neither tool is to blame.










  • What happens if you swap them over: awk '{print $1}' inp | wget -i - ?

    – garethTheRed, Sep 1 '14 at 13:01

  • Do what @garethTheRed said. -i takes a file name as an argument (in your example, the first URL in your list) and reads it to get a list of URLs to retrieve.

    – Mark Plotnick, Sep 1 '14 at 13:13

  • Alternatively, I guess wget -nc -i <$(cut -f1 -d' ' inp) will work, if you're using bash.

    – Mark Plotnick, Sep 1 '14 at 13:24

  • @garethTheRed You're a marvel. Doing them in this order works. Thank you very much. But thanks as well to Mark for the alternate solution, which works if you use wget -nc -i <(cut -f1 -d' ' inp). No dollars here. :)

    – syntaxerror, Sep 1 '14 at 16:34

  • @syntaxerror If you found a solution, it's totally OK to add and accept your own answer. (There might be a minimum time limit until you can accept it though.)

    – Anko, Sep 1 '14 at 19:10
















Tags: awk wget cut






edited Sep 1 '14 at 16:31 by syntaxerror
asked Sep 1 '14 at 12:37 by syntaxerror








1 Answer
The reason it didn't work is bad syntax:



wget -nc -i $(cut -f1 '-d ' inp)


...the problem is that the -i switch expects one of:




  1. a local text file containing a list of URLs

  2. a remote text file containing a list of URLs

  3. a remote HTML file containing links to files local to it.


But the code above gives -i http://whatever.site/data/samples/hexfilename1.mp3, which is neither a text nor an HTML file. man wget says:



COLUMNS=72 man wget | grep -m1 -A 22 '-i '
-i file
--input-file=file
Read URLs from a local or external file. If - is specified
as file, URLs are read from the standard input. (Use ./-
to read from a file literally named -.)

If this function is used, no URLs need be present on the
command line. If there are URLs both on the command line
and in an input file, those on the command lines will be
the first ones to be retrieved. If --force-html is not
specified, then file should consist of a series of URLs,
one per line.

However, if you specify --force-html, the document will be
regarded as html. In that case you may have problems with
relative links, which you can solve either by adding "<base
href="url">" to the documents or by specifying --base=url
on the command line.

If the file is an external one, the document will be
automatically treated as html if the Content-Type matches
text/html. Furthermore, the file's location will be
implicitly used as base href if none was specified.
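
To make that word-splitting concrete, here is a quick sketch with a throwaway two-line input file (the URLs are made up); echo stands in for wget, so nothing is actually downloaded:

```shell
# Build a two-line stand-in for inp (hypothetical URLs):
printf 'http://a/x.mp3 1.mp3\nhttp://a/y.mp3 2.mp3\n' > inp

# $(...) substitutes the command's stdout onto the command line, split
# into words -- so the URLs themselves become wget's arguments, and the
# first one ends up as the file name handed to -i:
echo wget -nc -i $(cut -d' ' -f1 inp)
# prints: wget -nc -i http://a/x.mp3 http://a/y.mp3
```

That is why wget first downloads all the URLs given as plain arguments, and only then chokes trying to treat the first mp3 as a URL list.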


Fixes include:





  1. Using stdin for the -i parameter as per garethTheRed's
    comment:



    cut -d' ' -f1 inp | wget -nc -i -



  2. Or this bash-centric process-substitution method, just one character
    away from what was originally attempted, as per syntaxerror's comment:



    wget -nc -i <(cut -f1 '-d ' inp)
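
As a sketch of why both fixes satisfy -i (same kind of made-up inp as above; the actual wget calls are left out so the snippet runs offline):

```shell
# Two-line stand-in for inp (hypothetical URLs):
printf 'http://a/x.mp3 1.mp3\nhttp://a/y.mp3 2.mp3\n' > inp

# Fix 1: the pipe feeds the URL list to wget's stdin; "-i -" makes wget
# read one URL per line from there. Shown here without the wget call:
cut -d' ' -f1 inp
# prints the two URLs, one per line

# Fix 2: process substitution <(...) expands to the *name* of a pseudo-
# file that wget can open and read -- exactly what -i expects. It is
# bash-only, hence run through bash here:
bash -c "echo <(cut -d' ' -f1 inp)"
# prints a path such as /dev/fd/63
```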







answered Feb 11 '18 at 20:41 by agc





























