wget: Retrieving a list of URLs when modifying input data file on the fly
This issue is currently driving me up the wall; it just does not work as it should.
I have a file inp of audio samples to download, in which I preserved the internal ID number (parsed from another location in the HTML source) so I can get rid of the internal (hex) filename. It looks like this:
http://whatever.site/data/samples/hexfilename1.mp3 12345.mp3
http://whatever.site/data/samples/hexfilename2.mp3 12346.mp3
http://whatever.site/data/samples/hexfilename3.mp3 12347.mp3
http://whatever.site/data/samples/hexfilename4.mp3 12348.mp3
http://whatever.site/data/samples/hexfilename5.mp3 12349.mp3
As I only need the first field of each line, I tried awk (or alternatively cut) to strip the rest on the fly:
$ wget -nc -i $(cut -f1 '-d ' inp)
or:
$ wget -nc -i $(awk '{print $1}' inp)
But it downloads all the mp3 files, then grinds for a short while, and then something very strange happens:
--2014-09-01 14:27:25-- http://whatever.site/data/samples/ID3%04
Ugh. It is exactly what you think it is: those are the first bytes of a binary mp3 file, which wget is trying to download as a URL after it has finished the regular downloads (and was supposed to terminate). But why does this happen?
If I go the clumsy way and create a temporary file inp2 for wget, then use it with the -i parameter, it works:
$ cat inp | awk '{print $1}' > inp2
$ wget -nc -i inp2
Why is there such a difference when inp is modified on the fly and passed directly to wget?
The most interesting thing is that the on-the-fly variant fails with both awk and cut, so neither tool is to blame.
Tags: awk, wget, cut
asked Sep 1 '14 at 12:37 by syntaxerror (edited Sep 1 '14 at 16:31)
Comments:
What happens if you swap them over: awk '{print $1}' inp | wget -i - ? – garethTheRed, Sep 1 '14 at 13:01
Do what @garethTheRed said. -i takes a file name as an argument (in your example, the first URL in your list) and reads it to get a list of URLs to retrieve. – Mark Plotnick, Sep 1 '14 at 13:13
Alternatively, I guess wget -nc -i <$(cut -f1 -d' ' inp) will work, if you're using bash. – Mark Plotnick, Sep 1 '14 at 13:24
@garethTheRed You're a marvel. Doing them in this order works. Thank you very much. But thanks as well to Mark for the alternate solution, which works if you use wget -nc -i <(cut -f1 -d' ' inp). No dollars here. :) – syntaxerror, Sep 1 '14 at 16:34
@syntaxerror If you found a solution, it's totally OK to add and accept your own answer. (There might be a minimum time limit until you can accept it though.) – Anko, Sep 1 '14 at 19:10
1 Answer
The reason it didn't work is bad syntax:
wget -nc -i $(cut -f1 '-d ' inp)
...the problem is that the -i switch expects one of:
- a local text file containing a list of URLs
- a remote text file containing a list of URLs
- a remote HTML file containing a list of links local to it
But the command substitution expands the whole first column onto the command line, so the code above passes -i http://whatever.site/data/samples/hexfilename1.mp3 (with the remaining URLs as ordinary command-line arguments), and an mp3 is neither a text nor an HTML file. man wget says:
COLUMNS=72 man wget | grep -m1 -A 22 '-i '
-i file
--input-file=file
Read URLs from a local or external file. If - is specified
as file, URLs are read from the standard input. (Use ./-
to read from a file literally named -.)
If this function is used, no URLs need be present on the
command line. If there are URLs both on the command line
and in an input file, those on the command lines will be
the first ones to be retrieved. If --force-html is not
specified, then file should consist of a series of URLs,
one per line.
However, if you specify --force-html, the document will be
regarded as html. In that case you may have problems with
relative links, which you can solve either by adding "<base
href="url">" to the documents or by specifying --base=url
on the command line.
If the file is an external one, the document will be
automatically treated as html if the Content-Type matches
text/html. Furthermore, the file's location will be
implicitly used as base href if none was specified.
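To see what the shell actually hands to wget, prefix the command with echo — a quick sanity check using the placeholder URLs from the question (the real output is one long line, wrapped here for readability):
$ echo wget -nc -i $(cut -f1 '-d ' inp)
wget -nc -i http://whatever.site/data/samples/hexfilename1.mp3
    http://whatever.site/data/samples/hexfilename2.mp3 [and so on]
Per the man page above, wget retrieves the command-line URLs first, then fetches the -i URL, treats the downloaded binary mp3 as a list of URLs, and apparently resolves its first bytes (an ID3 tag header) against the file's location, hence the bogus request for http://whatever.site/data/samples/ID3%04.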
Fixes include:
Using stdin for the -i parameter, as per garethTheRed's comment:
cut -d' ' -f1 inp | wget -nc -i -
Or this bash-centric method, which is just one character away from what was originally attempted, as per syntaxerror's comment:
wget -nc -i <(cut -f1 '-d ' inp)
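The second form works because bash's process substitution replaces <(...) with the path of a file descriptor that wget can open like a regular file. A minimal illustration (the exact /dev/fd number varies by system):
$ echo <(true)
/dev/fd/63
This also shows why the <$(...) variant guessed in the comments fails: the extra $ makes the shell treat the expanded URLs as the target of an input redirection, leaving -i with no argument at all.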
– agc, answered Feb 11 '18 at 20:41