wget: Retrieving a list of URLs when modifying input data file on the fly
This issue is currently driving me up the wall; it just does not work as it should.
I have a file inp of audio samples to download, in which I preserved the internal ID number (parsed from another location in the HTML source) so I can get rid of the internal (hex) filename. It looks like this:
http://whatever.site/data/samples/hexfilename1.mp3 12345.mp3
http://whatever.site/data/samples/hexfilename2.mp3 12346.mp3
http://whatever.site/data/samples/hexfilename3.mp3 12347.mp3
http://whatever.site/data/samples/hexfilename4.mp3 12348.mp3
http://whatever.site/data/samples/hexfilename5.mp3 12349.mp3
As I only need the first field of each line, I tried awk (or alternatively cut) to strip the rest on the fly:
$ wget -nc -i $(cut -f1 '-d ' inp)
or:
$ wget -nc -i $(awk '{print $1}' inp)
But it downloads all the mp3 files, then grinds for a short while, and then something very strange happens:
--2014-09-01 14:27:25-- http://whatever.site/data/samples/ID3%04
Ugh. It is exactly what you think it is: those are the first bytes of a binary mp3 file, which wget is trying to download as a URL after it has finished the regular downloads (and was supposed to terminate). But why does this happen?
If I go the clumsy way and create a temporary file inp2 for wget, then use it with the -i parameter, it works:
$ cat inp | awk '{print $1}' > inp2
$ wget -nc -i inp2
Why is there such a difference when inp is modified on the fly and passed directly to wget?
The most interesting thing is that the on-the-fly variant fails with both awk and cut, so neither tool is to blame.
Tags: awk, wget, cut
asked Sep 1 '14 at 12:37 by syntaxerror (edited Sep 1 '14 at 16:31)
Comments:
What happens if you swap them over: awk '{print $1}' inp | wget -i - ? – garethTheRed, Sep 1 '14 at 13:01
Do what @garethTheRed said. -i takes a file name as an argument (in your example, the first URL in your list) and reads it to get a list of URLs to retrieve. – Mark Plotnick, Sep 1 '14 at 13:13
Alternatively, I guess wget -nc -i <$(cut -f1 -d' ' inp) will work, if you're using bash. – Mark Plotnick, Sep 1 '14 at 13:24
@garethTheRed You're a marvel. Doing them in this order works. Thank you very much. But thanks as well to Mark for the alternate solution, which works if you use wget -nc -i <(cut -f1 -d' ' inp). No dollars here. :) – syntaxerror, Sep 1 '14 at 16:34
@syntaxerror If you found a solution, it's totally OK to add and accept your own answer. (There might be a minimum time limit until you can accept it though.) – Anko, Sep 1 '14 at 19:10
1 Answer
The reason it didn't work is bad syntax:
wget -nc -i $(cut -f1 '-d ' inp)
...the problem is that the -i switch expects one of:
- a local text file containing a list of URLs
- a remote text file containing a list of URLs
- a remote HTML file containing a list of links local to it
But the command substitution expands the whole first column onto the command line, so the code above passes -i http://whatever.site/data/samples/hexfilename1.mp3 (with the remaining URLs as ordinary command-line arguments), and an mp3 is neither a text nor an HTML file. man wget says:
COLUMNS=72 man wget | grep -m1 -A 22 '-i '
-i file
--input-file=file
Read URLs from a local or external file. If - is specified
as file, URLs are read from the standard input. (Use ./-
to read from a file literally named -.)
If this function is used, no URLs need be present on the
command line. If there are URLs both on the command line
and in an input file, those on the command lines will be
the first ones to be retrieved. If --force-html is not
specified, then file should consist of a series of URLs,
one per line.
However, if you specify --force-html, the document will be
regarded as html. In that case you may have problems with
relative links, which you can solve either by adding "<base
href="url">" to the documents or by specifying --base=url
on the command line.
If the file is an external one, the document will be
automatically treated as html if the Content-Type matches
text/html. Furthermore, the file's location will be
implicitly used as base href if none was specified.
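To see what the shell actually hands to wget, prefix the command with echo — a quick sanity check using the placeholder URLs from the question (the real output is one long line, wrapped here for readability):
$ echo wget -nc -i $(cut -f1 '-d ' inp)
wget -nc -i http://whatever.site/data/samples/hexfilename1.mp3
    http://whatever.site/data/samples/hexfilename2.mp3 [and so on]
Per the man page above, wget retrieves the command-line URLs first, then fetches the -i URL, treats the downloaded binary mp3 as a list of URLs, and apparently resolves its first bytes (an ID3 tag header) against the file's location, hence the bogus request for http://whatever.site/data/samples/ID3%04.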
Fixes include:
Using stdin for the -i parameter, as per garethTheRed's comment:
cut -d' ' -f1 inp | wget -nc -i -
Or this bash-centric method, which is just one character away from what was originally attempted, as per syntaxerror's comment:
wget -nc -i <(cut -f1 '-d ' inp)
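The second form works because bash's process substitution replaces <(...) with the path of a file descriptor that wget can open like a regular file. A minimal illustration (the exact /dev/fd number varies by system):
$ echo <(true)
/dev/fd/63
This also shows why the <$(...) variant guessed in the comments fails: the extra $ makes the shell treat the expanded URLs as the target of an input redirection, leaving -i with no argument at all.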
– agc, answered Feb 11 '18 at 20:41