How to follow all “HTML links”, and yet only save .zip files
I am trying to do this:
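# Recursive (-r), resumable (-c) download limited to the listed suffixes (-A).
# URLs containing a query string ("?") are skipped via --reject-regex.
function download() {
  wget \
    -r \
    -c \
    -A .zip,.html,.jsp,.php,.cgi \
    --reject-regex '(.*)\?(.*)' \
    --secure-protocol=TLSv1_2 \
    "$1"
}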
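A side check on the --reject-regex pattern: in an extended regular expression an unescaped ? is a quantifier, not a literal question mark, so it has to be escaped if only URLs with a query string should be rejected. The sketch below uses grep -E as a stand-in for wget's regex matching. That is an assumption: wget's default posix regex type should behave similarly, but this is not verified against wget itself. The helper names has_query and matches_unescaped and the example.com URLs are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the URL contains a literal "?".
# grep -E stands in for wget's POSIX regex matching (assumption).
has_query() {
  printf '%s\n' "$1" | grep -Eq '(.*)\?(.*)'
}

# Unescaped variant, for comparison: "?" makes the first group optional,
# so this pattern matches any URL at all.
matches_unescaped() {
  printf '%s\n' "$1" | grep -Eq '(.*)?(.*)'
}

for url in 'https://example.com/foo' 'https://example.com/foo?page=2'; do
  if has_query "$url"; then
    echo "escaped pattern matches:   $url"
  fi
  if matches_unescaped "$url"; then
    echo "unescaped pattern matches: $url"
  fi
done
```

With grep, the escaped pattern matches only the URL that has a query string, while the unescaped one matches both, which would amount to rejecting everything.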
The problems I'm facing are:
- I only want the .zip files.
- The .zip files are linked to on the "HTML" pages, which can be found under URLs such as /foo, /foo.html, /foo.jsp, /foo.php, /foo.cgi, and a few others I am not aware of, I'm sure.
So I am trying to say: "Visit every HTML link, but download every .zip file." I am wondering how to do this properly with wget. I am also skipping the links that carry URL parameters, because currently it downloads them all. If there is a better way to handle this (such as just collecting the links from those pages without saving them), that would be good to know.
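One way to read "collecting the links without downloading" is to stream a page to standard output and filter its href attributes instead of saving it. The following is only a rough sketch under several assumptions: links appear as lowercase, double-quoted href="..." attributes, grep supports -o (GNU and BSD grep do), and the function name zip_links is made up for illustration. It does not resolve relative URLs and is not a real HTML parser.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin, print href targets ending in .zip.
zip_links() {
  grep -oE 'href="[^"]+\.zip"' | sed -e 's/^href="//' -e 's/"$//'
}

# Example with inline HTML. In practice the input might come from
#   wget -q -O - "$url" | zip_links
# which writes the page to stdout instead of saving it to disk.
printf '%s\n' \
  '<a href="/files/a.zip">a</a> <a href="/foo.php">foo</a>' \
  '<a href="b.zip">b</a>' | zip_links
```

This handles a single page only; walking the whole site this way would still need some loop over the non-.zip links, which is the part wget -r normally takes care of.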
The above doesn't work because it misses links without an extension, like /foo. Plus, I don't want to save the .php and other files, just the .zip ones. Right now I am just leaving off the -A option and downloading everything, then removing the unwanted files in a custom script after wget finishes, which doesn't seem right. Basically, I am just wondering how to do this correctly with wget.
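For reference, the clean-up step described above could look roughly like this. It is a minimal sketch, assuming everything was downloaded under a single directory and that find supports -delete and -mindepth (GNU and BSD find do; neither is in POSIX). The function name prune_non_zip and the demo directory layout are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: delete every regular file under "$1" that is not a
# .zip, then remove any directories that are left empty.
prune_non_zip() {
  find "$1" -type f ! -name '*.zip' -delete
  find "$1" -mindepth 1 -type d -empty -delete
}

# Demo on a throwaway directory standing in for wget's output tree.
demo=$(mktemp -d)
mkdir -p "$demo/site/docs"
: > "$demo/site/index.html"
: > "$demo/site/docs/foo.php"
: > "$demo/site/docs/data.zip"
prune_non_zip "$demo"
find "$demo" -type f   # lists the files that survived the clean-up
rm -rf "$demo"
```

This still downloads everything first, so it only tidies up the workaround rather than avoiding the extra transfers.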
wget
asked Nov 16 at 17:43
Lance Pollard