wget - How to download recursively and only specific mime-types/extensions (i.e. text only)
How can I download a full website, but ignore all binary files?
wget has this functionality with the -r flag, but it downloads everything, and some websites are just too much for a low-resource machine; the full download isn't useful for the specific reason I'm downloading the site.
Here is the command line I use: wget -P 20 -r -l 0 http://www.omardo.com/blog (my own blog)
wget recursive download mime-types
wget can only filter by file suffix.
– daisy, Oct 31 '12 at 8:33
@warl0ck I didn't know that, thanks! The -A and -R options are very useful for my purposes.
– Omar Al-Ithawi, Oct 31 '12 at 8:43
asked Oct 31 '12 at 8:15 by Omar Al-Ithawi
4 Answers
You could specify a list of allowed or disallowed filename patterns:
Allowed: -A LIST (or --accept LIST)
Disallowed: -R LIST (or --reject LIST)
LIST is a comma-separated list of filename patterns/extensions.
You can use the following reserved characters to specify patterns: * ? [ ]
Examples:
- only download PNG files: -A png
- don't download CSS files: -R css
- don't download PNG files that start with "avatar": -R avatar*.png
If the file has no extension, or the file name has no pattern you could make use of, you'd need MIME type parsing, I guess (see Lars Kotthoff's answer).
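For the text-only goal in the question, for instance, the recursive crawl could be limited to a handful of text extensions. The exact lists below are assumptions and would need adjusting for the site being mirrored:
    wget -r -l 0 -A 'html,htm,txt,css,js,xml' http://www.omardo.com/blog
    # or reject the common binary types instead:
    wget -r -l 0 -R 'jpg,jpeg,png,gif,ico,pdf,zip,gz,mp3,mp4' http://www.omardo.com/blog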
answered Nov 27 '12 at 16:26 by unor (edited Apr 13 '17 at 12:37)
You could try patching wget with this (also here) to filter by MIME type. This patch is quite old now though, so it might not work anymore.
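If you want to experiment with it, the usual routine is to apply the patch against the wget release it was written for and rebuild from source. A rough sketch, assuming the patch has been saved as mime-filter.patch (a hypothetical name) next to an unpacked source tree of the matching release:
    cd wget-<matching-version>/
    patch -p1 < ../mime-filter.patch   # or -p0, depending on how the patch was generated
    ./configure                        # SSL-related configure flags vary between wget releases
    make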
answered Oct 31 '12 at 17:29 by Lars Kotthoff
Giving this a shot with ftp.gnu.org/gnu/wget... I rolled the dice on just patching the newest version of wget with this, but no luck (of course). I would try to update the patch, but I frankly don't have the chops yet in C++ for it not to be a time sink. I did manage to grab the version of wget it was written for and get that running. I had trouble, though, compiling with SSL support, because I could not figure out which version of OpenSSL I needed to grab.
– Prospero, Mar 29 '13 at 1:54
This looks great. Any idea why this patch has not yet been accepted (four years later)?
– David Portabella, Sep 6 '16 at 8:57
I've tried a totally different approach: using Scrapy. However, it has the same problem! Here's how I solved it: SO: Python Scrapy - mimetype based filter to avoid non-text file downloads?
The solution is to set up a Node.js proxy and configure Scrapy to use it through the http_proxy environment variable.
What the proxy should do is:
- Take HTTP requests from Scrapy, send them to the server being crawled, and give the response back to Scrapy, i.e. intercept all HTTP traffic.
- For binary files (based on a heuristic you implement), send a 403 Forbidden error to Scrapy and immediately close the request/response. This saves time and traffic, and Scrapy won't crash.
Sample proxy code that actually works:
var http = require('http');

http.createServer(function(clientReq, clientRes) {
    var options = {
        host: clientReq.headers['host'],
        port: 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers
    };

    // Forward the incoming request to the target server.
    var proxyReq = http.request(options, function(proxyRes) {
        var contentType = proxyRes.headers['content-type'] || '';
        if (!contentType.startsWith('text/')) {
            // Non-text response: abort the download and tell the client it is forbidden.
            proxyRes.destroy();
            var httpForbidden = 403;
            clientRes.writeHead(httpForbidden);
            clientRes.write('Binary download is disabled.');
            clientRes.end();
            return;
        }
        // Text response: stream it back to the client unchanged.
        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
    });

    proxyReq.on('error', function(e) {
        console.log('problem with proxy request: ' + e.message);
    });

    proxyReq.end();
}).listen(8080);
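To use it, run the proxy and point the crawler at it through the http_proxy environment variable (Scrapy's default HttpProxyMiddleware picks that variable up). The file name proxy.js and the spider name are assumptions for illustration:
    node proxy.js &
    export http_proxy=http://localhost:8080
    scrapy crawl your_spider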
answered Dec 25 '13 at 8:51 by Omar Al-Ithawi (accepted; edited May 23 '17 at 12:40)
A new Wget (Wget2) already has this feature:
--filter-mime-type    Specify a list of mime types to be saved or ignored
From the documentation for --filter-mime-type=list:
Specify a comma-separated list of MIME types that will be downloaded. Elements of the list may contain wildcards.
If a MIME type starts with the character '!' it won't be downloaded; this is useful when trying to download
something with exceptions. For example, download everything except images:
wget2 -r https://<site>/<document> --filter-mime-type=*,!image/*
It is also useful to download files that are compatible with an application on your system. For instance,
download every file that is compatible with LibreOffice Writer from a website using recursive mode:
wget2 -r https://<site>/<document> --filter-mime-type=$(sed -r '/^MimeType=/!d;s/^MimeType=//;s/;/,/g' /usr/share/applications/libreoffice-writer.desktop)
Wget2 has not been released as of today, but it will be soon. Debian unstable already ships an alpha version.
Look at https://gitlab.com/gnuwget/wget2 for more info. You can post questions/comments directly to bug-wget@gnu.org.
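Applied to the original question (text only), an invocation along these lines should work, going by the wildcard syntax quoted above; treat it as a sketch rather than tested output:
    wget2 -r --filter-mime-type='text/*' http://www.omardo.com/blog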
answered 3 hours ago by Tim Ruehsen rockdaboot