wget - How to download recursively and only specific mime-types/extensions (i.e. text only)























How can I download a full website while ignoring all binary files?



wget can download recursively with the -r flag, but it downloads everything. Some websites are simply too much for a low-resource machine, and the binary files are of no use for the reason I'm downloading the site.



Here is the command line I use: wget -P 20 -r -l 0 http://www.omardo.com/blog (my own blog)




























  • wget can only filter by file suffix
    – daisy
    Oct 31 '12 at 8:33










  • @warl0ck I didn't know that, thanks! -A and -R options are very useful for my operations.
    – Omar Al-Ithawi
    Oct 31 '12 at 8:43















wget recursive download mime-types
















asked Oct 31 '12 at 8:15









Omar Al-Ithawi



















4 Answers






























You can specify a list of allowed or disallowed filename patterns:



Allowed:



-A LIST
--accept LIST


Disallowed:



-R LIST
--reject LIST


LIST is a comma-separated list of filename patterns or extensions.



You can use the following reserved characters to specify patterns:




  • *

  • ?

  • [

  • ]


Examples:




  • only download PNG files: -A png

  • don't download CSS files: -R css

  • don't download PNG files that start with "avatar": -R 'avatar*.png' (quoted so the shell does not expand the wildcard)
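How the entries of LIST are interpreted can be sketched in JavaScript. This is only an illustration of the behaviour the wget manual describes (an entry containing *, ?, [ or ] is a pattern, any other entry is a suffix); it is not wget's own code, and the function names globToRegExp, matchesEntry and matchesList are made up for this example.

```javascript
// Illustrative sketch, not wget's implementation.
// Convert a simple glob (*, ?, [...]) into an anchored regular expression.
// Negated classes such as [!a] are not handled here.
function globToRegExp(glob) {
  var out = '';
  for (var i = 0; i < glob.length; i++) {
    var c = glob[i];
    if (c === '*') {
      out += '.*';               // any run of characters
    } else if (c === '?') {
      out += '.';                // exactly one character
    } else if (c === '[' || c === ']') {
      out += c;                  // keep character classes as they are
    } else {
      out += c.replace(/[.+^${}()|\\]/g, '\\$&'); // escape regex specials
    }
  }
  return new RegExp('^' + out + '$');
}

// An entry with a wildcard character is a pattern for the whole file name;
// an entry without one is a plain suffix.
function matchesEntry(fileName, entry) {
  if (/[*?\[\]]/.test(entry)) {
    return globToRegExp(entry).test(fileName);
  }
  return fileName.endsWith(entry);
}

// LIST is comma-separated; a file matches if any entry matches.
function matchesList(fileName, list) {
  return list.split(',').some(function (entry) {
    return matchesEntry(fileName, entry);
  });
}
```

With this sketch, matchesList('logo.png', 'png') is true because of the suffix rule, while matchesList('logo.png', 'avatar*.png') is false because the pattern has to match the whole file name.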


If the file has no extension, or the file name has no pattern you could make use of, you'd need MIME type filtering, I guess (see Lars Kotthoff's answer).

















































    You could try patching wget with this (also here) to filter by MIME type. This patch is quite old now though, so it might not work anymore.



























    • Giving this a shot... ftp.gnu.org/gnu/wget I rolled the dice on just patching the newest version of wget with this, but no luck (of course). I would try to update the patch, but frankly I don't have the chops in C++ yet for it not to be a time sink. I did manage to grab the version of wget it was written for and get that running. I had trouble compiling with SSL support, though, because I could not figure out which version of OpenSSL I needed to grab.
      – Prospero
      Mar 29 '13 at 1:54










    • This looks great. Any idea why this patch has still not been accepted (four years later)?
      – David Portabella
      Sep 6 '16 at 8:57


















    (accepted answer)










    I tried a totally different approach, using Scrapy, but it has the same problem! Here's how I solved it: SO: Python Scrapy - mimetype based filter to avoid non-text file downloads?




    The solution is to set up a Node.js proxy and configure Scrapy to use
    it through the http_proxy environment variable.



    What the proxy should do is:




    • Take HTTP requests from Scrapy and send them to the server being crawled, then pass the response back to Scrapy, i.e. intercept
      all HTTP traffic.

    • For binary files (based on a heuristic you implement), send a 403 Forbidden error to Scrapy and immediately close the request/response.
      This saves time and traffic, and Scrapy won't crash.


    Sample Proxy Code That actually works!




    var http = require('http');

    // Forward each request to the target server; refuse non-text responses.
    http.createServer(function(clientReq, clientRes) {
      var options = {
        host: clientReq.headers['host'],
        port: 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers
      };

      var proxyReq = http.request(options, function(proxyRes) {
        var contentType = proxyRes.headers['content-type'] || '';
        if (!contentType.startsWith('text/')) {
          // Treat everything that is not text/* as binary: drop the upstream
          // response and answer the client with 403 Forbidden.
          proxyRes.destroy();
          var httpForbidden = 403;
          clientRes.writeHead(httpForbidden);
          clientRes.write('Binary download is disabled.');
          clientRes.end();
          return; // do not fall through and write the headers a second time
        }

        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
      });

      proxyReq.on('error', function(e) {
        console.log('problem with clientReq: ' + e.message);
      });

      proxyReq.end();

    }).listen(8080);
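    The proxy above treats everything outside text/* as binary. A slightly broader heuristic might also let through a few text-like application types. The sketch below is one way to write that check; the helper name isTextLike and the contents of TEXT_LIKE_TYPES are assumptions for illustration, not part of the original answer.

```javascript
// Hypothetical helper: decide whether a Content-Type header value looks like text.
// TEXT_LIKE_TYPES is an illustrative assumption; adjust it to your crawl.
var TEXT_LIKE_TYPES = [
  'application/json',
  'application/xml',
  'application/xhtml+xml',
  'application/javascript'
];

function isTextLike(contentTypeHeader) {
  // Drop parameters such as "; charset=utf-8" and normalise case.
  var mimeType = String(contentTypeHeader || '')
    .split(';')[0]
    .trim()
    .toLowerCase();
  if (mimeType === '') {
    return false; // no Content-Type: treat as binary, as the proxy above does
  }
  return mimeType.indexOf('text/') === 0 ||
    TEXT_LIKE_TYPES.indexOf(mimeType) !== -1;
}
```

    In the proxy, the condition !contentType.startsWith('text/') could then be replaced by !isTextLike(contentType).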
















































      The new Wget (Wget2) already has this feature:



      --filter-mime-type    Specify a list of mime types to be saved or ignored

      ### `--filter-mime-type=list`

      Specify a comma-separated list of MIME types that will be downloaded. Elements of list may contain wildcards.
      If a MIME type starts with the character '!' it won't be downloaded, this is useful when trying to download
      something with exceptions. For example, download everything except images:

      wget2 -r https://<site>/<document> --filter-mime-type='*,!image/*'

      It is also useful to download files that are compatible with an application of your system. For instance,
      download every file that is compatible with LibreOffice Writer from a website using the recursive mode:

      wget2 -r https://<site>/<document> --filter-mime-type=$(sed -r '/^MimeType=/!d;s/^MimeType=//;s/;/,/g' /usr/share/applications/libreoffice-writer.desktop)


      Wget2 has not been released as of today, but will be soon. Debian unstable already has an alpha version shipped.



      Look at https://gitlab.com/gnuwget/wget2 for more info. You can post questions/comments directly to bug-wget@gnu.org.



























































            edited Apr 13 '17 at 12:37









            Community











            answered Nov 27 '12 at 16:26









            unor



































                answered Oct 31 '12 at 17:29









                Lars Kotthoff





















                    edited May 23 '17 at 12:40









                    Community











                    answered Dec 25 '13 at 8:51









                    Omar Al-Ithawi























                        up vote
                        0
                        down vote













                        A new Wget (Wget2) already has feature:



                        --filter-mime-type    Specify a list of mime types to be saved or ignored`

                        ### `--filter-mime-type=list`

                        Specify a comma-separated list of MIME types that will be downloaded. Elements of list may contain wildcards.
                        If a MIME type starts with the character '!' it won't be downloaded, this is useful when trying to download
                        something with exceptions. For example, download everything except images:

                        wget2 -r https://<site>/<document> --filter-mime-type=*,!image/*

                        It is also useful to download files that are compatible with an application of your system. For instance,
                        download every file that is compatible with LibreOffice Writer from a website using the recursive mode:

                        wget2 -r https://<site>/<document> --filter-mime-type=$(sed -r '/^MimeType=/!d;s/^MimeType=//;s/;/,/g' /usr/share/applications/libreoffice-writer.desktop)


                        Wget2 has not been released as of today, but will be soon. Debian unstable already has an alpha version shipped.



                        Look at https://gitlab.com/gnuwget/wget2 for more info. You can post questions/comments directly to bug-wget@gnu.org.






                        share|improve this answer






































































                            answered 3 hours ago









                            Tim Ruehsen rockdaboot

                            1213


































                                 
