tar takes a long time before passing data to gzip


























What is tar doing at the start, before it begins passing data to gzip? Can I make it skip that step?



I'm writing a script to run on my Synology NAS (running DSM 6.2.1-23824 Update 1, with tar version 1.28) to compress copies of virtual machine HDD images. The source files are stored as sparse files on a btrfs filesystem. I'm looking for a little compression, preferably preserving the sparseness, with as much speed as possible.



Although I am working with only one file at a time, the reason for using tar in the first place is its --sparse flag, as gzip cannot restore a file as a sparse file. The central command I'm trying to run is:



GZIP=-1 nice -n 19 tar --keep-old-files --sparse -czf "$destDir/$vmFolder/$file.tar.gz" "$file" 2>>"$log"


However, with HDD images ranging from 2GB to 120GB, tar spends many minutes after it starts furiously reading the source as fast as it can, while gzip is given nothing to work with. The length of this phase scales with the size of the source file.



Things I've tried to work around the issue:




  • If I just use gzip the output starts straight away, but I lose the sparse info.


  • If I use pipes, as below, it does the same thing.



    nice -n 19 tar --keep-old-files --sparse -cf - "$file" | nice -n 19 gzip --fast > "$destDir/$vmFolder/$file.tar.gz" 2>>"$log"



Admittedly the NAS only has an Intel Atom D2700, but the tar operation shouldn't be CPU intensive. I appreciate that gzip is CPU intensive and will be a limiting factor, particularly with an old Atom CPU. I was hoping to use lz4 or lzop, but the Synology OS doesn't seem to have them, just gzip, 7z, and xz.



Note that the script can run as many of these commands in parallel as I like, using this semaphore script as a template, to utilise all cores of the CPU even with single-threaded gzip.



Edit: Testing my script without the --sparse option, but still using tar, does not have this problem, and the data immediately flows through to gzip.
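The behaviour described in the edit can be reproduced on a small scale. The following is a minimal sketch, assuming GNU tar and GNU coreutils (`truncate`, `stat`, `dd`, `mktemp`) are available, which may not hold on DSM; the temporary file and the variable names are made up for illustration. It creates a mostly empty 64 MiB file, confirms that it is sparse by comparing apparent and allocated size, and compares the size of the tar stream with and without --sparse.

```shell
#!/bin/sh
# Hypothetical test file; not one of the real VM images.
work=$(mktemp -d)
img="$work/test.img"

# 64 MiB apparent size, with a few bytes of real data 1 MiB in.
truncate -s 64M "$img"
printf 'data' | dd of="$img" bs=1 seek=1048576 conv=notrunc 2>/dev/null

# Apparent size in bytes versus bytes actually allocated on disk.
apparent=$(stat -c %s "$img")
allocated=$(( $(stat -c %b "$img") * $(stat -c %B "$img") ))
echo "apparent=$apparent allocated=$allocated"

# Size of the archive stream with and without sparse handling.
sparse_bytes=$(tar --sparse -cf - -C "$work" test.img | wc -c)
plain_bytes=$(tar -cf - -C "$work" test.img | wc -c)
echo "sparse=$sparse_bytes plain=$plain_bytes"

rm -rf "$work"
```

If the sparse stream is far smaller than the plain one, tar found the holes. Prefixing the two tar commands with `time` on a real image should show whether the hole scan accounts for the delay.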






























  • tar has to determine whether a file is sparse before archiving it. This means reading the entire file to find the non-sparse bits. This could potentially take time depending on the size of the source files. I'm not writing this as an answer since it's just handwaving and I have nothing to test on.
    – Kusalananda
    Dec 14 at 7:54












  • @Kusalananda You may be on the right track. From the manual on Debian 9: "--sparse When given this option, tar attempts to determine if the file is sparse prior to archiving it".
    – Kamil Maciorowski
    Dec 14 at 8:11










  • @Kusalananda If you are on a historic platform, you are right. Those on modern platforms have been able to use the lseek() SEEK_HOLE feature since summer 2005. However, I am not sure whether gtar has caught up with the present yet. star has used the SEEK_HOLE feature since summer 2005.
    – schily
    Dec 14 at 15:21












  • @schily GNU tar will use SEEK_HOLE if the lseek() implementation on the host supports it, according to the gtar source code.
    – Kusalananda
    Dec 14 at 15:24
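
Whether a given tar build can be told how to look for holes can be checked from the shell. This is only a sketch: it assumes GNU tar, and it relies on the `--hole-detection` option that newer GNU tar releases document (with `seek` and `raw` methods). tar 1.28 may well not have it, which is what the check is meant to reveal. The variable name is illustrative.

```shell
#!/bin/sh
# Print the tar version in use.
tar --version | head -n 1

# Look for the option in the help text rather than assuming it exists.
if tar --help 2>/dev/null | grep -q -e '--hole-detection'; then
    hole_opt=supported
else
    hole_opt=unsupported
fi
echo "hole-detection option: $hole_opt"
```

Where the option is supported, adding `--hole-detection=seek` alongside `--sparse` asks tar to locate holes with SEEK_HOLE rather than by reading the data. Where it is not, the initial full read probably cannot be skipped with that binary.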
















tar btrfs sparse-files






edited Dec 14 at 9:01

























asked Dec 14 at 4:53









BeowulfNode42








