unique contribution of folder to disk usage











I have a backup containing folders for daily snapshots. To save space, identical files in different snapshots are deduplicated via hard links (generated by rsync).



When I'm running out of space, one option is to delete older snapshots. But because of the hard links, it is hard to figure out how much space I would gain by deleting a given snapshot.



One option I can think of is to run du -s first on all snapshot folders together, then on all but the one I might delete; the difference between the two totals would be the space I could expect to gain. However, that's quite cumbersome and would have to be repeated for every candidate snapshot.
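
Concretely, that manual comparison might look like this (a rough sketch; the back-* names are placeholders, and it assumes snapshot names without whitespace):

# total usage with every snapshot present (du counts each hard-linked file only once)
du -sc back-* | tail -n1

# total usage with the candidate snapshot left out
du -sc $(ls -d back-* | grep -vx 'back-2018-03-01') | tail -n1

# the difference between the two totals is the space that deleting
# back-2018-03-01 would actually free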



Is there an easier way?





After trying out and thinking about the answers by Stéphane Chazelas and derobert, I realized that my question was not precise enough. Here's an attempt to be more precise:



I have a set of directories ("snapshots") whose files are partially storage-identical (hard linked) with files in other snapshots. I'm looking for a solution that gives me a list of the snapshots and, for each one, the amount of disk storage taken up by its files, excluding storage that is also used by a file in another snapshot. The solution should allow for the possibility that there are hard links within each snapshot as well.



The idea is that I can look at that list to decide which snapshots to delete when I run out of space, trading off the storage gained by deletion against the value of the snapshot (e.g. based on its age).










disk-usage hard-link






asked Oct 31 at 20:02 by A. Donda, edited Nov 17 at 21:54












  • See also unix.stackexchange.com/questions/52876/…
    – derobert
    Nov 2 at 17:09






  • But again, looking at the unique disk usage of directories in isolation is not necessarily useful. You may find that deleting dir1 saves nothing, that deleting dir2 saves nothing either, but that deleting both saves terabytes because they have large files in common that are not found elsewhere.
    – Stéphane Chazelas
    Nov 18 at 17:12




























3 Answers

You could do it by hand with GNU find:



find snapshot-dir -type d -printf '1 %b\n' -o -printf '%n %b %i\n' |
  awk '$1 == 1 || ++c[$3] == $1 {t+=$2; delete c[$3]}
       END{print t*512}'


That counts the disk usage of files whose link count would go down to 0 once all the links found in the snapshot directory have been removed, i.e. the files all of whose hard links are inside that directory.



find prints:





  • 1 <disk-usage> for directories


  • <link-count> <disk-usage> <inode-number> for other types of files.


We pretend the link count is always one for directories because, when in practice it isn't, that's due to the .. entries; find doesn't list those entries, and directories generally don't have other hard links.



From that output, awk counts the disk usage of the entries that have a link count of 1, and also of the inodes it has seen <link-count> times (that is, the ones whose hard links are all in the current directory and which therefore, like the ones with a link count of one, would have their space reclaimed once the directory tree is deleted).



You can also use find snapshot-dir1 snapshot-dir2 to find out how much disk space would be reclaimed if both dirs were removed (which may be more than the sum of the space for the two directories taken individually if there are files that are found in both and only in those snapshots).



If you want to find out how much space you would save after each successive snapshot-dir deletion (cumulatively), you could do:



find snapshot-dir* \( -path '*/*' -o -printf "%p:\n" \) \
  -type d -printf '1 %b\n' -o -printf '%n %b %i\n' |
  awk '/:$/ {if (NR>1) print t*512; printf "%s ", $0; next}
       $1 == 1 || ++c[$3] == $1 {t+=$2; delete c[$3]}
       END{print t*512}'


That processes the list of snapshots in lexical order. If you processed it in a different order, that would likely give you different numbers except for the final one (when all snapshots are removed).



See numfmt to make the numbers more readable.
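
For example, piping the total through numfmt --to=iec turns the raw byte count into a human-readable figure (a small illustration of that suggestion, assuming GNU coreutils):

find snapshot-dir -type d -printf '1 %b\n' -o -printf '%n %b %i\n' |
  awk '$1 == 1 || ++c[$3] == $1 {t+=$2; delete c[$3]}
       END{print t*512}' |
  numfmt --to=iec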



That assumes all files are on the same filesystem. If not, you can replace %i with %D:%i (if they're not all on the same filesystem, that would mean you'd have a mount point in there which you couldn't remove anyway).
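
For instance, the first command above keyed on device:inode would become something like this (the same logic, just with a different key; an untested sketch):

find snapshot-dir -type d -printf '1 %b\n' -o -printf '%n %b %D:%i\n' |
  awk '$1 == 1 || ++c[$3] == $1 {t+=$2; delete c[$3]}
       END{print t*512}'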






answered Oct 31 at 21:54 by Stéphane Chazelas, last edited Nov 18 at 16:47 (accepted answer)

  • This seems to work, thanks, which is why I have upvoted it. As you write, I would have to repeat this, which I can do with a simple loop. However, that would take a long time (like derobert's answer) and is therefore not really practical. I believe there must be a solution to do this more effectively for many or all snapshot folders at once, which is why I don't accept your answer yet.
    – A. Donda
    Nov 2 at 16:50






  • @A.Donda, I'm not sure what you mean. What do you want to repeat? What do you want to achieve? Do you want to know how many snapshots you need to remove to be able to reclaim say 1TB? Would you delete snapshots in sequence or based on some criteria?
    – Stéphane Chazelas
    Nov 2 at 16:55












  • @A.Donda, see if the edit answers your loop question.
    – Stéphane Chazelas
    Nov 2 at 17:18










  • The idea would be to have a list of the snapshots, each with the amount of space freed if that respective snapshot is deleted. This way I would avoid deleting snapshots that don't amount to much anyway, and not have to repeat your original command manually for each snapshot. I believe this is what you have solved with your update?
    – A. Donda
    Nov 4 at 1:10










  • I ended up creating my own bash script (see new answer), but I learned a lot from yours: I also use find, awk and numfmt. The difference is that I filter out inodes using comm. Thanks again!
    – A. Donda
    Nov 18 at 1:47



















If your file names don't contain pattern characters or newlines, you can use find + du's exclude feature to do this:



find -links +1 -type f \
  | cut -d/ -f2- \
  | du --exclude-from=- -s *


The find bit gets all the files (-type f) with a hardlink count greater than 1 (-links +1). The cut trims off the leading ./ that find prints. Then du is asked for the disk usage of every directory, excluding all the files with multiple links. Of course, once you delete a snapshot, it's possible there are now files with only one link that previously had two — so every few deletes, you really ought to re-run it.



If it needs to work with arbitrary file names, it'd require some more scripting to replace du (those are shell patterns, so escaping is not possible).



Also, as Stéphane Chazelas points out, if there are hardlinks inside of one snapshot (all the names of the file reside within a single snapshot, not hardlinks between snapshots), those files will be excluded from the totals (even though deleting the snapshot would recover that space).
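
To get a feel for whether that caveat applies, one could count the multi-link inodes whose links all sit inside a single snapshot (a rough sketch of my own, assuming GNU find; not part of the command above):

# one line per link: "<total-link-count> <inode>"; if an inode shows up as many
# times as its link count, every one of its links lives inside this snapshot
find snapshot-dir -type f -links +1 -printf '%n %i\n' |
  sort | uniq -c |
  awk '$1 == $2 {n++} END {print n+0, "multi-link inodes entirely inside this snapshot"}'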






answered Oct 31 at 20:33 by derobert, last edited Nov 1 at 16:19

  • This seems to work, thanks, which is why I have upvoted it. However, it takes extremely long, probably due to the many du calls. I believe there must be a solution to do this more effectively, which is why I don't accept your answer yet.
    – A. Donda
    Nov 2 at 16:51








  • @A.Donda there is only one du call, but it's passed a potentially very long list of exclude patterns — that might be slowing it down. Curious if putting LC_ALL=C in front of du speeds it up (as long as your file names are ASCII). I fear doing this quickly needs a utility that actually tracks all the files.
    – derobert
    Nov 2 at 17:04










  • Yes, I suspect too that it would be necessary to write one's own tool that basically does the same as du (builds a list of files and which inodes they refer to) but then processes this list differently.
    – A. Donda
    Nov 4 at 1:11










  • I now believe what makes your solution slow is that the list of excluded files is huge (overlap between snapshots is only on the order of 10%). I experimented with a solution which instead explicitly includes files, but for reasons I don't completely understand it wasn't really working. I then decided not to rely on du, but make a tool that creates and modifies file lists itself, see my new answer. Thanks again!
    – A. Donda
    Nov 18 at 1:44



















Since I wrote this answer, Stéphane Chazelas has convinced me that his answer was right all along. I leave my answer including code because it works well, too, and provides some pretty-printing. Its output looks like this:



              total              unique
--T---G---M---k---B --T---G---M---k---B
     91,044,435,456         665,754,624 back-2018-03-01T06:00:01
     91,160,015,360         625,541,632 back-2018-04-01T06:00:01
     91,235,970,560         581,360,640 back-2018-05-01T06:00:01
     91,474,846,208         897,665,536 back-2018-06-01T06:00:01
     91,428,597,760         668,853,760 back-2018-07-01T06:00:01
     91,602,767,360         660,594,176 back-2018-08-01T06:00:01
     91,062,218,752       1,094,236,160 back-2018-09-01T06:00:01
    230,810,647,552      50,314,291,712 back-2018-11-01T06:00:01
    220,587,811,328         256,036,352 back-2018-11-12T06:00:01
    220,605,425,664         267,876,352 back-2018-11-13T06:00:01
    220,608,163,328         268,711,424 back-2018-11-14T06:00:01
    220,882,714,112         272,000,000 back-2018-11-15T06:00:01
    220,882,118,656         263,202,304 back-2018-11-16T06:00:01
    220,882,081,792         263,165,440 back-2018-11-17T06:00:01
    220,894,113,280         312,208,896 back-2018-11-18T06:00:01




Since I wasn't 100% happy with either of the two answers (as of 2018-11-18) – though I learned from both of them – I created my own tool and am publishing it here.



Similar to Stéphane Chazelas's answer, it uses find to obtain a list of inodes and associated file/directory sizes, but doesn't rely on the "at most one link" heuristic. Instead, it creates a list of unique inodes (not files/directories!) for each input directory, filters out the inodes found in the other directories, and then sums the remaining inodes' sizes. This way it accounts for possible hard links within each input directory. As a side effect, it disregards possible hard links from outside the set of input directories.



bash-external tools that are used: find, xargs, mktemp, sort, tput, awk, tr, numfmt, touch, cat, comm, rm. I know, not exactly lightweight, but it does exactly what I want it to do. I share it here in case someone else has similar needs.



If anything can be done more efficiently or in a more foolproof way, comments are welcome! I'm anything but a bash master.



To use it, save the following code to a script file duu.sh. A short usage instruction is contained in the first comment block.



#!/bin/bash

# duu
#
# disk usage unique to a directory within a set of directories
#
# Call with a list of directory names. If called without arguments,
# it operates on the subdirectories of the current directory.


# no arguments: call itself with subdirectories of .
if [ "$#" -eq 0 ]
then
    exec find . -maxdepth 1 -type d ! -name . -printf '%P\0' | sort -z \
        | xargs -r --null "$0"
    exit
fi


# create temporary directory
T=`mktemp -d`
# array of directory names
dirs=("$@")
# number of directories
n="$#"

# for each directory, create list of (unique) inodes with size
for i in $(seq 1 $n)
do
    echo -n "reading $i/$n: ${dirs[$i - 1]} "
    find "${dirs[$i - 1]}" -printf "%i\t%b\n" | sort -u > "$T/$i"
    # find %b: "The amount of disk space used for this file in 512-byte blocks."
    echo -ne "\r"
    tput el
done

# print header
echo "              total              unique"
echo "--T---G---M---k---B --T---G---M---k---B"

# for each directory
for i in $(seq 1 $n)
do
    # compute and print total size
    # sum block sizes and multiply by 512
    awk '{s += $2} END{printf "%.0f", s * 512}' "$T/$i" \
        | tr -d '\n' \
        | numfmt --grouping --padding 19
    echo -n " "

    # compute and print unique size
    # create list of (unique) inodes in the other directories
    touch "$T/o$i"
    for j in $(seq 1 $n)
    do
        if [ "$j" -ne "$i" ]
        then
            cat "$T/$j" >> "$T/o$i"
        fi
    done
    sort -o "$T/o$i" -u "$T/o$i"
    # create list of (unique) inodes that are in this but not in the other directories
    comm -23 "$T/$i" "$T/o$i" > "$T/u$i"
    # sum block sizes and multiply by 512
    awk '{s += $2} END{printf "%.0f", s * 512}' "$T/u$i" \
        | tr -d '\n' \
        | numfmt --grouping --padding 19
    # append directory name
    echo " ${dirs[$i - 1]}"
done

# remove temporary files
rm -rf "$T"
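
For reference, a typical invocation might look like this (the back-2018-* names are just the example snapshot directories from the output above):

chmod +x duu.sh
./duu.sh back-2018-*     # explicit list of snapshot directories
./duu.sh                 # no arguments: all subdirectories of the current directory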





answered by A. Donda (the question's author)

  • I think you misunderstood my answer. It doesn't rely on the "at most one link" heuristic. It counts the disk usage of inodes that would be deleted if the directory was deleted, of all the files whose all links are found in the current directory.
    – Stéphane Chazelas
    Nov 18 at 9:00










  • @StéphaneChazelas, it's quite possible that I didn't understand your answer, and maybe it does exactly the right thing. If so, I would like to accept it. Could you explain your code in more detail?
    – A. Donda
    Nov 18 at 14:58










  • See edit of my answer. Does it make it any clearer?
    – Stéphane Chazelas
    Nov 18 at 16:49











Your Answer








StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "106"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














 

draft saved


draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f478977%2funique-contribution-of-folder-to-disk-usage%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























3 Answers
3






active

oldest

votes








3 Answers
3






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
2
down vote



accepted










You could do it by hand with GNU find:



find snapshot-dir -type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'


That counts the disk usage of files whose link count would go down to 0 after all the links found in the snapshot directory have been found.



find prints:





  • 1 <disk-usage> for directories


  • <link-count> <disk-usage> <inode-number> for other types of files.


We pretend the link count is always one for directories, because when in practice it's not, its because of the .. entries, and find doesn't list those entries, and directories generally don't have other hardlinks.



From that output, awk counts the disk usage of the entries that have link count of 1 and also of the inodes which it has seen <link-count> times (that is the ones whose all hard links are in the current directory and so, like the ones with a link-count of one would have their space reclaimed once the directory tree is deleted).



You can also use find snapshot-dir1 snapshot-dir2 to find out how much disk space would be reclaimed if both dirs were removed (which may be more than the sum of the space for the two directories taken individually if there are are files that are found in both and only in those snapshots).



If you want to find out how much space you would save after each snapshot-dir deletion (in a cumulated fashion), you could do:



find snapshot-dir* ( -path '*/*' -o -printf "%p:n" ) 
-type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '/:$/ {if (NR>1) print t*512; printf "%s ", $0; next}
$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'


That processes the list of snapshots in lexical order. If you processed it in a different order, that would likely give you different numbers except for the final one (when all snapshots are removed).



See numfmt to make the numbers more readable.



That assumes all files are on the same filesystem. If not, you can replace %i with %D:%i (if they're not all on the same filesystem, that would mean you'd have a mount point in there which you couldn't remove anyway).






share|improve this answer























  • This seems to work, thanks, which is why I have upvoted it. As you write, I would have to repeat this, which I can do with a simple loop. However, that would take a long time (like derobert's answer) and is therefore not really practical. I believe there must be a solution to do this more effectively for many or all snapshot folders at once, which is why I don't accept your answer yet.
    – A. Donda
    Nov 2 at 16:50






  • 1




    @A.Donda, I'm not sure what you mean. What do you want to repeat? What do you want to achieve? Do you want to know how many snapshots you need to remove to be able to reclaim say 1TB? Would you delete snapshots in sequence or based on some criteria?
    – Stéphane Chazelas
    Nov 2 at 16:55












  • @A.Donda, see if the edit answers your loop question.
    – Stéphane Chazelas
    Nov 2 at 17:18










  • The idea would be to have a list of the snapshots, each with the amount of space freed if that respective snapshot is deleted. This way I would avoid deleting snapshots that don't amount to much anyway, and not have to repeat your original command manually for each snapshot. I believe this is what you have solved with your update?
    – A. Donda
    Nov 4 at 1:10










  • I ended up creating my own bash script (see new answer), but I learned a lot from yours: I also use find, awk and numfmt. The difference is that I filter out inodes using comm. Thanks again!
    – A. Donda
    Nov 18 at 1:47















up vote
2
down vote



accepted










You could do it by hand with GNU find:



find snapshot-dir -type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'


That counts the disk usage of files whose link count would go down to 0 after all the links found in the snapshot directory have been found.



find prints:





  • 1 <disk-usage> for directories


  • <link-count> <disk-usage> <inode-number> for other types of files.


We pretend the link count is always one for directories, because when in practice it's not, its because of the .. entries, and find doesn't list those entries, and directories generally don't have other hardlinks.



From that output, awk counts the disk usage of the entries that have link count of 1 and also of the inodes which it has seen <link-count> times (that is the ones whose all hard links are in the current directory and so, like the ones with a link-count of one would have their space reclaimed once the directory tree is deleted).



You can also use find snapshot-dir1 snapshot-dir2 to find out how much disk space would be reclaimed if both dirs were removed (which may be more than the sum of the space for the two directories taken individually if there are are files that are found in both and only in those snapshots).



If you want to find out how much space you would save after each snapshot-dir deletion (in a cumulated fashion), you could do:



find snapshot-dir* ( -path '*/*' -o -printf "%p:n" ) 
-type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '/:$/ {if (NR>1) print t*512; printf "%s ", $0; next}
$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'


That processes the list of snapshots in lexical order. If you processed it in a different order, that would likely give you different numbers except for the final one (when all snapshots are removed).



See numfmt to make the numbers more readable.



That assumes all files are on the same filesystem. If not, you can replace %i with %D:%i (if they're not all on the same filesystem, that would mean you'd have a mount point in there which you couldn't remove anyway).






share|improve this answer























  • This seems to work, thanks, which is why I have upvoted it. As you write, I would have to repeat this, which I can do with a simple loop. However, that would take a long time (like derobert's answer) and is therefore not really practical. I believe there must be a solution to do this more effectively for many or all snapshot folders at once, which is why I don't accept your answer yet.
    – A. Donda
    Nov 2 at 16:50






  • 1




    @A.Donda, I'm not sure what you mean. What do you want to repeat? What do you want to achieve? Do you want to know how many snapshots you need to remove to be able to reclaim say 1TB? Would you delete snapshots in sequence or based on some criteria?
    – Stéphane Chazelas
    Nov 2 at 16:55












  • @A.Donda, see if the edit answers your loop question.
    – Stéphane Chazelas
    Nov 2 at 17:18










  • The idea would be to have a list of the snapshots, each with the amount of space freed if that respective snapshot is deleted. This way I would avoid deleting snapshots that don't amount to much anyway, and not have to repeat your original command manually for each snapshot. I believe this is what you have solved with your update?
    – A. Donda
    Nov 4 at 1:10










  • I ended up creating my own bash script (see new answer), but I learned a lot from yours: I also use find, awk and numfmt. The difference is that I filter out inodes using comm. Thanks again!
    – A. Donda
    Nov 18 at 1:47













up vote
2
down vote



accepted







up vote
2
down vote



accepted






You could do it by hand with GNU find:



find snapshot-dir -type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'


That counts the disk usage of files whose link count would go down to 0 after all the links found in the snapshot directory have been found.



find prints:





  • 1 <disk-usage> for directories


  • <link-count> <disk-usage> <inode-number> for other types of files.


We pretend the link count is always one for directories, because when in practice it's not, its because of the .. entries, and find doesn't list those entries, and directories generally don't have other hardlinks.



From that output, awk counts the disk usage of the entries that have link count of 1 and also of the inodes which it has seen <link-count> times (that is the ones whose all hard links are in the current directory and so, like the ones with a link-count of one would have their space reclaimed once the directory tree is deleted).



You can also use find snapshot-dir1 snapshot-dir2 to find out how much disk space would be reclaimed if both dirs were removed (which may be more than the sum of the space for the two directories taken individually if there are are files that are found in both and only in those snapshots).



If you want to find out how much space you would save after each snapshot-dir deletion (in a cumulated fashion), you could do:



find snapshot-dir* ( -path '*/*' -o -printf "%p:n" ) 
-type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '/:$/ {if (NR>1) print t*512; printf "%s ", $0; next}
$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'


That processes the list of snapshots in lexical order. If you processed it in a different order, that would likely give you different numbers except for the final one (when all snapshots are removed).



See numfmt to make the numbers more readable.



That assumes all files are on the same filesystem. If not, you can replace %i with %D:%i (if they're not all on the same filesystem, that would mean you'd have a mount point in there which you couldn't remove anyway).






share|improve this answer














You could do it by hand with GNU find:



find snapshot-dir -type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'


That counts the disk usage of files whose link count would go down to 0 after all the links found in the snapshot directory have been found.



find prints:





  • 1 <disk-usage> for directories


  • <link-count> <disk-usage> <inode-number> for other types of files.


We pretend the link count is always one for directories, because when in practice it's not, its because of the .. entries, and find doesn't list those entries, and directories generally don't have other hardlinks.



From that output, awk counts the disk usage of the entries that have link count of 1 and also of the inodes which it has seen <link-count> times (that is the ones whose all hard links are in the current directory and so, like the ones with a link-count of one would have their space reclaimed once the directory tree is deleted).



You can also use find snapshot-dir1 snapshot-dir2 to find out how much disk space would be reclaimed if both dirs were removed (which may be more than the sum of the space for the two directories taken individually if there are are files that are found in both and only in those snapshots).



If you want to find out how much space you would save after each snapshot-dir deletion (in a cumulated fashion), you could do:



find snapshot-dir* ( -path '*/*' -o -printf "%p:n" ) 
-type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '/:$/ {if (NR>1) print t*512; printf "%s ", $0; next}
$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'


That processes the list of snapshots in lexical order. If you processed it in a different order, that would likely give you different numbers except for the final one (when all snapshots are removed).



See numfmt to make the numbers more readable.



That assumes all files are on the same filesystem. If not, you can replace %i with %D:%i (if they're not all on the same filesystem, that would mean you'd have a mount point in there which you couldn't remove anyway).







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 18 at 16:47

























answered Oct 31 at 21:54









Stéphane Chazelas

294k54553894




294k54553894












  • This seems to work, thanks, which is why I have upvoted it. As you write, I would have to repeat this, which I can do with a simple loop. However, that would take a long time (like derobert's answer) and is therefore not really practical. I believe there must be a solution to do this more effectively for many or all snapshot folders at once, which is why I don't accept your answer yet.
    – A. Donda
    Nov 2 at 16:50






  • 1




    @A.Donda, I'm not sure what you mean. What do you want to repeat? What do you want to achieve? Do you want to know how many snapshots you need to remove to be able to reclaim say 1TB? Would you delete snapshots in sequence or based on some criteria?
    – Stéphane Chazelas
    Nov 2 at 16:55












  • @A.Donda, see if the edit answers your loop question.
    – Stéphane Chazelas
    Nov 2 at 17:18










  • The idea would be to have a list of the snapshots, each with the amount of space freed if that respective snapshot is deleted. This way I would avoid deleting snapshots that don't amount to much anyway, and not have to repeat your original command manually for each snapshot. I believe this is what you have solved with your update?
    – A. Donda
    Nov 4 at 1:10










  • I ended up creating my own bash script (see new answer), but I learned a lot from yours: I also use find, awk and numfmt. The difference is that I filter out inodes using comm. Thanks again!
    – A. Donda
    Nov 18 at 1:47


















  • This seems to work, thanks, which is why I have upvoted it. As you write, I would have to repeat this, which I can do with a simple loop. However, that would take a long time (like derobert's answer) and is therefore not really practical. I believe there must be a solution to do this more effectively for many or all snapshot folders at once, which is why I don't accept your answer yet.
    – A. Donda
    Nov 2 at 16:50






  • 1




    @A.Donda, I'm not sure what you mean. What do you want to repeat? What do you want to achieve? Do you want to know how many snapshots you need to remove to be able to reclaim say 1TB? Would you delete snapshots in sequence or based on some criteria?
    – Stéphane Chazelas
    Nov 2 at 16:55












  • @A.Donda, see if the edit answers your loop question.
    – Stéphane Chazelas
    Nov 2 at 17:18










  • The idea would be to have a list of the snapshots, each with the amount of space freed if that respective snapshot is deleted. This way I would avoid deleting snapshots that don't amount to much anyway, and not have to repeat your original command manually for each snapshot. I believe this is what you have solved with your update?
    – A. Donda
    Nov 4 at 1:10










  • I ended up creating my own bash script (see new answer), but I learned a lot from yours: I also use find, awk and numfmt. The difference is that I filter out inodes using comm. Thanks again!
    – A. Donda
    Nov 18 at 1:47
















This seems to work, thanks, which is why I have upvoted it. As you write, I would have to repeat this, which I can do with a simple loop. However, that would take a long time (like derobert's answer) and is therefore not really practical. I believe there must be a solution to do this more effectively for many or all snapshot folders at once, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:50




This seems to work, thanks, which is why I have upvoted it. As you write, I would have to repeat this, which I can do with a simple loop. However, that would take a long time (like derobert's answer) and is therefore not really practical. I believe there must be a solution to do this more effectively for many or all snapshot folders at once, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:50




1




1




@A.Donda, I'm not sure what you mean. What do you want to repeat? What do you want to achieve? Do you want to know how many snapshots you need to remove to be able to reclaim say 1TB? Would you delete snapshots in sequence or based on some criteria?
– Stéphane Chazelas
Nov 2 at 16:55






@A.Donda, I'm not sure what you mean. What do you want to repeat? What do you want to achieve? Do you want to know how many snapshots you need to remove to be able to reclaim say 1TB? Would you delete snapshots in sequence or based on some criteria?
– Stéphane Chazelas
Nov 2 at 16:55














@A.Donda, see if the edit answers your loop question.
– Stéphane Chazelas
Nov 2 at 17:18




@A.Donda, see if the edit answers your loop question.
– Stéphane Chazelas
Nov 2 at 17:18












The idea would be to have a list of the snapshots, each with the amount of space freed if that respective snapshot is deleted. This way I would avoid deleting snapshots that don't amount to much anyway, and not have to repeat your original command manually for each snapshot. I believe this is what you have solved with your update?
– A. Donda
Nov 4 at 1:10




The idea would be to have a list of the snapshots, each with the amount of space freed if that respective snapshot is deleted. This way I would avoid deleting snapshots that don't amount to much anyway, and not have to repeat your original command manually for each snapshot. I believe this is what you have solved with your update?
– A. Donda
Nov 4 at 1:10












I ended up creating my own bash script (see new answer), but I learned a lot from yours: I also use find, awk and numfmt. The difference is that I filter out inodes using comm. Thanks again!
– A. Donda
Nov 18 at 1:47




I ended up creating my own bash script (see new answer), but I learned a lot from yours: I also use find, awk and numfmt. The difference is that I filter out inodes using comm. Thanks again!
– A. Donda
Nov 18 at 1:47












up vote
1
down vote













If your file names don't contain pattern characters or newlines, you can use find + du's exclude feature to do this:



find -links +1 -type f 
| cut -d/ -f2-
| du --exclude-from=- -s *


The find bit gets all the files (-type f) with a hardlink count greater than 1 (-links +1). The cut trims off the leading ./ find prints out. Then du is asked for disk usage of every directory, excluding all the files with multiple links. Of course, once you delete a snapshot, it's possible there are now files with only one link that previously had two — so every few deletes, you really ought to re-run it.



If it needs to work with arbitrary file names, it'd require some more scripting to replace du (those are shell patterns, so escaping is not possible).



Also, as Stéphane Chazelas points out, if there are hardlinks inside of one snapshot (all the names of the file reside within a single snapshot, not hardlinks between snapshots), those files will be excluded from the totals (even though deleting the snapshot would recover that space).






share|improve this answer























  • This seems to work, thanks, which is why I have upvoted it. However, it takes extremely long, probably due to the many du calls. I believe there must be a solution to do this more effectively, which is why I don't accept your answer yet.
    – A. Donda
    Nov 2 at 16:51








  • 1




    @A.Donda there is only one du call, but it's passed a potentially very long list of exclude patterns — that might be slowing it down. Curious if putting LC_ALL=C in front of du speeds it up (as long as your file names are ASCII). I fear doing this quickly needs a utility that actually tracks all the files.
    – derobert
    Nov 2 at 17:04










  • Yes I suspect, too, that it would be necessary to write one's own tool, which basically does the same as du, build a list of files and which inodes they refer to, but them process this list differently.
    – A. Donda
    Nov 4 at 1:11










  • I now believe what makes your solution slow is that the list of excluded files is huge (overlap between snapshots is only on the order of 10%). I experimented with a solution which instead explicitly includes files, but for reasons I don't completely understand it wasn't really working. I then decided not to rely on du, but make a tool that creates and modifies file lists itself, see my new answer. Thanks again!
    – A. Donda
    Nov 18 at 1:44















up vote
1
down vote













If your file names don't contain pattern characters or newlines, you can use find + du's exclude feature to do this:



find -links +1 -type f 
| cut -d/ -f2-
| du --exclude-from=- -s *


The find bit gets all the files (-type f) with a hardlink count greater than 1 (-links +1). The cut trims off the leading ./ find prints out. Then du is asked for disk usage of every directory, excluding all the files with multiple links. Of course, once you delete a snapshot, it's possible there are now files with only one link that previously had two — so every few deletes, you really ought to re-run it.



If it needs to work with arbitrary file names, it'd require some more scripting to replace du (those are shell patterns, so escaping is not possible).



Also, as Stéphane Chazelas points out, if there are hardlinks inside of one snapshot (all the names of the file reside within a single snapshot, not hardlinks between snapshots), those files will be excluded from the totals (even though deleting the snapshot would recover that space).






share|improve this answer























  • This seems to work, thanks, which is why I have upvoted it. However, it takes extremely long, probably due to the many du calls. I believe there must be a solution to do this more effectively, which is why I don't accept your answer yet.
    – A. Donda
    Nov 2 at 16:51








  • 1




    @A.Donda there is only one du call, but it's passed a potentially very long list of exclude patterns — that might be slowing it down. Curious if putting LC_ALL=C in front of du speeds it up (as long as your file names are ASCII). I fear doing this quickly needs a utility that actually tracks all the files.
    – derobert
    Nov 2 at 17:04










  • Yes I suspect, too, that it would be necessary to write one's own tool, which basically does the same as du, build a list of files and which inodes they refer to, but them process this list differently.
    – A. Donda
    Nov 4 at 1:11










  • I now believe what makes your solution slow is that the list of excluded files is huge (overlap between snapshots is only on the order of 10%). I experimented with a solution which instead explicitly includes files, but for reasons I don't completely understand it wasn't really working. I then decided not to rely on du, but make a tool that creates and modifies file lists itself, see my new answer. Thanks again!
    – A. Donda
    Nov 18 at 1:44













up vote
1
down vote










up vote
1
down vote









If your file names don't contain pattern characters or newlines, you can use find + du's exclude feature to do this:



find -links +1 -type f 
| cut -d/ -f2-
| du --exclude-from=- -s *


The find bit gets all the files (-type f) with a hardlink count greater than 1 (-links +1). The cut trims off the leading ./ find prints out. Then du is asked for disk usage of every directory, excluding all the files with multiple links. Of course, once you delete a snapshot, it's possible there are now files with only one link that previously had two — so every few deletes, you really ought to re-run it.



If it needs to work with arbitrary file names, it'd require some more scripting to replace du (those are shell patterns, so escaping is not possible).



Also, as Stéphane Chazelas points out, if there are hardlinks inside of one snapshot (all the names of the file reside within a single snapshot, not hardlinks between snapshots), those files will be excluded from the totals (even though deleting the snapshot would recover that space).






share|improve this answer














If your file names don't contain pattern characters or newlines, you can use find + du's exclude feature to do this:



find -links +1 -type f 
| cut -d/ -f2-
| du --exclude-from=- -s *


The find bit gets all the files (-type f) with a hardlink count greater than 1 (-links +1). The cut trims off the leading ./ find prints out. Then du is asked for disk usage of every directory, excluding all the files with multiple links. Of course, once you delete a snapshot, it's possible there are now files with only one link that previously had two — so every few deletes, you really ought to re-run it.



If it needs to work with arbitrary file names, it'd require some more scripting to replace du (those are shell patterns, so escaping is not possible).



Also, as Stéphane Chazelas points out, if there are hardlinks inside of one snapshot (all the names of the file reside within a single snapshot, not hardlinks between snapshots), those files will be excluded from the totals (even though deleting the snapshot would recover that space).







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 1 at 16:19

























answered Oct 31 at 20:33









derobert

70.9k8151210




70.9k8151210












  • This seems to work, thanks, which is why I have upvoted it. However, it takes extremely long, probably due to the many du calls. I believe there must be a solution to do this more effectively, which is why I don't accept your answer yet.
    – A. Donda
    Nov 2 at 16:51








  • 1




    @A.Donda there is only one du call, but it's passed a potentially very long list of exclude patterns — that might be slowing it down. Curious if putting LC_ALL=C in front of du speeds it up (as long as your file names are ASCII). I fear doing this quickly needs a utility that actually tracks all the files.
    – derobert
    Nov 2 at 17:04










  • Yes I suspect, too, that it would be necessary to write one's own tool, which basically does the same as du, build a list of files and which inodes they refer to, but them process this list differently.
    – A. Donda
    Nov 4 at 1:11










  • I now believe what makes your solution slow is that the list of excluded files is huge (overlap between snapshots is only on the order of 10%). I experimented with a solution which instead explicitly includes files, but for reasons I don't completely understand it wasn't really working. I then decided not to rely on du, but make a tool that creates and modifies file lists itself, see my new answer. Thanks again!
    – A. Donda
    Nov 18 at 1:44


















  • This seems to work, thanks, which is why I have upvoted it. However, it takes extremely long, probably due to the many du calls. I believe there must be a solution to do this more effectively, which is why I don't accept your answer yet.
    – A. Donda
    Nov 2 at 16:51








  • 1




    @A.Donda there is only one du call, but it's passed a potentially very long list of exclude patterns — that might be slowing it down. Curious if putting LC_ALL=C in front of du speeds it up (as long as your file names are ASCII). I fear doing this quickly needs a utility that actually tracks all the files.
    – derobert
    Nov 2 at 17:04










  • Yes I suspect, too, that it would be necessary to write one's own tool, which basically does the same as du, build a list of files and which inodes they refer to, but them process this list differently.
    – A. Donda
    Nov 4 at 1:11










  • I now believe what makes your solution slow is that the list of excluded files is huge (overlap between snapshots is only on the order of 10%). I experimented with a solution which instead explicitly includes files, but for reasons I don't completely understand it wasn't really working. I then decided not to rely on du, but make a tool that creates and modifies file lists itself, see my new answer. Thanks again!
    – A. Donda
    Nov 18 at 1:44
















This seems to work, thanks, which is why I have upvoted it. However, it takes extremely long, probably due to the many du calls. I believe there must be a solution to do this more effectively, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:51






This seems to work, thanks, which is why I have upvoted it. However, it takes extremely long, probably due to the many du calls. I believe there must be a solution to do this more effectively, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:51






1




1




@A.Donda there is only one du call, but it's passed a potentially very long list of exclude patterns — that might be slowing it down. Curious if putting LC_ALL=C in front of du speeds it up (as long as your file names are ASCII). I fear doing this quickly needs a utility that actually tracks all the files.
– derobert
Nov 2 at 17:04




@A.Donda there is only one du call, but it's passed a potentially very long list of exclude patterns — that might be slowing it down. Curious if putting LC_ALL=C in front of du speeds it up (as long as your file names are ASCII). I fear doing this quickly needs a utility that actually tracks all the files.
– derobert
Nov 2 at 17:04












Yes I suspect, too, that it would be necessary to write one's own tool, which basically does the same as du, build a list of files and which inodes they refer to, but them process this list differently.
– A. Donda
Nov 4 at 1:11




Yes I suspect, too, that it would be necessary to write one's own tool, which basically does the same as du, build a list of files and which inodes they refer to, but them process this list differently.
– A. Donda
Nov 4 at 1:11












I now believe what makes your solution slow is that the list of excluded files is huge (overlap between snapshots is only on the order of 10%). I experimented with a solution which instead explicitly includes files, but for reasons I don't completely understand it wasn't really working. I then decided not to rely on du, but make a tool that creates and modifies file lists itself, see my new answer. Thanks again!
– A. Donda
Nov 18 at 1:44




I now believe what makes your solution slow is that the list of excluded files is huge (overlap between snapshots is only on the order of 10%). I experimented with a solution which instead explicitly includes files, but for reasons I don't completely understand it wasn't really working. I then decided not to rely on du, but make a tool that creates and modifies file lists itself, see my new answer. Thanks again!
– A. Donda
Nov 18 at 1:44










up vote
0
down vote













Since I wrote this answer, Stéphane Chazelas has convinced me that his answer was right all along. I leave my answer including code because it works well, too, and provides some pretty-printing. Its output looks like this:



              total               unique
--T---G---M---k---B --T---G---M---k---B
91,044,435,456 665,754,624 back-2018-03-01T06:00:01
91,160,015,360 625,541,632 back-2018-04-01T06:00:01
91,235,970,560 581,360,640 back-2018-05-01T06:00:01
91,474,846,208 897,665,536 back-2018-06-01T06:00:01
91,428,597,760 668,853,760 back-2018-07-01T06:00:01
91,602,767,360 660,594,176 back-2018-08-01T06:00:01
91,062,218,752 1,094,236,160 back-2018-09-01T06:00:01
230,810,647,552 50,314,291,712 back-2018-11-01T06:00:01
220,587,811,328 256,036,352 back-2018-11-12T06:00:01
220,605,425,664 267,876,352 back-2018-11-13T06:00:01
220,608,163,328 268,711,424 back-2018-11-14T06:00:01
220,882,714,112 272,000,000 back-2018-11-15T06:00:01
220,882,118,656 263,202,304 back-2018-11-16T06:00:01
220,882,081,792 263,165,440 back-2018-11-17T06:00:01
220,894,113,280 312,208,896 back-2018-11-18T06:00:01




Since I wasn't 100% happy with either of the two answers (as of 2018-11-18) – though I learned from both of them – I created my own tool and am publishing it here.



Similar to Stéphane Chazelas's answer, it uses find to obtain a list of inodes and associated file / directory sizes, but doesn't rely on the "at most one link" heuristic. Instead, it creates a list of unique inodes (not files/directories!) for each input directory, filters out the inodes from the other directories, and them sums the remaining inodes' sizes. This way it accounts for possible hardlinks within each input directory. As a side effect, it disregards possible hardlinks from outside of the set of input directories.



bash-external tools that are used: find, xargs, mktemp, sort, tput, awk, tr, numfmt, touch, cat, comm, rm. I know, not exactly lightweight, but it does exactly what I want it to do. I share it here in case someone else has similar needs.



If anything can be done more efficiently or foolproof, comments are welcome! I'm anything but a bash master.



To use it, save the following code to a script file duu.sh. A short usage instruction is contained in the first comment block.



#!/bin/bash

# duu
#
# disk usage unique to a directory within a set of directories
#
# Call with a list of directory names. If called without arguments,
# it operates on the subdirectories of the current directory.


# no arguments: call itself with subdirectories of .
if [ "$#" -eq 0 ]
then
exec find . -maxdepth 1 -type d ! -name . -printf '%P' | sort -z
| xargs -r --null "$0"
exit
fi


# create temporary directory
T=`mktemp -d`
# array of directory names
dirs=("$@")
# number of directories
n="$#"

# for each directory, create list of (unique) inodes with size
for i in $(seq 1 $n)
do
echo -n "reading $i/$n: ${dirs[$i - 1]} "
find "${dirs[$i - 1]}" -printf "%it%bn" | sort -u > "$T/$i"
# find %b: "The amount of disk space used for this file in 512-byte blocks."
echo -ne "r"
tput el
done

# print header
echo " total unique"
echo "--T---G---M---k---B --T---G---M---k---B"

# for each directory
for i in $(seq 1 $n)
do
# compute and print total size
# sum block sizes and multiply by 512
awk '{s += $2} END{printf "%.0f", s * 512}' "$T/$i"
| tr -d 'n'
| numfmt --grouping --padding 19
echo -n " "

# compute and print unique size
# create list of (unique) inodes in the other directories
touch "$T/o$i"
for j in $(seq 1 $n)
do
if [ "$j" -ne "$i" ]
then
cat "$T/$j" >> "$T/o$i"
fi
done
sort -o "$T/o$i" -u "$T/o$i"
# create list of (unique) inodes that are in this but not in the other directories
comm -23 "$T/$i" "$T/o$i" > "$T/u$i"
# sum block sizes and multiply by 512
awk '{s += $2} END{printf "%.0f", s * 512}' "$T/u$i"
| tr -d 'n'
| numfmt --grouping --padding 19
# append directory name
echo " ${dirs[$i - 1]}"
done

# remove temporary files
rm -rf "$T"





share|improve this answer
  • I think you misunderstood my answer. It doesn't rely on the "at most one link" heuristic. It counts the disk usage of inodes that would be deleted if the directory was deleted, of all the files whose all links are found in the current directory.
    – Stéphane Chazelas
    Nov 18 at 9:00










  • @StéphaneChazelas, it's quite possible that I didn't understand your answer, and maybe it does exactly the right thing. If so, I would like to accept it. Could you explain your code in more detail?
    – A. Donda
    Nov 18 at 14:58










  • See edit of my answer. Does it make it any clearer?
    – Stéphane Chazelas
    Nov 18 at 16:49
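
For comparison, the approach described in Stéphane Chazelas's first comment above could be sketched roughly like this. This is only my illustration of his description, not his code; DIR is a placeholder, and directory inodes (whose link count includes "." and "..") effectively drop out of the sum.

# sum the blocks of inodes all of whose hard links live under DIR,
# i.e. the space that deleting DIR would actually free
find DIR -printf '%i\t%n\t%b\n' \
    | awk -F'\t' '
        { seen[$1]++; links[$1] = $2; blocks[$1] = $3 }
        END {
            for (i in seen)
                if (seen[i] == links[i])   # every link of this inode is inside DIR
                    s += blocks[i]
            printf "%.0f bytes would be freed by deleting DIR\n", s * 512
        }'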

answered Nov 18 at 1:33 by A. Donda, last edited Nov 18 at 17:53