Unique contribution of a folder to disk usage
I have a backup containing folders for daily snapshots. To save space, identical files in different snapshots are deduplicated via hard links (generated by rsync).
When I'm running out of space, one option is to delete older snapshots. But because of the hard links, it is hard to figure out how much space I would gain by deleting a given snapshot.
One option I can think of would be to use du -s first on all snapshot folders, then on all but the one I might delete; the difference would give me the expected gained space. However, that's quite cumbersome and would have to be repeated when I'm trying to find a suitable snapshot for deletion.
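For concreteness, the kind of comparison I have in mind looks like this (just a rough sketch; snapshot-* and the 2018-09-01 candidate are placeholder names, not my real folder layout):

du -cs snapshot-*/ | tail -n 1                                # grand total over all snapshots
du -cs $(ls -d snapshot-*/ | grep -v 2018-09-01) | tail -n 1  # grand total without the candidate
# du counts each hard-linked file only once per invocation, so the difference
# between the two grand totals is the space the candidate would free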
Is there an easier way?
After trying out and thinking about the answers by Stéphane Chazelas and derobert, I realized that my question was not precise enough. Here's an attempt to be more precise:
I have a set of directories ("snapshots") whose files are partially storage-identical with (hard linked to) files in other snapshots. I'm looking for a solution that gives me a list of the snapshots and, for each one, the amount of disk storage taken up by its files, excluding storage that is also used by a file in another snapshot. The solution should allow for the possibility that there are hard links within each snapshot, too.
The idea is that I can look at that list to decide which of the snapshots to delete when I run out of space, weighing the storage space gained by deletion against the value of the snapshot (e.g. based on age).
disk-usage hard-link
asked Oct 31 at 20:02 by A. Donda, edited Nov 17 at 21:54
See also unix.stackexchange.com/questions/52876/…
– derobert
Nov 2 at 17:09
But again, looking at the unique disk usage of directories in isolation is not necessarily useful. You may find that deleting dir1 saves nothing, that deleting dir2 saves nothing either, but that deleting both saves terabytes because they have large files in common that are not found elsewhere.
– Stéphane Chazelas
Nov 18 at 17:12
3 Answers
Accepted answer (2 votes)
answered Oct 31 at 21:54 by Stéphane Chazelas, edited Nov 18 at 16:47
You could do it by hand with GNU find:
find snapshot-dir -type d -printf '1 %b\n' -o -printf '%n %b %i\n' |
  awk '$1 == 1 || ++c[$3] == $1 {t+=$2; delete c[$3]}
       END{print t*512}'
That counts the disk usage of files whose link count would drop to 0 once all the links found in the snapshot directory are removed.
find prints:
  1 <disk-usage>                             for directories
  <link-count> <disk-usage> <inode-number>   for other types of files.
We pretend the link count is always one for directories because, when in practice it isn't, that is due to the .. entries of subdirectories; find doesn't list those entries, and directories generally don't have other hard links.
From that output, awk adds up the disk usage of the entries that have a link count of 1 and of the inodes it has seen <link-count> times (that is, the ones all of whose hard links are inside the given directory and which, like the ones with a link count of one, would have their space reclaimed once the directory tree is deleted).
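For reference, here is the awk part again, just annotated (same logic, nothing changed):

awk '
    # input: "1 <blocks>" for directories, "<link-count> <blocks> <inode>" otherwise
    $1 == 1 || ++c[$3] == $1 {   # sole link, or we have now seen every link of this inode
        t += $2                  # add its 512-byte block count to the running total
        delete c[$3]             # counter no longer needed
    }
    END { print t * 512 }        # convert blocks to bytes
'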
You can also use find snapshot-dir1 snapshot-dir2 to find out how much disk space would be reclaimed if both dirs were removed (which may be more than the sum of the space for the two directories taken individually if there are files that are found in both and only in those snapshots).
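If you want one figure per snapshot (the space freed by deleting just that one while keeping all the others), a plain loop around the command above also works; a rough sketch, assuming the snapshots are the snapshot-* subdirectories of the current directory:

for d in snapshot-*/; do
    printf '%s ' "$d"
    find "$d" -type d -printf '1 %b\n' -o -printf '%n %b %i\n' |
        awk '$1 == 1 || ++c[$3] == $1 {t+=$2; delete c[$3]} END{print t*512}'
done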
If you want to find out how much space you would save after each snapshot-dir deletion (cumulatively), you could do:
find snapshot-dir* \( -path '*/*' -o -printf "%p:\n" \) \
  -type d -printf '1 %b\n' -o -printf '%n %b %i\n' |
  awk '/:$/ {if (NR>1) print t*512; printf "%s ", $0; next}
       $1 == 1 || ++c[$3] == $1 {t+=$2; delete c[$3]}
       END{print t*512}'
That processes the list of snapshots in lexical order. If you processed it in a different order, that would likely give you different numbers except for the final one (when all snapshots are removed).
See numfmt to make the numbers more readable.
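For instance, appending a numfmt stage to the cumulative command above renders the byte counts (the second whitespace-separated field, so this assumes the snapshot names contain no blanks) in human-readable form:

... | numfmt --field=2 --to=iec    # second field rendered with K/M/G/T suffixes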
That assumes all files are on the same filesystem. If not, you can replace %i with %D:%i (if they're not all on the same filesystem, that would mean you'd have a mount point in there which you couldn't remove anyway).
This seems to work, thanks, which is why I have upvoted it. As you write, I would have to repeat this, which I can do with a simple loop. However, that would take a long time (like derobert's answer) and is therefore not really practical. I believe there must be a solution to do this more effectively for many or all snapshot folders at once, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:50
@A.Donda, I'm not sure what you mean. What do you want to repeat? What do you want to achieve? Do you want to know how many snapshots you need to remove to be able to reclaim say 1TB? Would you delete snapshots in sequence or based on some criteria?
– Stéphane Chazelas
Nov 2 at 16:55
@A.Donda, see if the edit answers your loop question.
– Stéphane Chazelas
Nov 2 at 17:18
The idea would be to have a list of the snapshots, each with the amount of space freed if that respective snapshot is deleted. This way I would avoid deleting snapshots that don't amount to much anyway, and not have to repeat your original command manually for each snapshot. I believe this is what you have solved with your update?
– A. Donda
Nov 4 at 1:10
I ended up creating my own bash script (see new answer), but I learned a lot from yours: I also use find, awk and numfmt. The difference is that I filter out inodes using comm. Thanks again!
– A. Donda
Nov 18 at 1:47
Answer (1 vote)
answered Oct 31 at 20:33 by derobert, edited Nov 1 at 16:19
If your file names don't contain pattern characters or newlines, you can use find + du's exclude feature to do this:
find -links +1 -type f \
  | cut -d/ -f2- \
  | du --exclude-from=- -s *
The find bit gets all the files (-type f) with a hardlink count greater than 1 (-links +1). The cut trims off the leading ./ that find prints out. Then du is asked for the disk usage of every directory, excluding all the files with multiple links. Of course, once you delete a snapshot, it's possible there are now files with only one link that previously had two; so every few deletes, you really ought to re-run it.
If it needs to work with arbitrary file names, it'd require some more scripting to replace du (those are shell patterns, so escaping is not possible).
Also, as Stéphane Chazelas points out, if there are hardlinks inside of one snapshot (all the names of the file reside within a single snapshot, not hardlinks between snapshots), those files will be excluded from the totals (even though deleting the snapshot would recover that space).
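For what it's worth, a possible invocation from the directory holding the snapshot folders, with the LC_ALL=C speed-up suggested in the comments below and -h added for human-readable sizes (/backups is just a placeholder path):

cd /backups
find -links +1 -type f \
    | cut -d/ -f2- \
    | LC_ALL=C du --exclude-from=- -sh *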
This seems to work, thanks, which is why I have upvoted it. However, it takes extremely long, probably due to the many du calls. I believe there must be a solution to do this more effectively, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:51
@A.Donda there is only one du call, but it's passed a potentially very long list of exclude patterns, which might be slowing it down. Curious if putting LC_ALL=C in front of du speeds it up (as long as your file names are ASCII). I fear doing this quickly needs a utility that actually tracks all the files.
– derobert
Nov 2 at 17:04
Yes, I suspect, too, that it would be necessary to write one's own tool which basically does the same as du, building a list of files and the inodes they refer to, but then processes this list differently.
– A. Donda
Nov 4 at 1:11
I now believe what makes your solution slow is that the list of excluded files is huge (overlap between snapshots is only on the order of 10%). I experimented with a solution which instead explicitly includes files, but for reasons I don't completely understand it wasn't really working. I then decided not to rely on du, but make a tool that creates and modifies file lists itself, see my new answer. Thanks again!
– A. Donda
Nov 18 at 1:44
Answer (0 votes)
answered by A. Donda (the asker)
Since I wrote this answer, Stéphane Chazelas has convinced me that his answer was right all along. I leave my answer including code because it works well, too, and provides some pretty-printing. Its output looks like this:
              total              unique
--T---G---M---k---B --T---G---M---k---B
     91,044,435,456         665,754,624 back-2018-03-01T06:00:01
     91,160,015,360         625,541,632 back-2018-04-01T06:00:01
     91,235,970,560         581,360,640 back-2018-05-01T06:00:01
     91,474,846,208         897,665,536 back-2018-06-01T06:00:01
     91,428,597,760         668,853,760 back-2018-07-01T06:00:01
     91,602,767,360         660,594,176 back-2018-08-01T06:00:01
     91,062,218,752       1,094,236,160 back-2018-09-01T06:00:01
    230,810,647,552      50,314,291,712 back-2018-11-01T06:00:01
    220,587,811,328         256,036,352 back-2018-11-12T06:00:01
    220,605,425,664         267,876,352 back-2018-11-13T06:00:01
    220,608,163,328         268,711,424 back-2018-11-14T06:00:01
    220,882,714,112         272,000,000 back-2018-11-15T06:00:01
    220,882,118,656         263,202,304 back-2018-11-16T06:00:01
    220,882,081,792         263,165,440 back-2018-11-17T06:00:01
    220,894,113,280         312,208,896 back-2018-11-18T06:00:01
Since I wasn't 100% happy with either of the two answers (as of 2018-11-18) – though I learned from both of them – I created my own tool and am publishing it here.
Similar to Stéphane Chazelas's answer, it uses find to obtain a list of inodes and associated file / directory sizes, but doesn't rely on the "at most one link" heuristic. Instead, it creates a list of unique inodes (not files/directories!) for each input directory, filters out the inodes found in the other directories, and then sums the remaining inodes' sizes. This way it accounts for possible hard links within each input directory. As a side effect, it disregards possible hard links from outside of the set of input directories.
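The filtering itself is plain set subtraction on sorted lists; a toy illustration of the comm -23 step (the files a and b are made up for the example):

printf '1\n2\n3\n' > a    # sorted inode list of the snapshot under consideration
printf '2\n4\n'   > b     # sorted, merged inode lists of all the other snapshots
comm -23 a b              # prints 1 and 3: the inodes unique to this snapshot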
bash-external tools that are used: find, xargs, mktemp, sort, tput, awk, tr, numfmt, touch, cat, comm, rm. I know, not exactly lightweight, but it does exactly what I want it to do. I share it here in case someone else has similar needs.
If anything can be done more efficiently or in a more foolproof way, comments are welcome! I'm anything but a bash master.
To use it, save the following code to a script file duu.sh. A short usage instruction is contained in the first comment block.
#!/bin/bash

# duu
#
# disk usage unique to a directory within a set of directories
#
# Call with a list of directory names. If called without arguments,
# it operates on the subdirectories of the current directory.

# no arguments: call itself with subdirectories of .
if [ "$#" -eq 0 ]
then
    exec find . -maxdepth 1 -type d ! -name . -printf '%P\0' | sort -z \
        | xargs -r --null "$0"
    exit
fi

# create temporary directory
T=`mktemp -d`

# array of directory names
dirs=("$@")
# number of directories
n="$#"

# for each directory, create list of (unique) inodes with size
for i in $(seq 1 $n)
do
    echo -n "reading $i/$n: ${dirs[$i - 1]} "
    find "${dirs[$i - 1]}" -printf "%i\t%b\n" | sort -u > "$T/$i"
    # find %b: "The amount of disk space used for this file in 512-byte blocks."
    echo -ne "\r"
    tput el
done

# print header
echo "              total              unique"
echo "--T---G---M---k---B --T---G---M---k---B"

# for each directory
for i in $(seq 1 $n)
do
    # compute and print total size
    # sum block sizes and multiply by 512
    awk '{s += $2} END{printf "%.0f", s * 512}' "$T/$i" \
        | tr -d '\n' \
        | numfmt --grouping --padding 19
    echo -n " "

    # compute and print unique size
    # create list of (unique) inodes in the other directories
    touch "$T/o$i"
    for j in $(seq 1 $n)
    do
        if [ "$j" -ne "$i" ]
        then
            cat "$T/$j" >> "$T/o$i"
        fi
    done
    sort -o "$T/o$i" -u "$T/o$i"
    # create list of (unique) inodes that are in this but not in the other directories
    comm -23 "$T/$i" "$T/o$i" > "$T/u$i"
    # sum block sizes and multiply by 512
    awk '{s += $2} END{printf "%.0f", s * 512}' "$T/u$i" \
        | tr -d '\n' \
        | numfmt --grouping --padding 19
    # append directory name
    echo " ${dirs[$i - 1]}"
done

# remove temporary files
rm -rf "$T"
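A hypothetical invocation, assuming the snapshot folders are the back-* directories under /backup (both names are placeholders):

cd /backup
bash /path/to/duu.sh back-*/
# or, using the no-argument mode, on every subdirectory of the current directory:
bash /path/to/duu.sh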
I think you misunderstood my answer. It doesn't rely on the "at most one link" heuristic. It counts the disk usage of the inodes that would be deleted if the directory was deleted, that is, of all the files all of whose links are found in the given directory.
– Stéphane Chazelas
Nov 18 at 9:00
@StéphaneChazelas, it's quite possible that I didn't understand your answer, and maybe it does exactly the right thing. If so, I would like to accept it. Could you explain your code in more detail?
– A. Donda
Nov 18 at 14:58
See edit of my answer. Does it make it any clearer?
– Stéphane Chazelas
Nov 18 at 16:49
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
2
down vote
accepted
You could do it by hand with GNU find
:
find snapshot-dir -type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'
That counts the disk usage of files whose link count would go down to 0 after all the links found in the snapshot directory have been found.
find
prints:
1 <disk-usage>
for directories
<link-count> <disk-usage> <inode-number>
for other types of files.
We pretend the link count is always one for directories, because when in practice it's not, its because of the ..
entries, and find
doesn't list those entries, and directories generally don't have other hardlinks.
From that output, awk
counts the disk usage of the entries that have link count of 1 and also of the inodes which it has seen <link-count>
times (that is the ones whose all hard links are in the current directory and so, like the ones with a link-count of one would have their space reclaimed once the directory tree is deleted).
You can also use find snapshot-dir1 snapshot-dir2
to find out how much disk space would be reclaimed if both dirs were removed (which may be more than the sum of the space for the two directories taken individually if there are are files that are found in both and only in those snapshots).
If you want to find out how much space you would save after each snapshot-dir deletion (in a cumulated fashion), you could do:
find snapshot-dir* ( -path '*/*' -o -printf "%p:n" )
-type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '/:$/ {if (NR>1) print t*512; printf "%s ", $0; next}
$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'
That processes the list of snapshots in lexical order. If you processed it in a different order, that would likely give you different numbers except for the final one (when all snapshots are removed).
See numfmt
to make the numbers more readable.
That assumes all files are on the same filesystem. If not, you can replace %i
with %D:%i
(if they're not all on the same filesystem, that would mean you'd have a mount point in there which you couldn't remove anyway).
This seems to work, thanks, which is why I have upvoted it. As you write, I would have to repeat this, which I can do with a simple loop. However, that would take a long time (like derobert's answer) and is therefore not really practical. I believe there must be a solution to do this more effectively for many or all snapshot folders at once, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:50
1
@A.Donda, I'm not sure what you mean. What do you want to repeat? What do you want to achieve? Do you want to know how many snapshots you need to remove to be able to reclaim say 1TB? Would you delete snapshots in sequence or based on some criteria?
– Stéphane Chazelas
Nov 2 at 16:55
@A.Donda, see if the edit answers your loop question.
– Stéphane Chazelas
Nov 2 at 17:18
The idea would be to have a list of the snapshots, each with the amount of space freed if that respective snapshot is deleted. This way I would avoid deleting snapshots that don't amount to much anyway, and not have to repeat your original command manually for each snapshot. I believe this is what you have solved with your update?
– A. Donda
Nov 4 at 1:10
I ended up creating my own bash script (see new answer), but I learned a lot from yours: I also use find, awk and numfmt. The difference is that I filter out inodes using comm. Thanks again!
– A. Donda
Nov 18 at 1:47
|
show 2 more comments
up vote
2
down vote
accepted
You could do it by hand with GNU find
:
find snapshot-dir -type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'
That counts the disk usage of files whose link count would go down to 0 after all the links found in the snapshot directory have been found.
find
prints:
1 <disk-usage>
for directories
<link-count> <disk-usage> <inode-number>
for other types of files.
We pretend the link count is always one for directories, because when in practice it's not, its because of the ..
entries, and find
doesn't list those entries, and directories generally don't have other hardlinks.
From that output, awk
counts the disk usage of the entries that have link count of 1 and also of the inodes which it has seen <link-count>
times (that is the ones whose all hard links are in the current directory and so, like the ones with a link-count of one would have their space reclaimed once the directory tree is deleted).
You can also use find snapshot-dir1 snapshot-dir2
to find out how much disk space would be reclaimed if both dirs were removed (which may be more than the sum of the space for the two directories taken individually if there are are files that are found in both and only in those snapshots).
If you want to find out how much space you would save after each snapshot-dir deletion (in a cumulated fashion), you could do:
find snapshot-dir* ( -path '*/*' -o -printf "%p:n" )
-type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '/:$/ {if (NR>1) print t*512; printf "%s ", $0; next}
$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'
That processes the list of snapshots in lexical order. If you processed it in a different order, that would likely give you different numbers except for the final one (when all snapshots are removed).
See numfmt
to make the numbers more readable.
That assumes all files are on the same filesystem. If not, you can replace %i
with %D:%i
(if they're not all on the same filesystem, that would mean you'd have a mount point in there which you couldn't remove anyway).
This seems to work, thanks, which is why I have upvoted it. As you write, I would have to repeat this, which I can do with a simple loop. However, that would take a long time (like derobert's answer) and is therefore not really practical. I believe there must be a solution to do this more effectively for many or all snapshot folders at once, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:50
1
@A.Donda, I'm not sure what you mean. What do you want to repeat? What do you want to achieve? Do you want to know how many snapshots you need to remove to be able to reclaim say 1TB? Would you delete snapshots in sequence or based on some criteria?
– Stéphane Chazelas
Nov 2 at 16:55
@A.Donda, see if the edit answers your loop question.
– Stéphane Chazelas
Nov 2 at 17:18
The idea would be to have a list of the snapshots, each with the amount of space freed if that respective snapshot is deleted. This way I would avoid deleting snapshots that don't amount to much anyway, and not have to repeat your original command manually for each snapshot. I believe this is what you have solved with your update?
– A. Donda
Nov 4 at 1:10
I ended up creating my own bash script (see new answer), but I learned a lot from yours: I also use find, awk and numfmt. The difference is that I filter out inodes using comm. Thanks again!
– A. Donda
Nov 18 at 1:47
|
show 2 more comments
up vote
2
down vote
accepted
up vote
2
down vote
accepted
You could do it by hand with GNU find
:
find snapshot-dir -type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'
That counts the disk usage of files whose link count would go down to 0 after all the links found in the snapshot directory have been found.
find
prints:
1 <disk-usage>
for directories
<link-count> <disk-usage> <inode-number>
for other types of files.
We pretend the link count is always one for directories, because when in practice it's not, its because of the ..
entries, and find
doesn't list those entries, and directories generally don't have other hardlinks.
From that output, awk
counts the disk usage of the entries that have link count of 1 and also of the inodes which it has seen <link-count>
times (that is the ones whose all hard links are in the current directory and so, like the ones with a link-count of one would have their space reclaimed once the directory tree is deleted).
You can also use find snapshot-dir1 snapshot-dir2
to find out how much disk space would be reclaimed if both dirs were removed (which may be more than the sum of the space for the two directories taken individually if there are are files that are found in both and only in those snapshots).
If you want to find out how much space you would save after each snapshot-dir deletion (in a cumulated fashion), you could do:
find snapshot-dir* ( -path '*/*' -o -printf "%p:n" )
-type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '/:$/ {if (NR>1) print t*512; printf "%s ", $0; next}
$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'
That processes the list of snapshots in lexical order. If you processed it in a different order, that would likely give you different numbers except for the final one (when all snapshots are removed).
See numfmt
to make the numbers more readable.
That assumes all files are on the same filesystem. If not, you can replace %i
with %D:%i
(if they're not all on the same filesystem, that would mean you'd have a mount point in there which you couldn't remove anyway).
You could do it by hand with GNU find
:
find snapshot-dir -type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'
That counts the disk usage of files whose link count would go down to 0 after all the links found in the snapshot directory have been found.
find
prints:
1 <disk-usage>
for directories
<link-count> <disk-usage> <inode-number>
for other types of files.
We pretend the link count is always one for directories, because when in practice it's not, its because of the ..
entries, and find
doesn't list those entries, and directories generally don't have other hardlinks.
From that output, awk
counts the disk usage of the entries that have link count of 1 and also of the inodes which it has seen <link-count>
times (that is the ones whose all hard links are in the current directory and so, like the ones with a link-count of one would have their space reclaimed once the directory tree is deleted).
You can also use find snapshot-dir1 snapshot-dir2
to find out how much disk space would be reclaimed if both dirs were removed (which may be more than the sum of the space for the two directories taken individually if there are are files that are found in both and only in those snapshots).
If you want to find out how much space you would save after each snapshot-dir deletion (in a cumulated fashion), you could do:
find snapshot-dir* ( -path '*/*' -o -printf "%p:n" )
-type d -printf '1 %bn' -o -printf '%n %b %in' |
awk '/:$/ {if (NR>1) print t*512; printf "%s ", $0; next}
$1 == 1 || ++c[$3] == $1 {t+=$2;delete c[$3]}
END{print t*512}'
That processes the list of snapshots in lexical order. If you processed it in a different order, that would likely give you different numbers except for the final one (when all snapshots are removed).
See numfmt
to make the numbers more readable.
That assumes all files are on the same filesystem. If not, you can replace %i
with %D:%i
(if they're not all on the same filesystem, that would mean you'd have a mount point in there which you couldn't remove anyway).
edited Nov 18 at 16:47
answered Oct 31 at 21:54
Stéphane Chazelas
294k54553894
294k54553894
This seems to work, thanks, which is why I have upvoted it. As you write, I would have to repeat this, which I can do with a simple loop. However, that would take a long time (like derobert's answer) and is therefore not really practical. I believe there must be a solution to do this more effectively for many or all snapshot folders at once, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:50
1
@A.Donda, I'm not sure what you mean. What do you want to repeat? What do you want to achieve? Do you want to know how many snapshots you need to remove to be able to reclaim say 1TB? Would you delete snapshots in sequence or based on some criteria?
– Stéphane Chazelas
Nov 2 at 16:55
@A.Donda, see if the edit answers your loop question.
– Stéphane Chazelas
Nov 2 at 17:18
The idea would be to have a list of the snapshots, each with the amount of space freed if that respective snapshot is deleted. This way I would avoid deleting snapshots that don't amount to much anyway, and not have to repeat your original command manually for each snapshot. I believe this is what you have solved with your update?
– A. Donda
Nov 4 at 1:10
I ended up creating my own bash script (see new answer), but I learned a lot from yours: I also use find, awk and numfmt. The difference is that I filter out inodes using comm. Thanks again!
– A. Donda
Nov 18 at 1:47
|
show 2 more comments
This seems to work, thanks, which is why I have upvoted it. As you write, I would have to repeat this, which I can do with a simple loop. However, that would take a long time (like derobert's answer) and is therefore not really practical. I believe there must be a solution to do this more effectively for many or all snapshot folders at once, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:50
1
@A.Donda, I'm not sure what you mean. What do you want to repeat? What do you want to achieve? Do you want to know how many snapshots you need to remove to be able to reclaim say 1TB? Would you delete snapshots in sequence or based on some criteria?
– Stéphane Chazelas
Nov 2 at 16:55
@A.Donda, see if the edit answers your loop question.
– Stéphane Chazelas
Nov 2 at 17:18
The idea would be to have a list of the snapshots, each with the amount of space freed if that respective snapshot is deleted. This way I would avoid deleting snapshots that don't amount to much anyway, and not have to repeat your original command manually for each snapshot. I believe this is what you have solved with your update?
– A. Donda
Nov 4 at 1:10
I ended up creating my own bash script (see new answer), but I learned a lot from yours: I also use find, awk and numfmt. The difference is that I filter out inodes using comm. Thanks again!
– A. Donda
Nov 18 at 1:47
This seems to work, thanks, which is why I have upvoted it. As you write, I would have to repeat this, which I can do with a simple loop. However, that would take a long time (like derobert's answer) and is therefore not really practical. I believe there must be a solution to do this more effectively for many or all snapshot folders at once, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:50
This seems to work, thanks, which is why I have upvoted it. As you write, I would have to repeat this, which I can do with a simple loop. However, that would take a long time (like derobert's answer) and is therefore not really practical. I believe there must be a solution to do this more effectively for many or all snapshot folders at once, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:50
1
1
@A.Donda, I'm not sure what you mean. What do you want to repeat? What do you want to achieve? Do you want to know how many snapshots you need to remove to be able to reclaim say 1TB? Would you delete snapshots in sequence or based on some criteria?
– Stéphane Chazelas
Nov 2 at 16:55
@A.Donda, I'm not sure what you mean. What do you want to repeat? What do you want to achieve? Do you want to know how many snapshots you need to remove to be able to reclaim say 1TB? Would you delete snapshots in sequence or based on some criteria?
– Stéphane Chazelas
Nov 2 at 16:55
@A.Donda, see if the edit answers your loop question.
– Stéphane Chazelas
Nov 2 at 17:18
@A.Donda, see if the edit answers your loop question.
– Stéphane Chazelas
Nov 2 at 17:18
The idea would be to have a list of the snapshots, each with the amount of space freed if that respective snapshot is deleted. This way I would avoid deleting snapshots that don't amount to much anyway, and not have to repeat your original command manually for each snapshot. I believe this is what you have solved with your update?
– A. Donda
Nov 4 at 1:10
The idea would be to have a list of the snapshots, each with the amount of space freed if that respective snapshot is deleted. This way I would avoid deleting snapshots that don't amount to much anyway, and not have to repeat your original command manually for each snapshot. I believe this is what you have solved with your update?
– A. Donda
Nov 4 at 1:10
I ended up creating my own bash script (see new answer), but I learned a lot from yours: I also use find, awk and numfmt. The difference is that I filter out inodes using comm. Thanks again!
– A. Donda
Nov 18 at 1:47
I ended up creating my own bash script (see new answer), but I learned a lot from yours: I also use find, awk and numfmt. The difference is that I filter out inodes using comm. Thanks again!
– A. Donda
Nov 18 at 1:47
|
show 2 more comments
up vote
1
down vote
If your file names don't contain pattern characters or newlines, you can use find
+ du
's exclude feature to do this:
find -links +1 -type f
| cut -d/ -f2-
| du --exclude-from=- -s *
The find
bit gets all the files (-type f
) with a hardlink count greater than 1 (-links +1
). The cut
trims off the leading ./
find prints out. Then du
is asked for disk usage of every directory, excluding all the files with multiple links. Of course, once you delete a snapshot, it's possible there are now files with only one link that previously had two — so every few deletes, you really ought to re-run it.
If it needs to work with arbitrary file names, it'd require some more scripting to replace du
(those are shell patterns, so escaping is not possible).
Also, as Stéphane Chazelas points out, if there are hardlinks inside of one snapshot (all the names of the file reside within a single snapshot, not hardlinks between snapshots), those files will be excluded from the totals (even though deleting the snapshot would recover that space).
This seems to work, thanks, which is why I have upvoted it. However, it takes extremely long, probably due to the manydu
calls. I believe there must be a solution to do this more effectively, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:51
1
@A.Donda there is only onedu
call, but it's passed a potentially very long list of exclude patterns — that might be slowing it down. Curious if puttingLC_ALL=C
in front ofdu
speeds it up (as long as your file names are ASCII). I fear doing this quickly needs a utility that actually tracks all the files.
– derobert
Nov 2 at 17:04
Yes I suspect, too, that it would be necessary to write one's own tool, which basically does the same as du, build a list of files and which inodes they refer to, but them process this list differently.
– A. Donda
Nov 4 at 1:11
I now believe what makes your solution slow is that the list of excluded files is huge (overlap between snapshots is only on the order of 10%). I experimented with a solution which instead explicitly includes files, but for reasons I don't completely understand it wasn't really working. I then decided not to rely on du, but make a tool that creates and modifies file lists itself, see my new answer. Thanks again!
– A. Donda
Nov 18 at 1:44
add a comment |
up vote
1
down vote
If your file names don't contain pattern characters or newlines, you can use find
+ du
's exclude feature to do this:
find -links +1 -type f
| cut -d/ -f2-
| du --exclude-from=- -s *
The find
bit gets all the files (-type f
) with a hardlink count greater than 1 (-links +1
). The cut
trims off the leading ./
find prints out. Then du
is asked for disk usage of every directory, excluding all the files with multiple links. Of course, once you delete a snapshot, it's possible there are now files with only one link that previously had two — so every few deletes, you really ought to re-run it.
If it needs to work with arbitrary file names, it'd require some more scripting to replace du
(those are shell patterns, so escaping is not possible).
Also, as Stéphane Chazelas points out, if there are hardlinks inside of one snapshot (all the names of the file reside within a single snapshot, not hardlinks between snapshots), those files will be excluded from the totals (even though deleting the snapshot would recover that space).
This seems to work, thanks, which is why I have upvoted it. However, it takes extremely long, probably due to the manydu
calls. I believe there must be a solution to do this more effectively, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:51
1
@A.Donda there is only onedu
call, but it's passed a potentially very long list of exclude patterns — that might be slowing it down. Curious if puttingLC_ALL=C
in front ofdu
speeds it up (as long as your file names are ASCII). I fear doing this quickly needs a utility that actually tracks all the files.
– derobert
Nov 2 at 17:04
Yes I suspect, too, that it would be necessary to write one's own tool, which basically does the same as du, build a list of files and which inodes they refer to, but them process this list differently.
– A. Donda
Nov 4 at 1:11
I now believe what makes your solution slow is that the list of excluded files is huge (overlap between snapshots is only on the order of 10%). I experimented with a solution which instead explicitly includes files, but for reasons I don't completely understand it wasn't really working. I then decided not to rely on du, but make a tool that creates and modifies file lists itself, see my new answer. Thanks again!
– A. Donda
Nov 18 at 1:44
add a comment |
up vote
1
down vote
up vote
1
down vote
If your file names don't contain pattern characters or newlines, you can use find
+ du
's exclude feature to do this:
find -links +1 -type f
| cut -d/ -f2-
| du --exclude-from=- -s *
The find
bit gets all the files (-type f
) with a hardlink count greater than 1 (-links +1
). The cut
trims off the leading ./
find prints out. Then du
is asked for disk usage of every directory, excluding all the files with multiple links. Of course, once you delete a snapshot, it's possible there are now files with only one link that previously had two — so every few deletes, you really ought to re-run it.
If it needs to work with arbitrary file names, it'd require some more scripting to replace du
(those are shell patterns, so escaping is not possible).
Also, as Stéphane Chazelas points out, if there are hardlinks inside of one snapshot (all the names of the file reside within a single snapshot, not hardlinks between snapshots), those files will be excluded from the totals (even though deleting the snapshot would recover that space).
If your file names don't contain pattern characters or newlines, you can use find
+ du
's exclude feature to do this:
find -links +1 -type f
| cut -d/ -f2-
| du --exclude-from=- -s *
The find
bit gets all the files (-type f
) with a hardlink count greater than 1 (-links +1
). The cut
trims off the leading ./
find prints out. Then du
is asked for disk usage of every directory, excluding all the files with multiple links. Of course, once you delete a snapshot, it's possible there are now files with only one link that previously had two — so every few deletes, you really ought to re-run it.
If it needs to work with arbitrary file names, it'd require some more scripting to replace du
(those are shell patterns, so escaping is not possible).
Also, as Stéphane Chazelas points out, if there are hardlinks inside of one snapshot (all the names of the file reside within a single snapshot, not hardlinks between snapshots), those files will be excluded from the totals (even though deleting the snapshot would recover that space).
edited Nov 1 at 16:19
answered Oct 31 at 20:33
derobert
70.9k8151210
70.9k8151210
This seems to work, thanks, which is why I have upvoted it. However, it takes extremely long, probably due to the manydu
calls. I believe there must be a solution to do this more effectively, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:51
1
@A.Donda there is only onedu
call, but it's passed a potentially very long list of exclude patterns — that might be slowing it down. Curious if puttingLC_ALL=C
in front ofdu
speeds it up (as long as your file names are ASCII). I fear doing this quickly needs a utility that actually tracks all the files.
– derobert
Nov 2 at 17:04
Yes I suspect, too, that it would be necessary to write one's own tool, which basically does the same as du, build a list of files and which inodes they refer to, but them process this list differently.
– A. Donda
Nov 4 at 1:11
I now believe what makes your solution slow is that the list of excluded files is huge (overlap between snapshots is only on the order of 10%). I experimented with a solution which instead explicitly includes files, but for reasons I don't completely understand it wasn't really working. I then decided not to rely on du, but make a tool that creates and modifies file lists itself, see my new answer. Thanks again!
– A. Donda
Nov 18 at 1:44
add a comment |
This seems to work, thanks, which is why I have upvoted it. However, it takes extremely long, probably due to the manydu
calls. I believe there must be a solution to do this more effectively, which is why I don't accept your answer yet.
– A. Donda
Nov 2 at 16:51
1
@A.Donda there is only onedu
call, but it's passed a potentially very long list of exclude patterns — that might be slowing it down. Curious if puttingLC_ALL=C
in front ofdu
speeds it up (as long as your file names are ASCII). I fear doing this quickly needs a utility that actually tracks all the files.
– derobert
Nov 2 at 17:04
Yes I suspect, too, that it would be necessary to write one's own tool, which basically does the same as du, build a list of files and which inodes they refer to, but them process this list differently.
– A. Donda
Nov 4 at 1:11
I now believe what makes your solution slow is that the list of excluded files is huge (overlap between snapshots is only on the order of 10%). I experimented with a solution which instead explicitly includes files, but for reasons I don't completely understand it wasn't really working. I then decided not to rely on du, but make a tool that creates and modifies file lists itself, see my new answer. Thanks again!
– A. Donda
Nov 18 at 1:44
This seems to work, thanks, which is why I have upvoted it. However, it takes extremely long, probably due to the many
du
calls. I believe there must be a solution to do this more effectively, which is why I don't accept your answer yet.– A. Donda
Nov 2 at 16:51
This seems to work, thanks, which is why I have upvoted it. However, it takes extremely long, probably due to the many
du
calls. I believe there must be a solution to do this more effectively, which is why I don't accept your answer yet.– A. Donda
Nov 2 at 16:51
1
1
@A.Donda there is only one
du
call, but it's passed a potentially very long list of exclude patterns — that might be slowing it down. Curious if putting LC_ALL=C
in front of du
speeds it up (as long as your file names are ASCII). I fear doing this quickly needs a utility that actually tracks all the files.– derobert
Nov 2 at 17:04
@A.Donda there is only one
du
call, but it's passed a potentially very long list of exclude patterns — that might be slowing it down. Curious if putting LC_ALL=C
in front of du
speeds it up (as long as your file names are ASCII). I fear doing this quickly needs a utility that actually tracks all the files.– derobert
Nov 2 at 17:04
Yes I suspect, too, that it would be necessary to write one's own tool, which basically does the same as du, build a list of files and which inodes they refer to, but them process this list differently.
– A. Donda
Nov 4 at 1:11
Yes I suspect, too, that it would be necessary to write one's own tool, which basically does the same as du, build a list of files and which inodes they refer to, but them process this list differently.
– A. Donda
Nov 4 at 1:11
I now believe what makes your solution slow is that the list of excluded files is huge (overlap between snapshots is only on the order of 10%). I experimented with a solution which instead explicitly includes files, but for reasons I don't completely understand it wasn't really working. I then decided not to rely on du, but make a tool that creates and modifies file lists itself, see my new answer. Thanks again!
– A. Donda
Nov 18 at 1:44
I now believe what makes your solution slow is that the list of excluded files is huge (overlap between snapshots is only on the order of 10%). I experimented with a solution which instead explicitly includes files, but for reasons I don't completely understand it wasn't really working. I then decided not to rely on du, but make a tool that creates and modifies file lists itself, see my new answer. Thanks again!
– A. Donda
Nov 18 at 1:44
add a comment |
up vote
0
down vote
Since I wrote this answer, Stéphane Chazelas has convinced me that his answer was right all along. I leave my answer including code because it works well, too, and provides some pretty-printing. Its output looks like this:
total unique
--T---G---M---k---B --T---G---M---k---B
91,044,435,456 665,754,624 back-2018-03-01T06:00:01
91,160,015,360 625,541,632 back-2018-04-01T06:00:01
91,235,970,560 581,360,640 back-2018-05-01T06:00:01
91,474,846,208 897,665,536 back-2018-06-01T06:00:01
91,428,597,760 668,853,760 back-2018-07-01T06:00:01
91,602,767,360 660,594,176 back-2018-08-01T06:00:01
91,062,218,752 1,094,236,160 back-2018-09-01T06:00:01
230,810,647,552 50,314,291,712 back-2018-11-01T06:00:01
220,587,811,328 256,036,352 back-2018-11-12T06:00:01
220,605,425,664 267,876,352 back-2018-11-13T06:00:01
220,608,163,328 268,711,424 back-2018-11-14T06:00:01
220,882,714,112 272,000,000 back-2018-11-15T06:00:01
220,882,118,656 263,202,304 back-2018-11-16T06:00:01
220,882,081,792 263,165,440 back-2018-11-17T06:00:01
220,894,113,280 312,208,896 back-2018-11-18T06:00:01
Since I wasn't 100% happy with either of the two answers (as of 2018-11-18) – though I learned from both of them – I created my own tool and am publishing it here.
Similar to Stéphane Chazelas's answer, it uses find
to obtain a list of inodes and associated file / directory sizes, but doesn't rely on the "at most one link" heuristic. Instead, it creates a list of unique inodes (not files/directories!) for each input directory, filters out the inodes from the other directories, and them sums the remaining inodes' sizes. This way it accounts for possible hardlinks within each input directory. As a side effect, it disregards possible hardlinks from outside of the set of input directories.
bash-external tools that are used: find
, xargs
, mktemp
, sort
, tput
, awk
, tr
, numfmt
, touch
, cat
, comm
, rm
. I know, not exactly lightweight, but it does exactly what I want it to do. I share it here in case someone else has similar needs.
If anything can be done more efficiently or foolproof, comments are welcome! I'm anything but a bash master.
To use it, save the following code to a script file duu.sh
. A short usage instruction is contained in the first comment block.
#!/bin/bash
# duu
#
# disk usage unique to a directory within a set of directories
#
# Call with a list of directory names. If called without arguments,
# it operates on the subdirectories of the current directory.
# no arguments: call itself with subdirectories of .
if [ "$#" -eq 0 ]
then
exec find . -maxdepth 1 -type d ! -name . -printf '%P' | sort -z
| xargs -r --null "$0"
exit
fi
# create temporary directory
T=`mktemp -d`
# array of directory names
dirs=("$@")
# number of directories
n="$#"
# for each directory, create list of (unique) inodes with size
for i in $(seq 1 $n)
do
echo -n "reading $i/$n: ${dirs[$i - 1]} "
find "${dirs[$i - 1]}" -printf "%it%bn" | sort -u > "$T/$i"
# find %b: "The amount of disk space used for this file in 512-byte blocks."
echo -ne "r"
tput el
done
# print header
echo " total unique"
echo "--T---G---M---k---B --T---G---M---k---B"
# for each directory
for i in $(seq 1 $n)
do
# compute and print total size
# sum block sizes and multiply by 512
awk '{s += $2} END{printf "%.0f", s * 512}' "$T/$i"
| tr -d 'n'
| numfmt --grouping --padding 19
echo -n " "
# compute and print unique size
# create list of (unique) inodes in the other directories
touch "$T/o$i"
for j in $(seq 1 $n)
do
if [ "$j" -ne "$i" ]
then
cat "$T/$j" >> "$T/o$i"
fi
done
sort -o "$T/o$i" -u "$T/o$i"
# create list of (unique) inodes that are in this but not in the other directories
comm -23 "$T/$i" "$T/o$i" > "$T/u$i"
# sum block sizes and multiply by 512
awk '{s += $2} END{printf "%.0f", s * 512}' "$T/u$i"
| tr -d 'n'
| numfmt --grouping --padding 19
# append directory name
echo " ${dirs[$i - 1]}"
done
# remove temporary files
rm -rf "$T"
I think you misunderstood my answer. It doesn't rely on the "at most one link" heuristic. It counts the disk usage of inodes that would be deleted if the directory was deleted, of all the files whose all links are found in the current directory.
– Stéphane Chazelas
Nov 18 at 9:00
@StéphaneChazelas, it's quite possible that I didn't understand your answer, and maybe it does exactly the right thing. If so, I would like to accept it. Could you explain your code in more detail?
– A. Donda
Nov 18 at 14:58
See edit of my answer. Does it make it any clearer?
– Stéphane Chazelas
Nov 18 at 16:49
add a comment |
up vote
0
down vote
Since I wrote this answer, Stéphane Chazelas has convinced me that his answer was right all along. I leave my answer including code because it works well, too, and provides some pretty-printing. Its output looks like this:
total unique
--T---G---M---k---B --T---G---M---k---B
91,044,435,456 665,754,624 back-2018-03-01T06:00:01
91,160,015,360 625,541,632 back-2018-04-01T06:00:01
91,235,970,560 581,360,640 back-2018-05-01T06:00:01
91,474,846,208 897,665,536 back-2018-06-01T06:00:01
91,428,597,760 668,853,760 back-2018-07-01T06:00:01
91,602,767,360 660,594,176 back-2018-08-01T06:00:01
91,062,218,752 1,094,236,160 back-2018-09-01T06:00:01
230,810,647,552 50,314,291,712 back-2018-11-01T06:00:01
220,587,811,328 256,036,352 back-2018-11-12T06:00:01
220,605,425,664 267,876,352 back-2018-11-13T06:00:01
220,608,163,328 268,711,424 back-2018-11-14T06:00:01
220,882,714,112 272,000,000 back-2018-11-15T06:00:01
220,882,118,656 263,202,304 back-2018-11-16T06:00:01
220,882,081,792 263,165,440 back-2018-11-17T06:00:01
220,894,113,280 312,208,896 back-2018-11-18T06:00:01
Since I wasn't 100% happy with either of the two answers (as of 2018-11-18) – though I learned from both of them – I created my own tool and am publishing it here.
Similar to Stéphane Chazelas's answer, it uses find
to obtain a list of inodes and associated file / directory sizes, but doesn't rely on the "at most one link" heuristic. Instead, it creates a list of unique inodes (not files/directories!) for each input directory, filters out the inodes from the other directories, and them sums the remaining inodes' sizes. This way it accounts for possible hardlinks within each input directory. As a side effect, it disregards possible hardlinks from outside of the set of input directories.
bash-external tools that are used: find
, xargs
, mktemp
, sort
, tput
, awk
, tr
, numfmt
, touch
, cat
, comm
, rm
. I know, not exactly lightweight, but it does exactly what I want it to do. I share it here in case someone else has similar needs.
If anything can be done more efficiently or foolproof, comments are welcome! I'm anything but a bash master.
To use it, save the following code to a script file duu.sh
. A short usage instruction is contained in the first comment block.
#!/bin/bash
# duu
#
# disk usage unique to a directory within a set of directories
#
# Call with a list of directory names. If called without arguments,
# it operates on the subdirectories of the current directory.
# no arguments: call itself with subdirectories of .
if [ "$#" -eq 0 ]
then
exec find . -maxdepth 1 -type d ! -name . -printf '%P' | sort -z
| xargs -r --null "$0"
exit
fi
# create temporary directory
T=`mktemp -d`
# array of directory names
dirs=("$@")
# number of directories
n="$#"
# for each directory, create list of (unique) inodes with size
for i in $(seq 1 $n)
do
echo -n "reading $i/$n: ${dirs[$i - 1]} "
find "${dirs[$i - 1]}" -printf "%it%bn" | sort -u > "$T/$i"
# find %b: "The amount of disk space used for this file in 512-byte blocks."
echo -ne "r"
tput el
done
# print header
echo " total unique"
echo "--T---G---M---k---B --T---G---M---k---B"
# for each directory
for i in $(seq 1 $n)
do
# compute and print total size
# sum block sizes and multiply by 512
awk '{s += $2} END{printf "%.0f", s * 512}' "$T/$i"
| tr -d 'n'
| numfmt --grouping --padding 19
echo -n " "
# compute and print unique size
# create list of (unique) inodes in the other directories
touch "$T/o$i"
for j in $(seq 1 $n)
do
if [ "$j" -ne "$i" ]
then
cat "$T/$j" >> "$T/o$i"
fi
done
sort -o "$T/o$i" -u "$T/o$i"
# create list of (unique) inodes that are in this but not in the other directories
comm -23 "$T/$i" "$T/o$i" > "$T/u$i"
# sum block sizes and multiply by 512
awk '{s += $2} END{printf "%.0f", s * 512}' "$T/u$i"
| tr -d 'n'
| numfmt --grouping --padding 19
# append directory name
echo " ${dirs[$i - 1]}"
done
# remove temporary files
rm -rf "$T"
edited Nov 18 at 17:53
answered Nov 18 at 1:33
A. Donda
I think you misunderstood my answer. It doesn't rely on the "at most one link" heuristic. It counts the disk usage of the inodes that would be deleted if the directory were deleted, that is, of all files whose links are all found within that directory.
– Stéphane Chazelas
Nov 18 at 9:00
@StéphaneChazelas, it's quite possible that I didn't understand your answer, and maybe it does exactly the right thing. If so, I would like to accept it. Could you explain your code in more detail?
– A. Donda
Nov 18 at 14:58
See edit of my answer. Does it make it any clearer?
– Stéphane Chazelas
Nov 18 at 16:49
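For readers following that exchange, here is one way the idea from the first comment could be expressed. It is a rough illustration only, not Stéphane Chazelas's actual answer code; GNU find is assumed and DIR is a placeholder.
# For a single directory DIR: sum the blocks of every inode whose total link
# count (%n) equals the number of links to it found under DIR, i.e. inodes
# that would really be freed by deleting DIR.
find DIR -type f -printf '%i\t%n\t%b\n' \
    | awk -F'\t' '
        { seen[$1]++; links[$1] = $2; blocks[$1] = $3 }
        END {
            for (i in seen)
                if (seen[i] == links[i])
                    freed += blocks[i] * 512
            printf "%.0f bytes freed by deleting this directory\n", freed
        }'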