(Click me to see it.)

- What is `hoardy`?
- What can `hoardy` do?
- On honesty in reporting of data loss issues
- Glossary
- Quickstart
- Quirks and Bugs
- Frequently Asked Questions
  - I'm using `fdupes`/`jdupes` now, how do I migrate to using `hoardy`?
  - I have two identical files, but `hoardy deduplicate` does not deduplicate them. Why?
  - What would happen if I run `hoardy deduplicate` with an outdated index? Would `hoardy` lose some of my files by wrongly "deduplicating" them?
  - I have two files with equal `SHA256` hash digests and `size`s, and yet they are unequal when compared as binary strings. Would `hoardy` "deduplicate" them wrongly?
  - What would happen if I run `hoardy deduplicate --delete` with the same directory given in two different arguments? Would it consider those files to be equivalent to themselves and delete them, losing all my data?
  - But what if I give the same directory to `hoardy deduplicate --delete` twice, not as equivalent paths, but by giving one of them as a symlink into an ancestor of the other, followed by their common suffix? Will it lose my data now?
  - Alright, but what if I `mount --bind` a directory to another directory, then `hoardy index` and run `hoardy deduplicate --delete` on both? The cloned directory will appear to be exactly the same as the original directory, but paths would be different, and there would be no symlinks involved. So `hoardy deduplicate --delete` would then detect them as duplicates and would need to delete all files from one of them. But deleting a file from one will also delete it from the other! Ha! Finally! Surely, it would lose my data now?!
  - Hmm, but the `hoardy deduplicate` implementation looks rather complex. What if a bug there causes it to "deduplicate" some files that are not actually duplicates and lose data?
- Why does `hoardy` exist?
- Development history
- Alternatives
- Meta
- Usage
- Development: `./test-hoardy.sh [--help] [--wine] [--fast] [default] [(NAME|PATH)]*`
`hoardy` is a tool for digital data hoarding, a Swiss-army-knife-like utility for managing otherwise unmanageable piles of files.

On GNU/Linux, `hoardy` is pretty well-tested on my files and I find it to be an essentially irreplaceable tool for managing duplicated files in related source code trees, media files duplicated between my home directory, `git-annex`, and `hydrus` file object stores, as well as backup snapshots made with `rsync` and `rsnapshot`.

On Windows, however, `hoardy` is a work-in-progress, essentially unusable alpha that is completely untested.

Data formats and command-line syntax of `hoardy` are subject to change in future versions.
See below for why.
`hoardy` can

- record hashes and metadata of separate files and/or whole filesystem trees/hierarchies/directories, recursively, in `SQLite` databases;
  both one big database and/or many small ones are supported;

- update those records incrementally by adding new filesystem trees and/or re-indexing previously added ones;
  it can also re-`index` filesystem hierarchies much faster if files in its input directories only ever get added or removed, but their contents never change, which is common with backup directories (see `hoardy index --no-update`);

- find duplicated files matching specified criteria, and then
  - display them,
  - replace some of the duplicated files with hardlinks to others, or
  - delete some of the duplicated files;

  similarly to what `fdupes` and `jdupes` do, but `hoardy` won't lose your files, won't lose extended file attributes, won't leave your filesystem in an inconsistent state in case of power failure, is much faster on large inputs, can be used even if you have more files than you have RAM to store their metadata, can be run incrementally without degrading the quality of results, ...;

- verify actual filesystem contents against file metadata and/or hashes previously recorded in its databases;
  which is similar to what `RHash` can do, but `hoardy` is faster on large databases of file records, can verify file metadata, and is slightly more convenient to use; at the moment, however, `hoardy` only computes and checks `SHA256` hash digests and nothing else.

See the "Alternatives" section for more info.
This document mentions data loss, and situations when it could occur, repeatedly. I realize that this may turn some people off. Unfortunately, the reality is that with modern computing it's quite easy to screw things up. If a tool can delete or overwrite data, it can lose data. Hence, make backups!

With that said, `hoardy` tries its very best to make situations where it causes data loss impossible by doing a ton of paranoid checks before doing anything destructive.
Unfortunately, the set of situations where it could lose some data even after doing all those checks is not empty.
Which is why the "Quirks and Bugs" section documents all of those situations known to me.
(So... Make backups!)

Meanwhile, "Frequently Asked Questions", among other things, documents various cases that are handled safely.
Most of those are quite non-obvious and not recognized by other tools, which will lose your data where `hoardy` would not.

As far as I know, `hoardy` is actually the safest tool for doing what it does, but this document mentions data loss repeatedly, while other tools prefer to be quiet about it.
I've read the sources of `hoardy`'s alternatives to make the comparisons there, and to figure out if I maybe should change how `hoardy` does some things, and I became much happier with `hoardy`'s internals as a result.
Just saying.

Also, should I ever find an issue in `hoardy` that produces loss of data, I commit to fixing it and honestly documenting it all immediately, and then adding new tests to the test suite to prevent such issues in the future.
A promise that can be confirmed by the fact that I did such a thing before for the `hoardy-web` tool, see its `tool-v0.18.1` release.
- An inode is a physical unnamed file.
  Directories reference inodes, giving them names.
  Different directories, or different names in the same directory, can refer to the same inode, making that file available under different names.
  Editing such a file under one name will change its contents under all the other names too.

- `nlinks` is the number of times an inode is referenced by all the directories on a filesystem.

See `man 7 inode` for more info, and the short demonstration below.
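For illustration, a quick hands-on version of the above (the file names are arbitrary):

```
cd "$(mktemp -d)"
echo hello > a
ln a b                    # "b" now names the same inode as "a"
stat -c '%i %h %n' a b    # same inode number; nlinks is 2 for both
echo world >> a
cat b                     # the contents changed under the other name too
```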
- Install Python 3:

  - On a conventional POSIX system like most GNU/Linux distros and MacOS X: install `python3` via your package manager. Realistically, it probably is installed already.

- On a POSIX system: open a terminal, install this with

  ```
  pip install hoardy
  ```

  and run as

  ```
  hoardy --help
  ```

- Alternatively, for light development (without development tools, for those see `nix-shell` below): open a terminal/`cmd.exe`, `cd` into this directory, then install with

  ```
  python -m pip install -e .
  # or
  pip install -e .
  ```

  and run as:

  ```
  python -m hoardy --help
  # or
  hoardy --help
  ```

- Alternatively, on a system with the Nix package manager:

  ```
  nix-env -i -f ./default.nix
  hoardy --help
  ```

  Though, in this case, you'll probably want to run the first command from the parent directory, to install everything all at once.

- Alternatively, to replicate my development environment:

  ```
  nix-shell ./default.nix --arg developer true
  ```
So, as the simplest use case, deduplicate your `~/Downloads` directory.

Index your `~/Downloads` directory:

```
hoardy index ~/Downloads
```

Look at the list of duplicated files there:

```
hoardy find-dupes ~/Downloads
```

Deduplicate them by hardlinking each duplicate file to its oldest available duplicate version, i.e. make all paths pointing to duplicate files point to the oldest available inode among those duplicates:

```
hoardy deduplicate --hardlink ~/Downloads
# or, equivalently
hoardy deduplicate ~/Downloads
```

The following should produce an empty output now:

```
hoardy find-dupes ~/Downloads
```

If it does not (which is unlikely for `~/Downloads`), then some duplicates have different metadata (permissions, owner, group, extended attributes, etc), which will be discussed below.

By default, both `deduplicate --hardlink` and `find-dupes` run with an implied `--min-inodes 2` option.
Thus, to see paths that point to the same inodes on disk you'll need to run the following instead:

```
hoardy find-dupes --min-inodes 1 ~/Downloads
```

To delete all but the oldest file among duplicates in a given directory, run

```
hoardy deduplicate --delete ~/Downloads
```

in which case `--min-inodes 1` is implied by default.
That result could, of course, have been achieved by running this last command directly, without doing any of the above except for `index`.

Personally, I have

```
hoardy index ~/Downloads && hoardy deduplicate --delete ~/Downloads
```

scheduled in my daily `crontab`, because I frequently re-download files from local servers while developing things (for testing).
Normally, you probably don't need to run it that often.
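If you want the same, a crontab entry along these lines should do it (a sketch; the schedule, paths, and the assumption that `hoardy` is on cron's `PATH` are all up to you):

```
# append a daily 04:30 job to the current user's crontab
( crontab -l 2>/dev/null; echo '30 4 * * * hoardy index ~/Downloads && hoardy deduplicate --delete ~/Downloads' ) | crontab -
```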
Assuming you have a bunch of directories that were produced by something like

```
rsync -aHAXivRyy --link-dest=/backup/yesterday /home /backup/today
```

you can deduplicate them by running

```
hoardy index /backup
hoardy deduplicate /backup
```

(Which will probably take a while.)

Doing this will deduplicate everything by hardlinking each duplicate file to an inode with the oldest `mtime`, while respecting and preserving all file permissions, owners, groups, and `user` extended attributes.
If you run it as super-user it will also respect all other extended attribute name-spaces, like ACLs, trusted extended attributes, etc.
See `man 7 xattr` for more info.

But, depending on your setup and wishes, the above might not be what you'd want to run. For instance, personally, I run

```
hoardy index /backup
hoardy deduplicate --reverse --ignore-meta /backup
```

instead.

Doing this hardlinks each duplicate file to an inode with the latest `mtime` (`--reverse`) and ignores all file metadata (but not extended attributes), so that the next

```
rsync -aHAXivRyy --link-dest=/backup/today /home /backup/tomorrow
```

could re-use those inodes via `--link-dest` as much as possible again.
Without those options the next `rsync --link-dest` would instead re-create many of those inodes, which is not what I want, but your mileage may vary.

Also, even with `--reverse`, the original `mtime` of each path will be kept in `hoardy`'s database so that it can be restored later.
(Which is pretty cool, right?)
Also, if you have so many files under `/backup` that `deduplicate` does not fit into RAM, you can still run it incrementally (while producing the same deduplicated result) by sharding on `SHA256` hash digest, as sketched below.
See the examples below for more info.
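For instance, using the `--shard` option described in the "Usage" section, a sharded run could look something like this (the shard count is arbitrary):

```
hoardy index /backup
# process a quarter of the duplicate groups per run;
# each run only needs to hold ~1/4 of the file metadata in RAM
hoardy deduplicate --shard 1/4 /backup
hoardy deduplicate --shard 2/4 /backup
hoardy deduplicate --shard 3/4 /backup
hoardy deduplicate --shard 4/4 /backup
```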
Note, however, that simply running `hoardy deduplicate` on your whole `$HOME` directory will probably break almost everything, as many programs depend on file timestamps not moving backwards, use zero-length or similarly short files for various things, overwrite files without copying them first, and expect them to stay independent inodes.
Hardlinking different same-data files together on a non-backup filesystem will break all those assumptions.

(If you do screw it up, you can fix it by simply doing `cp -a file file.copy ; mv file.copy file` for each wrongly deduplicated file, e.g. with a loop like the one below.)
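A minimal un-hardlinking loop, assuming the affected paths are listed one per line in a hypothetical `broken.txt`:

```
while IFS= read -r f; do
    # re-create the file as its own independent inode
    cp -a "$f" "$f.copy" && mv "$f.copy" "$f"
done < broken.txt
```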
However, sometimes deduplicating some files under `$HOME` can be quite useful, so `hoardy` implements a fairly safe way to do it semi-automatically.

Index your home directory and generate a list of all duplicated files, matched strictly, like `deduplicate` would do:

```
hoardy index ~
hoardy find-dupes --print0 --match-meta ~ > dupes.print0
```

`--print0` is needed here because otherwise file names with newlines and/or weird symbols in them could be parsed as multiple separate paths and/or get mangled.
By default, without `--print0`, `hoardy` solves this by escaping control characters in its outputs, and, in theory, it could then read back its own outputs in that format.
But normal UNIX tools won't be able to use them, hence `--print0`, which is almost universally supported.

You can then easily view the resulting file from a terminal with:

```
cat dupes.print0 | tr '\0' '\n' | less
```

which, if none of the paths have control symbols in them, will be equivalent to the output of:

```
hoardy find-dupes --match-meta ~ | less
```

But you can now use `grep` or another similar tool to filter those outputs.

Say, for example, you want to deduplicate `git` objects across different repositories:

```
grep -zP '/\.git/objects/([0-9a-f]{2}|pack)/' dupes.print0 > git-objects.print0
cat git-objects.print0 | tr '\0' '\n' | less
```

These are never modified, and so they can be hardlinked together.
In fact, `git` does this silently when it notices, so you might not get a lot of duplicates there, especially if you mostly clone local repositories from each other.
But if you have several related repositories cloned from external sources at `$HOME`, the above output, most likely, will not be empty.

So, you can now pretend to deduplicate all of those files:

```
hoardy deduplicate --dry-run --stdin0 < git-objects.print0
```

and then actually do it:

```
hoardy deduplicate --stdin0 < git-objects.print0
```

Ta-da! More disk space! For free!

Of course, the above probably won't have deduplicated much.
However, if you use `npm` a lot, then your filesystem is probably chock-full of `node_modules` directories full of files that can be deduplicated.
In fact, the `pnpm` tool does this automatically when installing new stuff, but it won't help with previously installed stuff.
Whereas `hoardy` can help:

```
grep -zF '/node_modules/' dupes.print0 > node_modules.print0
cat node_modules.print0 | tr '\0' '\n' | less
hoardy deduplicate --stdin0 < node_modules.print0
```

Doing this could save quite a bit of space, since `nodejs` packages tend to duplicate everything dozens of times.
... and then duplicate them on-demand while editing.
Personally, I use `git worktree`s a lot.
That is, usually, I clone a repo, make a feature branch, check it out into a separate worktree, and work on it there:

```
git clone --origin upstream url/to/repo repo
cd repo
git branch feature-branch
git worktree add feature feature-branch
cd feature
# now working on a feature-branch
# ....
```

Meanwhile, in another TTY, I check out successive testable revisions and test them in a separate `nix-shell` session:

```
cd ~/src/repo
hash=$(cd ~/src/repo/feature; git rev-parse HEAD)
git worktree add testing $hash
cd testing
nix-shell ./default.nix
# run long-running tests here

# when feature-branch updated lots
hash=$(cd ~/src/repo/feature; git rev-parse HEAD)
git checkout $hash
# again, run long-running tests here
```

which allows me to continue working on `feature-branch` without interruptions while the tests are being run on a frozen worktree, which eliminates a whole class of testing errors.
With a bit of conscientiousness, it also allows me to compare `feature-branch` to the latest revision that passed all the tests very easily.
Now, this workflow costs almost nothing for small projects, but for Nixpkgs, Firefox, or the Linux kernel each worktree checkout takes quite a bit of space. If you have dozens of feature branches, then space usage can get quite horrifying.

But `hoardy` and `Emacs` can help!

`Emacs` with the `break-hardlink-on-save` variable set to `t` (`M-x customize-variable break-hardlink-on-save`) will always re-create and then `rename` files when writing buffers to disk, always breaking hardlinks.
I.e., with it enabled, `Emacs` won't be overwriting any files in-place, ever.
This has safety advantages, e.g. a power loss won't lose your data even if your `Emacs` happened to be writing out a huge `org-mode` file to disk at that moment.
Which is nice.

But enabling that option also allows you to simply `hoardy deduplicate` all source files on your filesystem without care.
That is, I have the above variable set in my `Emacs` config, I run

```
hoardy index ~/src/nixpkgs/* ~/src/firefox/* ~/src/linux/*
hoardy deduplicate ~/src/nixpkgs/* ~/src/firefox/* ~/src/linux/*
```

periodically, and let my `Emacs` duplicate the files I actually touch, on-demand.

For `Vim`, the docs say the following setting in `.vimrc` should produce the same effect:

```
set backupcopy=no,breakhardlink
```

but I tried it, and it does not work.

(You can try it yourself:

```
cd /tmp
echo test > test-file
ln test-file test-file2
vim test-file2
# edit it
# :wq
ls -l test-file test-file2
```

The files should be different, but on my system they stay hardlinked.)
- `hoardy` databases take up quite a bit of space.

  This will be fixed with database format `v4`, which will store file trees instead of plain file tables indexed by paths.

- When a previously indexed file or directory can't be accessed due to file modes/permissions, `hoardy index` will remove it from the database.

  This is a design issue with the current scanning algorithm, which will be solved after database format `v4`.

  At the moment, it can be alleviated by running `hoardy index` with the `--no-remove` option.

- By default, `hoardy index` requires its input files to live on a filesystem which either has persistent inode numbers or reports all inode numbers as zeros.

  I.e., by default, `index`ing files from a filesystem like `unionfs` or `sshfs`, which use dynamic inode numbers, will produce broken index records.

  Filesystems like that can still be indexed with the `--no-ino` option set, but there's no auto-detection for this option at the moment.
  Though, brokenly `index`ed trees can be fixed by simply re-`index`ing with `--no-ino` set.

- When `hoardy` is running, mounting a new filesystem into a directory given as one of its `INPUT`s could break some things in unpredictable ways, making `hoardy` report random files as having broken metadata.

  No data loss should occur in this case while `deduplicate` is running, but the outputs of `find-duplicates` could become useless.
- Files changing at inconvenient times while `hoardy` is running could make it lose either the old or the updated version of each such file.

  Consider this:

  - `hoardy deduplicate` (`--hardlink` or `--delete`) discovers `source` and `target` files to be potential duplicates,
  - checks `source` and `target` files to have equal contents,
  - checks their file metadata, which matches its database state,
  - "Okay!", it thinks, "Let's deduplicate them!"
  - but the OS puts `hoardy` to sleep doing its multi-tasking thing,
  - another program sneaks in and sneakily updates `source` or `target`,
  - the OS wakes `hoardy` up,
  - `hoardy` proceeds to deduplicate them, losing one of them.

  `hoardy` calls `lstat` just before each file is `--hardlink`ed or `--delete`d, so this situation is quite unlikely and will be detected with very high probability, but it's not impossible.

  If it does happen, `hoardy` running with default settings will lose the updated version of the file, unless the `--reverse` option is set, in which case it will lose the oldest one instead.

  I know of no good solution to fix this. As far as I know, all alternatives suffer from the same issue.

  Technically, on Linux, there's a partial workaround for this via the `renameat2` syscall with the `RENAME_EXCHANGE` flag, which is unused by both `hoardy` and all similar tools at the moment, AFAICS.
  On Windows, AFAIK, there's no way around this issue at all.

  Thus, you should not `deduplicate` directories with files that change.
- `hoardy find-dupes` usually produces the same results as `jdupes --recurse --zeromatch --order time`.
- `hoardy deduplicate --hardlink` is a replacement for `jdupes --recurse --zeromatch --permissions --order time --linkhard --noprompt`.
- `hoardy deduplicate --delete` is a replacement for `jdupes --recurse --zeromatch --permissions --order time --hardlinks --delete --noprompt`.
By default, files must match in everything but timestamps for `hoardy deduplicate` to consider them duplicates.
In comparison, `hoardy find-duplicates` considers everything with equal `SHA256` hash digests and `size`s to be duplicates instead.

It works this way because `hoardy find-duplicates` is designed to inform you of all the potential things you could deduplicate, while `hoardy deduplicate` is designed to preserve all metadata by default (`hoardy deduplicate --hardlink` also preserves the original file `mtime` in the database, so it can be restored later).

If things like file permissions, owners, and groups are not relevant to you, you can run

```
hoardy deduplicate --ignore-meta path/to/file1 path/to/file2
```

to deduplicate files that mismatch in those metadata fields.
(If you want to control this more precisely, see `deduplicate`'s options.)
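For example, judging by the options listed under "Usage" below, ignoring only ownership while still requiring permissions to match would presumably look like:

```
hoardy deduplicate --ignore-owner --ignore-group path/to/file1 path/to/file2
```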
If even that does not deduplicate your files, and they are actually equal as binary strings, then their extended file attributes must differ. At the moment, if you are feeling paranoid, you will need to manually do something like

```
# dump them all
getfattr --match '.*' --dump path/to/file1 path/to/file2 > attrs.txt
# edit the result so that records of both files match
$EDITOR attrs.txt
# write them back
setfattr --restore=attrs.txt
```

after which `hoardy deduplicate --ignore-meta` would deduplicate them (if they are indeed duplicates).

(Auto-merging of extended attributes, when possible, is on the "TODO" list.)
What would happen if I run `hoardy deduplicate` with an outdated index? Would `hoardy` lose some of my files by wrongly "deduplicating" them?

No, it would not.

`hoardy` checks that each soon-to-be-`deduplicate`d file from its index matches its filesystem counterpart, printing an error and skipping that file and all its apparent duplicates if not.

I have two files with equal `SHA256` hash digests and `size`s, and yet they are unequal when compared as binary strings. Would `hoardy` "deduplicate" them wrongly?

No, it would not.

`hoardy` checks that source and target inodes have equal data contents before hardlinking them.
What would happen if I run `hoardy deduplicate --delete` with the same directory given in two different arguments? Would it consider those files to be equivalent to themselves and delete them, losing all my data?

Nope, `hoardy` will notice the same path being processed twice and ignore the second occurrence, printing a warning.

But what if I give the same directory to `hoardy deduplicate --delete` twice, not as equivalent paths, but by giving one of them as a symlink into an ancestor of the other, followed by their common suffix? Will it lose my data now?

Nope, `hoardy` will detect this too by resolving all of its inputs first.

Alright, but what if I `mount --bind` a directory to another directory, then `hoardy index` and run `hoardy deduplicate --delete` on both? The cloned directory will appear to be exactly the same as the original directory, but paths would be different, and there would be no symlinks involved. So `hoardy deduplicate --delete` would then detect them as duplicates and would need to delete all files from one of them. But deleting a file from one will also delete it from the other! Ha! Finally! Surely, it would lose my data now?!

Nope, `hoardy` will detect this and skip all such files too.

Before acting, `hoardy deduplicate` checks that if `source` and `target` point to the same file on the same device, then its `nlinks` is not `1`.
If both `source` and `target` point to the same last copy of a file, it will not be acted upon.

Note that `hoardy` does this check not only in `--delete` mode, but also in `--hardlink` mode, since re-linking such files would simply produce useless `link`+`rename` churn and disk IO.

Actually, if you think about it, this check catches all other possible issues of the "removing the last copy of a file when we should not" kind, so all other similar "What if" questions can be answered with "in the worst case, it will be caught by that magic check and at least one copy of the file will persist". And that's the end of that.

As far as I know, `hoardy` is the only tool in existence that handles this properly.
Probably because I'm rare in that I like using `mount --bind`s at `$HOME`.
(They are useful in places where you'd normally want to hardlink directories, but can't, because POSIX disallows it.
For instance, the `vendor/kisstdlib` directory here is a `mount --bind` on my system, so that I can ensure all my projects work with its latest version without fiddling with `git`.)
And so I want `hoardy` to work even while they are all mounted.
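For the curious, plain `stat` shows why that check suffices: under a bind mount, the "two" files report the same device and inode numbers, and a file with no other hardlinks shows `nlinks` of `1`, i.e. deleting either path would remove the last copy. A hypothetical demonstration (`mount --bind` needs root):

```
mkdir -p /tmp/orig /tmp/clone
echo data > /tmp/orig/file
sudo mount --bind /tmp/orig /tmp/clone
# both lines print the same dev and ino, with nlinks=1
stat -c 'dev=%d ino=%i nlinks=%h %n' /tmp/orig/file /tmp/clone/file
sudo umount /tmp/clone
```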
Hmm, but the `hoardy deduplicate` implementation looks rather complex. What if a bug there causes it to "deduplicate" some files that are not actually duplicates and lose data?

Firstly, a healthy habit to have is to simply not trust any one tool not to lose your data: make a backup (including of your backups) before running `hoardy deduplicate` for the first time.
(E.g., if you are feeling very paranoid, you can run `rsync -aHAXiv --link-dest=source source copy` to make a hardlink-copy, or `cp -a --reflink=always source copy` to make a reflink-copy first.
On a modern filesystem these cost very little.
And you can later remove them to save the space used by inodes, e.g. after you have `hoardy verify`ed that nothing is broken.)

Secondly, I'm pretty sure it works fine, as `hoardy` has quite a comprehensive test suite for this and is rather well-tested on my backups.

Thirdly, the actual body of `hoardy deduplicate` is written in a rather paranoid way, re-verifying all assumptions before attempting to do anything.

Fourthly, by default, `hoardy deduplicate` runs with the `--paranoid` option enabled, which checks that source and target have equal contents before doing anything to a pair of supposedly duplicate files, and emits errors if they do not.
This could be awfully inefficient, true, but in practice it usually does not matter, as on a reasonably powerful machine with those files living on an HDD the resulting content re-checks get eaten by IO latency anyway.
Meanwhile, `--paranoid` prevents data loss even if the rest of the code is completely broken.

With `--no-paranoid` it still checks file content equality, but once per every new inode, not for each pair of paths.
Eventually `--no-paranoid` will probably become the default (when I stop editing all that code and fearing I will accidentally break something).

Which, by the way, is the reason why `hoardy deduplicate` looks rather complex.
All those checks are not free.

So, since I'm using this tool extensively myself on my backups, which I very much don't want to later restore from their cold backups, I'm pretty paranoid about ensuring it does not lose any data. It should be fine.

That is, I've been using `hoardy` to deduplicate files inside my backup directories, which contain billions of files spanning decades, since at least 2020.
So far, for me, bugs in `hoardy` have caused zero data loss.
Originally, I made `hoardy` as a replacement for its alternatives so that I could:

- Find files by hash, because I wanted to easily open content-addressed links in my org-mode files.

- Efficiently deduplicate files between different backups produced by `rsync`/`rsnapshot`:

  ```
  rsync -aHAXivRyy --link-dest=/backup/yesterday /home /backup/today
  ```

  since `rsync` does not handle file movements and renames very well, even with repeated `--fuzzy/-y` (see its `man` page for more info).

- Efficiently deduplicate per-app backups produced by `hoardy-adb`:

  ```
  hoardy-adb split backup.ab
  ```

- Efficiently deduplicate files between all of the above and `.git/objects` of related repositories, `.git/annex/objects` produced by `git-annex`, `.local/share/hydrus/files` produced by `hydrus`, and similar, in cases where they all live on the same filesystem.

  The issue here is that `git-annex`, `hydrus`, and similar tools copy files into their object stores, even when the files you feed them are read-only and could be hardlinked instead. Which, usually, is a good thing preventing catastrophic consequences of user errors. But I never edit read-only files, I do backups of backups, and, in general, I know what I'm doing, thank you very much, so I'd like to save my disk space instead, please.
"But ZFS
/BTRFS
solves this!" I hear you say?
Well, sure, such filesystems can deduplicate data blocks between different files (though, usually, you have to make a special effort to archive this as, by default, they do not), but how much space gets wasted to store the inodes?
Let's be generous and say an average inode takes 256 bytes (on a modern filesystems it's usually 512 bytes or more, which, by the way, is usually a good thing, since it allows small files to be stored much more efficiently by inlining them into the inode itself, but this is awful for efficient storage of backups).
My home directory has ~10M files in it (most of those are emails and files in source repositories, and this is the minimum I use all the time, I have a bunch more stuff on external drives, but it does not fit onto my SSD), thus a year of naively taken daily rsync
-backups would waste (256 * 10**7 * 365) / (1024 ** 3) = 870.22
GiB in inodes alone.
Sure, rsync --link-dest
will save a bunch of that space, but if you move a bunch of files, they'll get duplicated.
In practice, the last time I deduplicated a never-before touched pristine rsnapshot
hierarchy containing backups of my $HOME
it saved me 1.1 TiB of space.
Don't you think you would find a better use for 1.1TiB of additional space than storing useless inodes?
Well, I did.
"But fdupes
and its forks solve this!" I hear you say?
Well, sure, but the experience of using them in the above use cases of deduplicating mostly-read-only files is quite miserable.
See the "Alternatives" section for discussion.
Also, I wanted to store the oldest known `mtime` for each individual path, even when `deduplicate`-hardlinking all the copies, so that the exact original filesystem tree could be re-created from the backup when needed.
AFAIK, `hoardy` is the only tool that does this.
Yes, this feature is somewhat less useful on modern filesystems which support `reflink`s (Copy-on-Write lightweight copies), but even there, a `reflink` takes a whole inode, while storing an `mtime` in a database takes `<= 8` bytes.

Also, in general, indexing, search, duplicate discovery, set operations, send-receive between remote nodes, and application-defined storage APIs (like `HTTP`/`WebDAV`/`FUSE`/`SFTP`) can be combined to produce many useful functions.
It's annoying that there appears to be no tool that can do all of those things on top of a plain file hierarchy.
All such tools known to me first slurp all the files into their own object stores, and usually store those files rather less efficiently than I would prefer, which is annoying.

See the "Wishlist" for more info.
This version of `hoardy` is a minimal valuable version of my privately developed tool (referred to as the "bootstrap version" in commit messages), taken at its version circa 2020, cleaned up, rebased on top of `kisstdlib`, slightly polished, and documented for public display and consumption.

The private version has more features and uses a much more space-efficient database format, but most of those cool new features are unfinished and kind of buggy, so I was actually mostly using the naive-database-formatted bootstrap version in production.
So, I decided to finish generalizing the infrastructure stuff to `kisstdlib` first, chop away everything related to the `v4` on-disk format and later, and then publish this part first.
(Which still took me two months of work. Ridiculous!)

The rest is currently a work in progress.

If you'd like the planned features from the "TODO" list and the "Wishlist" to be implemented, sponsor them. I suck at multi-tasking, and I need to eat; time spent procuring sustenance money takes away huge chunks of time I could be working on this and other related projects.
`fdupes` is the original file deduplication tool.
It walks given input directories, hashes all files, groups them into potential duplicate groups, then compares the files in each group as binary strings, and then deduplicates the ones that match.

`jdupes` is a fork of `fdupes` that does duplicate discovery more efficiently by hashing as little as possible, which works really well on an SSD or when your files contain a very small number of duplicates.
But in other situations, like a file hierarchy with tons of duplicated files living on an HDD, it works quite miserably, since it generates a lot of disk `seek`s by doing file comparisons incrementally.
Meanwhile, since the fork, `fdupes` has added hashing into an `SQLite` database, similar to what `hoardy` does.

Comparing `hoardy`, `fdupes`, and `jdupes`, I notice the following:
- `hoardy` will not lose your data.

  `hoardy` will refuse to delete the last known copy of a file; it always checks that at least one copy of the content of each file it processes will still be available after it finishes doing whatever it's doing.

  `fdupes` and `jdupes` will happily delete everything if you ask, and it's quite easy to ask accidentally, literally a single key press.
  Also, they will happily delete your data in some of the situations discussed in "Frequently Asked Questions", even if you don't ask.

  Yes, usually, they work fine, but I recall restoring data from backups multiple times after using them.

- Unlike with `jdupes`, filesystem changes done by `hoardy deduplicate` are atomic with respect to power loss.

  `hoardy` implements `--hardlink` by `link`ing `source` to a `temp` file near `target`, and then `rename`ing it to the `target`, which, on a journaled filesystem, is atomic. Thus, after a power loss, either the `source` or the `target` will be in place of `target`. (A minimal sketch of this pattern is shown after this list.)

  `jdupes` renames the `target` file to `temp`, does `link source target`, and then `rm temp` instead. This is not atomic. Also, it probably does this to improve safety, but it does not actually help, since if the `target` is open by another process, that process can still write into it after the `rename` anyway.

  `fdupes` does not implement `--hardlink` at all.

- `hoardy` is aware of extended file attributes and won't ignore or lose them, unless you specifically ask.

  Meanwhile, both `fdupes` and `jdupes` ignore and then usually lose them when deduplicating.

- `jdupes` re-starts from zero if it gets interrupted, while `fdupes` and `hoardy` keep most of the progress on interrupt.

  `jdupes` has a `--softabort` option which helps with this issue somewhat, but it won't help if your machine crashes or loses power in the middle.

  `fdupes` lacks hardlinking support, and `jdupes` takes literally months of wall time to finish on my backups, even with files smaller than 1 MiB excluded, so both tools are essentially unusable for my use case.
  But if you have a smallish bunch of files sitting on an SSD, like a million or less, and you want to deduplicate them once and then never again, like if you are a computer service technician or something, then `jdupes` is probably the best solution.

  Meanwhile, both `fdupes` and `hoardy index` index all files into a database once, which does take quite a bit of time, but for billion-file hierarchies it takes days, not months, since all those files get accessed linearly. And that process can be interrupted at any time, including with a power loss, without losing most of the progress.
- Both `fdupes` and `hoardy` can apply incremental updates to already-`index`ed hierarchies, which take little time to re-`index`, assuming file sizes and/or `mtime`s change as they should.

  Except, `hoardy` also allows you to optionally tweak its `index` algorithm to save a bunch of disk accesses when run on file hierarchies where files only ever get added or removed, but their contents never change, which is common with backup directories; see `hoardy index --no-update`.

  Meanwhile, `fdupes` does not support this latter feature, and `jdupes` does not support database indexes at all.

- `hoardy` can both dump the outputs of `find-dupes --print0` and load them back with `deduplicate --stdin0`, allowing you to easily filter the files it would deduplicate.

  With a small number of files you can run `xargs -0 fdupes`, `xargs -0 jdupes`, or some such, but for large numbers that won't work.
  The number of inputs you can feed into `hoardy` is limited by your RAM, not by the OS command-line argument list size limit.

  Neither `fdupes` nor `jdupes` can do this.

- `hoardy deduplicate` can shard its inputs, allowing it to work with piles of files so large that even their metadata alone does not fit into RAM.

  Or, you can use that feature to run `deduplicate` on duplicate-disjoint self-contained chunks of its database, i.e. "deduplicate about 1/5 of all duplicates, please, taking slightly more than 1/5 of the time of the whole thing", without degrading the quality of the results.

  I.e., with `fdupes` and `jdupes` you can shard by running them on subsets of your inputs. But then, files shared by different inputs won't be deduplicated between them. In contrast, `hoardy` can shard by `SHA256`, which will result in everything being properly deduplicated.

  See the examples below.

  Neither `fdupes` nor `jdupes` can do this.
- Both `fdupes` and `hoardy` are faster than `jdupes` on large inputs, especially on HDDs.

  Both `fdupes` and `hoardy deduplicate` use indexed hashes to find pretty good approximate sets of potential duplicates very quickly on large inputs and walk the filesystem mostly linearly, which greatly improves performance on an HDD.

  In practice, I have not yet managed to become patient enough for `jdupes` to finish deduplicating my whole backup directory even once, and I once left it running for two months.
  Meanwhile, on my backups, `hoardy index` takes a couple of days, while `hoardy deduplicate` takes a couple of weeks of wall time, which can easily be done incrementally with sharding, see the examples.

  `fdupes` does not support hardlinking, and I'm not motivated enough to copy my whole backup hierarchy and run it, comparing its outputs to `hoardy deduplicate --delete`.

- Also, with both `fdupes` and `hoardy`, re-deduplication will skip re-doing most of the work.

- `hoardy deduplicate` is very good at RAM usage.

  It uses the database to allow a much larger working set to fit into RAM, since it can unload file metadata from RAM and re-load it from the database again at any moment.

  Also, it pre-computes hash usage counts and then uses them to report progress and evict finished duplicate groups from memory as soon as possible. So, in practice, on very large inputs, it will first eat a ton of memory (which, if it's an issue, can be solved by sharding), but then it will rapidly process and discard duplicate candidate groups, making all that memory available to other programs rather quickly again.

  Meaning, you can feed it a ton of whole-system backups made with `rsync` spanning decades, and it will work, and it will deduplicate them using reasonable amounts of time and memory.

  `fdupes` has the `--immediate` option, which performs somewhat similarly, but at the cost of losing all control over which files get deleted.
  `hoardy` is good by default, without compromises.

  `jdupes` can't do this at all.

- Unlike `fdupes` and `jdupes`, `hoardy find-dupes` reports same-hash+length files as duplicates even if they do not match as binary strings, which might not be what you want.

  Doing this allows `hoardy find-dupes` to compute potential duplicates without touching the indexed file hierarchies at all (when running with its default settings), improving performance greatly.

  On non-malicious files of sufficient size, the default `SHA256` hash function makes hash collisions highly improbable, so it's not really an issue, IMHO. But `fdupes` and `jdupes` are technically better at this.

  `hoardy deduplicate` does check file equality properly before doing anything destructive, similar to `fdupes`/`jdupes`, so hash collisions will not lose your data, but `hoardy find-dupes` will still list such files as duplicates.
In short, `hoardy` implements almost a union of the features of both `fdupes` and `jdupes`, with some more useful features on top, though with some little bits missing here and there; but `hoardy` is also significantly safer to use than either of the other two.
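For reference, a minimal shell sketch of the `link`-then-`rename` pattern mentioned above (file names are made up; `hoardy` itself does this internally, not via the shell):

```
# replace "target" with a hardlink to "source";
# on a journaled filesystem the rename() is atomic
ln source target.tmp
mv -f target.tmp target   # either the old or the new inode is at "target" at all times
```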
`RHash` is a "recursive hasher".
Basically, you give it a list of directories, it outputs `<hash digest> <path>` lines (or similar, it's configurable), and then, later, you can verify files against a file consisting of such lines.

It also has some nice features, like hashing with many hash functions simultaneously, skipping of already-hashed files present in the output file, etc.

Practically speaking, its usage is very similar to `hoardy index` followed by `hoardy verify`, except

- `RHash` can compute way more hash functions than `hoardy` (at the moment, `hoardy` only ever computes `SHA256`);

- for large indexed file hierarchies, `hoardy` is much faster at updating its indexes, since, unlike the plain-text files generated by `rhash`, `SQLite` databases can be modified easily and incrementally;

  also, all the similar `index`ing advantages from the previous subsection apply;

- `hoardy verify` can verify both hashes and file metadata;

- `hoardy`'s CLI is more convenient than `RHash`'s CLI, IMHO.

Many years before `hoardy` was born, I was using `RHash` quite extensively (and I remember the original forum it was discussed/developed at, yes).
See `CHANGELOG.md`.

See above, and also the bottom of `CHANGELOG.md`.
LGPLv3+ (because it will become a library, eventually).
Contributions are accepted both via GitHub issues and PRs, and via pure email.
In the latter case I expect to see patches formatted with `git-format-patch`.
If you want to perform a major change and you want it to be accepted upstream here, you should probably write me an email or open an issue on GitHub first. In the cover letter, describe what you want to change and why. I might also have a bunch of code doing most of what you want in my stash of unpublished patches already.
A thingy for hoarding digital assets.
- options:
  - `--version`: show program's version number and exit
  - `-h, --help`: show this help message and exit
  - `--markdown`: show `--help` formatted in Markdown
  - `-d DATABASE, --database DATABASE`: database file to use; default: `~/.local/share/hoardy/index.db` on POSIX, `%LOCALAPPDATA%\hoardy\index.db` on Windows
  - `--dry-run`: perform a trial run without actually performing any changes

- output defaults:
  - `--color`: set defaults to `--color-stdout` and `--color-stderr`
  - `--no-color`: set defaults to `--no-color-stdout` and `--no-color-stderr`

- output:
  - `--color-stdout`: color `stdout` output using ANSI escape sequences; default when `stdout` is connected to a TTY and environment variables do not set `NO_COLOR=1`
  - `--no-color-stdout`: produce plain-text `stdout` output without any ANSI escape sequences
  - `--color-stderr`: color `stderr` output using ANSI escape sequences; default when `stderr` is connected to a TTY and environment variables do not set `NO_COLOR=1`
  - `--no-color-stderr`: produce plain-text `stderr` output without any ANSI escape sequences
  - `--progress`: report progress to `stderr`; default when `stderr` is connected to a TTY
  - `--no-progress`: do not report progress

- filters:
  - `--size-leq INT`: `size <= value`
  - `--size-geq INT`: `size >= value`
  - `--sha256-leq HEX`: `sha256 <= from_hex(value)`
  - `--sha256-geq HEX`: `sha256 >= from_hex(value)`

- subcommands:
  - `{index,find,find-duplicates,find-dupes,deduplicate,verify,fsck,upgrade}`
    - `index`: index given filesystem trees and record results in a `DATABASE`
    - `find`: print paths of indexed files matching specified criteria
    - `find-duplicates (find-dupes)`: print groups of duplicated indexed files matching specified criteria
    - `deduplicate`: produce groups of duplicated indexed files matching specified criteria, and then deduplicate them
    - `verify (fsck)`: verify that the index matches the filesystem
    - `upgrade`: backup the `DATABASE` and then upgrade it to the latest format
Recursively walk given `INPUT`s and update the `DATABASE` to reflect them.

- For each `INPUT`, walk it recursively (both in the filesystem and in the `DATABASE`), and for each walked `path`:

  - if it is present in the filesystem but not in the `DATABASE`,
    - if `--no-add` is set, do nothing,
    - otherwise, index it and add it to the `DATABASE`;
  - if it is not present in the filesystem but present in the `DATABASE`,
    - if `--no-remove` is set, do nothing,
    - otherwise, remove it from the `DATABASE`;
  - if it is present in both,
    - if `--no-update` is set, do nothing,
    - if `--verify` is set, verify it as if `hoardy verify $path` was run,
    - if `--checksum` is set or if the file's `type`, `size`, or `mtime` changed,
      - re-index the file and update the `DATABASE` record,
    - otherwise, do nothing.
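For instance, under the defaults described above, typical invocations might look like this (paths are examples):

```
hoardy index /backup               # add new files, update changed records, drop vanished ones
hoardy index --no-update /backup   # only add and remove records; don't re-index existing files
hoardy index --no-remove /backup   # keep records of files that vanished or became unreadable
hoardy index --verify /backup      # re-check the filesystem against the DATABASE without updating it
hoardy index --no-ino /mnt/remote  # for filesystems with dynamic inode numbers (sshfs, unionfs, ...)
```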
-
positional arguments:
INPUT
: input files and/or directories to process
-
options:
-h, --help
: show this help message and exit--markdown
: show--help
formatted in Markdown--stdin0
: read zero-terminatedINPUT
s from stdin, these will be processed after allINPUTS
s specified as command-line arguments
-
output:
-v, --verbose
: increase output verbosity; can be specified multiple times for progressively more verbose output-q, --quiet, --no-verbose
: decrease output verbosity; can be specified multiple times for progressively less verbose output-l, --lf-terminated
: print output lines terminated with\n
(LF) newline characters; default-z, --zero-terminated, --print0
: print output lines terminated with\0
(NUL) bytes, implies--no-color
and zero verbosity
-
content hashing:
--checksum
: re-hash everything; i.e., assume that some files could have changed contents without changingtype
,size
, ormtime
--no-checksum
: skip hashing if filetype
,size
, andmtime
matchDATABASE
record; default
-
index how:
--add
: for files present in the filesystem but not yet present in theDATABASE
, index and add them to theDATABASE
; note that new files will be hashed even if--no-checksum
is set; default--no-add
: ignore previously unseen files--remove
: for files that vanished from the filesystem but are still present in theDATABASE
, remove their records from theDATABASE
; default--no-remove
: do not remove vanished files from the database--update
: for files present both on the filesystem and in theDATABASE
, if a file appears to have changed on disk (changedtype
,size
, ormtime
), re-index it and write its updated record to theDATABASE
; note that changed files will be re-hashed even if--no-checksum
is set; default--no-update
: skip updates for all files that are present both on the filesystem and in theDATABASE
--reindex
: an alias for--update --checksum
: for all files present both on the filesystem and in theDATABASE
, re-index them and then updateDATABASE
records of files that actually changed; i.e. re-hash files even if they appear to be unchanged--verify
: proceed like--update
does, but do not update any records in theDATABASE
; instead, generate errors if newly generated records do not match those already in theDATABASE
--reindex-verify
: an alias for--verify --checksum
: proceed like--reindex
does, but then--verify
instead of updating theDATABASE
- record what:

  - `--ino`: record inode numbers reported by `stat` into the `DATABASE`; default
  - `--no-ino`: ignore inode numbers reported by `stat`, recording them all as `0`s; this will force `hoardy` to ignore inode numbers in metadata checks and to process such files as if each path is its own inode when doing duplicate search;

    on most filesystems, the default `--ino` will do the right thing, but this option needs to be set explicitly when indexing files from a filesystem which uses dynamic inode numbers (`unionfs`, `sshfs`, etc); otherwise, files indexed from such filesystems will be updated on each re-`index`, and `find-duplicates`, `deduplicate`, and `verify` will always report them as having broken metadata
Print paths of files under `INPUT`s that match specified criteria.

- For each `INPUT`, walk it recursively (in the `DATABASE`), and for each walked `path`:
  - if the `path` and/or the file associated with that path matches the specified filters, print the `path`;
  - otherwise, do nothing.
-
positional arguments:
INPUT
: input files and/or directories to process
-
options:
-h, --help
: show this help message and exit--markdown
: show--help
formatted in Markdown--stdin0
: read zero-terminatedINPUT
s from stdin, these will be processed after allINPUTS
s specified as command-line arguments--porcelain
: print outputs in a machine-readable format
-
output:
-v, --verbose
: increase output verbosity; can be specified multiple times for progressively more verbose output-q, --quiet, --no-verbose
: decrease output verbosity; can be specified multiple times for progressively less verbose output-l, --lf-terminated
: print output lines terminated with\n
(LF) newline characters; default-z, --zero-terminated, --print0
: print output lines terminated with\0
(NUL) bytes, implies--no-color
and zero verbosity
Print groups of paths of duplicated files under `INPUT`s that match specified criteria.

- For each `INPUT`, walk it recursively (in the `DATABASE`), and for each walked `path`:

  - get its `group`, which is a concatenation of its `type`, `sha256` hash, and all metadata fields for which corresponding `--match-*` options are set; e.g., with `--match-perms --match-uid`, this produces a tuple of `type, sha256, mode, uid`;
  - get its `inode_id`, which is a tuple of `device_number, inode_number` for filesystems which report `inode_number`s, and a unique `int` otherwise;
  - record this `inode`'s metadata and `path` as belonging to this `inode_id`;
  - record this `inode_id` as belonging to this `group`.

- For each `group`, for each `inode_id` in `group`:

  - sort `path`s as `--order-paths` says,
  - sort `inode`s as `--order-inodes` says.

- For each `group`, for each `inode_id` in `group`, for each `path` associated to `inode_id`:

  - print the `path`.

Also, if you are reading the source code, note that the actual implementation of this command is a bit more complex than what is described above.
In reality, there's also a pre-computation step designed to filter out single-element `group`s very early, before loading most of the file metadata into memory, thus allowing `hoardy` to process groups incrementally, report its progress more precisely, and fit more potential duplicates into RAM.
In particular, this allows `hoardy` to work on `DATABASE`s with hundreds of millions of indexed files on my 2013-era laptop.
With the default verbosity, this command simply prints all `path`s in the resulting sorted order.

With verbosity of `1` (a single `--verbose`), each `path` in a `group` gets prefixed by:

- `__`, if it is the first `path` associated to an `inode`, i.e. this means this `path` introduces a previously unseen `inode`,
- `=>`, otherwise, i.e. this means that this `path` is a hardlink to the path last marked with `__`.

With verbosity of `2`, each `group` gets prefixed by a metadata line.

With verbosity of `3`, each `path` gets prefixed by its associated `inode_id`.

With the default spacing of `1`, a new line gets printed after each `group`.

With spacing of `2` (a single `--spaced`), a new line also gets printed after each `inode`.
-
positional arguments:
INPUT
: input files and/or directories to process
-
options:
-h, --help
: show this help message and exit--markdown
: show--help
formatted in Markdown--stdin0
: read zero-terminatedINPUT
s from stdin, these will be processed after allINPUTS
s specified as command-line arguments
- output:

  - `-v, --verbose`: increase output verbosity; can be specified multiple times for progressively more verbose output
  - `-q, --quiet, --no-verbose`: decrease output verbosity; can be specified multiple times for progressively less verbose output
  - `-l, --lf-terminated`: print output lines terminated with `\n` (LF) newline characters; default
  - `-z, --zero-terminated, --print0`: print output lines terminated with `\0` (NUL) bytes; implies `--no-color` and zero verbosity
  - `--spaced`: print more empty lines between different parts of the output; can be specified multiple times
  - `--no-spaced`: print fewer empty lines between different parts of the output; can be specified multiple times
-
duplicate file grouping defaults:
--match-meta
: set defaults to--match-device --match-permissions --match-owner --match-group
--ignore-meta
: set defaults to--ignore-device --ignore-permissions --ignore-owner --ignore-group
; default--match-extras
: set defaults to--match-xattrs
--ignore-extras
: set defaults to--ignore-xattrs
; default--match-times
: set defaults to--match-last-modified
--ignore-times
: set defaults to--ignore-last-modified
; default
-
duplicate file grouping; consider same-content files to be duplicates when they...:
--match-size
: ... have the same file size; default--ignore-size
: ... regardless of file size; only useful for debugging or discovering hash collisions--match-argno
: ... were produced by recursion from the same command-line argument (which is checked by comparingINPUT
indexes inargv
, if the path is produced by several different arguments, the smallest one is taken)--ignore-argno
: ... regardless of whichINPUT
they came from; default--match-device
: ... come from the same device/mountpoint/drive--ignore-device
: ... regardless of devices/mountpoints/drives; default--match-perms, --match-permissions
: ... have the same file modes/permissions--ignore-perms, --ignore-permissions
: ... regardless of file modes/permissions; default--match-owner, --match-uid
: ... have the same owner id--ignore-owner, --ignore-uid
: ... regardless of owner id; default--match-group, --match-gid
: ... have the same group id--ignore-group, --ignore-gid
: ... regardless of group id; default--match-last-modified, --match-mtime
: ... have the samemtime
--ignore-last-modified, --ignore-mtime
: ... regardless ofmtime
; default--match-xattrs
: ... have the same extended file attributes--ignore-xattrs
: ... regardless of extended file attributes; default
-
sharding:
-
--shard FROM/TO/SHARDS|SHARDS|NUM/SHARDS
: split database into a number of disjoint pieces (shards) and process a range of them:- with
FROM/TO/SHARDS
specified, split database intoSHARDS
shards and then process those with numbers betweenFROM
andTO
(both including, counting from1
); - with
SHARDS
syntax, interpret it as1/SHARDS/SHARDS
, thus processing the whole database by splitting it intoSHARDS
pieces first; - with
NUM/SHARDS
, interpret it asNUM/NUM/SHARDS
, thus processing a single shardNUM
ofSHARDS
; - default:
1/1/1
,1/1
, or just1
, which processes the whole database as a single shard;
- with
-
-
--order-*
defaults:--order {mtime,argno,abspath,dirname,basename}
: set all--order-*
option defaults to the given value, except specifying--order mtime
will set the default--order-paths
toargno
instead (since all of the paths belonging to the sameinode
have the samemtime
); default:mtime
-
order of elements in duplicate file groups:
-
--order-paths {argno,abspath,dirname,basename}
: in eachinode
info record, orderpath
s by:argno
: the correspondingINPUT
's index inargv
, if apath
is produced by several different arguments, the index of the first of them is used; defaultabspath
: absolute file pathdirname
: absolute file path without its last componentbasename
: the last component of absolute file path
-
--order-inodes {mtime,argno,abspath,dirname,basename}
: in each duplicate filegroup
, orderinode
info records by:argno
: same as--order-paths argno
mtime
: file modification time; defaultabspath
: same as--order-paths abspath
dirname
: same as--order-paths dirname
basename
: same as--order-paths basename
When an
inode
has several associatedpath
s, sorting byargno
,abspath
,dirname
, andbasename
is performed by taking the smallest of the respective values.For instance, a duplicate file
group
that looks like the following when ordered with--order-inodes mtime --order-paths abspath
:__ 1/3 => 1/4 __ 2/5 => 2/6 __ 1/2 => 2/1
will look like this, when ordered with
--order-inodes basename --order-paths abspath
:__ 1/2 => 2/1 __ 1/3 => 1/4 __ 2/5 => 2/6
-
--reverse
: when sorting, invert all comparisons
-
-
duplicate file group filters:
--min-paths MIN_PATHS
: only process duplicate file groups with at least this manypath
s; default:2
--min-inodes MIN_INODES
: only process duplicate file groups with at least this manyinodes
; default:2
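To tie the grouping, ordering, and sharding options together, a couple of hypothetical invocations:

```
# list duplicate groups that also match in permissions and ownership,
# but only groups spanning at least 3 paths
hoardy find-duplicates --match-meta --min-paths 3 /backup

# process only the second quarter of the database, with inode ordering reversed
hoardy find-duplicates --shard 2/4 --reverse /backup
```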
Produce groups of duplicated indexed files matching specified criteria, similarly to how `find-duplicates` does, except with much stricter default `--match-*` settings, and then deduplicate the resulting files by hardlinking them to each other.

- Proceed exactly as `find-duplicates` does in its step 1.
- Proceed exactly as `find-duplicates` does in its step 2.
- For each `group`:

  - assign the first `path` of the first `inode_id` as `source`,
  - print `source`,
  - for each `inode_id` in `group`, for each `inode` and `path` associated to an `inode_id`:

    - check that the `inode` metadata matches the filesystem metadata of `path`,
      - if it does not, print an error and skip this `inode_id`,
    - if the `path` is the `source`, continue with other `path`s;
    - if `--paranoid` is set, or if this is the very first `path` of `inode_id`,
      - check whether the file data/contents of `path` matches the file data/contents of `source`,
        - if it does not, print an error and skip this `inode_id`,
    - if `--hardlink` is set, hardlink `source -> path`,
    - if `--delete` is set, `unlink` the `path`,
    - update the `DATABASE` accordingly.
The verbosity and spacing semantics are similar to the ones used by `find-duplicates`, except this command starts at verbosity of `1`, i.e. as if a single `--verbose` were specified by default.

Each processed `path` gets prefixed by:

- `__`, if this is the very first `path` in a `group`, i.e. this is a `source`,
- when `--hardlink`ing:
  - `=>`, if this is a non-`source` `path` associated to the first `inode`, i.e. it's already hardlinked to `source` on disk, so processing of this `path` was skipped,
  - `ln`, if this `path` was successfully hardlinked (to an equal `source`),
- when `--delete`ing:
  - `rm`, if this `path` was successfully deleted (while an equal `source` was kept),
- `fail`, if there was an error while processing this `path` (which will be reported to `stderr`).
- positional arguments:
  - `INPUT` : input files and/or directories to process

- options:
  - `-h, --help` : show this help message and exit
  - `--markdown` : show `--help` formatted in Markdown
  - `--stdin0` : read zero-terminated `INPUT`s from stdin, these will be processed after all `INPUT`s specified as command-line arguments
- output:
  - `-v, --verbose` : increase output verbosity; can be specified multiple times for progressively more verbose output
  - `-q, --quiet, --no-verbose` : decrease output verbosity; can be specified multiple times for progressively less verbose output
  - `-l, --lf-terminated` : print output lines terminated with `\n` (LF) newline characters; default
  - `-z, --zero-terminated, --print0` : print output lines terminated with `\0` (NUL) bytes, implies `--no-color` and zero verbosity
  - `--spaced` : print more empty lines between different parts of the output; can be specified multiple times
  - `--no-spaced` : print fewer empty lines between different parts of the output; can be specified multiple times
- duplicate file grouping defaults:
  - `--match-meta` : set defaults to `--match-device --match-permissions --match-owner --match-group`; default
  - `--ignore-meta` : set defaults to `--ignore-device --ignore-permissions --ignore-owner --ignore-group`
  - `--match-extras` : set defaults to `--match-xattrs`; default
  - `--ignore-extras` : set defaults to `--ignore-xattrs`
  - `--match-times` : set defaults to `--match-last-modified`
  - `--ignore-times` : set defaults to `--ignore-last-modified`; default
- duplicate file grouping; consider same-content files to be duplicates when they...:
  - `--match-size` : ... have the same file size; default
  - `--ignore-size` : ... regardless of file size; only useful for debugging or discovering hash collisions
  - `--match-argno` : ... were produced by recursion from the same command-line argument (which is checked by comparing `INPUT` indexes in `argv`, if the path is produced by several different arguments, the smallest one is taken)
  - `--ignore-argno` : ... regardless of which `INPUT` they came from; default
  - `--match-device` : ... come from the same device/mountpoint/drive; default
  - `--ignore-device` : ... regardless of devices/mountpoints/drives
  - `--match-perms, --match-permissions` : ... have the same file modes/permissions; default
  - `--ignore-perms, --ignore-permissions` : ... regardless of file modes/permissions
  - `--match-owner, --match-uid` : ... have the same owner id; default
  - `--ignore-owner, --ignore-uid` : ... regardless of owner id
  - `--match-group, --match-gid` : ... have the same group id; default
  - `--ignore-group, --ignore-gid` : ... regardless of group id
  - `--match-last-modified, --match-mtime` : ... have the same `mtime`
  - `--ignore-last-modified, --ignore-mtime` : ... regardless of `mtime`; default
  - `--match-xattrs` : ... have the same extended file attributes; default
  - `--ignore-xattrs` : ... regardless of extended file attributes
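For example, a hedged sketch of tightening or relaxing these grouping criteria using the flags documented above (`/backup` as in the examples further below):

```
# also require equal modification times when grouping duplicates
hoardy deduplicate --match-times /backup

# group by content only, ignoring device, ownership, permissions, and xattrs
hoardy deduplicate --ignore-meta --ignore-extras /backup
```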
- sharding:
  - `--shard FROM/TO/SHARDS|SHARDS|NUM/SHARDS` : split database into a number of disjoint pieces (shards) and process a range of them:
    - with `FROM/TO/SHARDS` specified, split database into `SHARDS` shards and then process those with numbers between `FROM` and `TO` (both inclusive, counting from `1`);
    - with the `SHARDS` syntax, interpret it as `1/SHARDS/SHARDS`, thus processing the whole database by splitting it into `SHARDS` pieces first;
    - with `NUM/SHARDS`, interpret it as `NUM/NUM/SHARDS`, thus processing a single shard `NUM` of `SHARDS`;
    - default: `1/1/1`, `1/1`, or just `1`, which processes the whole database as a single shard;
- `--order-*` defaults:
  - `--order {mtime,argno,abspath,dirname,basename}` : set all `--order-*` option defaults to the given value, except specifying `--order mtime` will set the default `--order-paths` to `argno` instead (since all of the paths belonging to the same `inode` have the same `mtime`); default: `mtime`
- order of elements in duplicate file groups; note that unlike with `find-duplicates`, these settings influence not only the order in which the elements are printed, but also which files get kept and which get replaced with `--hardlink`s to kept files or get `--delete`d:

  - `--order-paths {argno,abspath,dirname,basename}` : in each `inode` info record, order `path`s by:

    - `argno` : the corresponding `INPUT`'s index in `argv`, if a `path` is produced by several different arguments, the index of the first of them is used; default
    - `abspath` : absolute file path
    - `dirname` : absolute file path without its last component
    - `basename` : the last component of absolute file path

  - `--order-inodes {mtime,argno,abspath,dirname,basename}` : in each duplicate file `group`, order `inode` info records by:

    - `argno` : same as `--order-paths argno`
    - `mtime` : file modification time; default
    - `abspath` : same as `--order-paths abspath`
    - `dirname` : same as `--order-paths dirname`
    - `basename` : same as `--order-paths basename`

    When an `inode` has several associated `path`s, sorting by `argno`, `abspath`, `dirname`, and `basename` is performed by taking the smallest of the respective values.

    For instance, a duplicate file `group` that looks like the following when ordered with `--order-inodes mtime --order-paths abspath`:

    ```
    __ 1/3
    => 1/4
    __ 2/5
    => 2/6
    __ 1/2
    => 2/1
    ```

    will look like this, when ordered with `--order-inodes basename --order-paths abspath`:

    ```
    __ 1/2
    => 2/1
    __ 1/3
    => 1/4
    __ 2/5
    => 2/6
    ```

  - `--reverse` : when sorting, invert all comparisons
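Because the first `inode` of each group becomes the `source` that all other copies get hardlinked to (or kept in place of deleted ones), these options effectively choose which copy survives. A short sketch, assuming `/backup` as in the examples further below:

```
# keep the file with the earliest known mtime in each group (the default ordering)
hoardy deduplicate /backup

# keep the file whose absolute path sorts last instead
hoardy deduplicate --reverse --order-inodes abspath /backup
```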
- duplicate file group filters:
  - `--min-paths MIN_PATHS` : only process duplicate file groups with at least this many `path`s; default: `2`
  - `--min-inodes MIN_INODES` : only process duplicate file groups with at least this many `inodes`; default: `2` when `--hardlink` is set, `1` when `--delete` is set
- deduplicate how:
  - `--hardlink, --link` : deduplicate duplicated file groups by replacing all but the very first file in each group with hardlinks to it (hardlinks go from destination file to source file); see the "Algorithm" section above for a longer explanation; default
  - `--delete, --unlink` : deduplicate duplicated file groups by deleting all but the very first file in each group; see the `--order-*` options for how to influence which file would be the first
  - `--sync` : batch changes, apply them right before commit, `fsync` all affected directories, and only then commit changes to the `DATABASE`; this way, after a power loss, the next `deduplicate` will at least notice those files being different from their records; default
  - `--no-sync` : perform all changes eagerly without `fsync`ing anything, commit changes to the `DATABASE` asynchronously; not recommended unless your machine is powered by a battery/UPS; otherwise, after a power loss, the `DATABASE` will likely be missing records about files that still exist, i.e. you will need to re-`index` all `INPUTS` to make the database state consistent with the filesystems again
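A hedged sketch of the non-default modes described above (`/backup` as in the examples further below):

```
# delete all but one copy in each group instead of hardlinking;
# which copy is kept is decided by the --order-* options
hoardy deduplicate --delete /backup

# trade crash-safety of the database for speed on a battery-backed machine
hoardy deduplicate --no-sync /backup
```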
- before `--hardlink`ing or `--delete`ing a target, check that source and target...:
  - `--careful` : ... inodes have equal data contents, once for each new inode; i.e. check that source and target have the same data contents as efficiently as possible; assumes that no files change while `hoardy` is running
  - `--paranoid` : ... paths have equal data contents, for each pair of them; this can be slow (though it usually is not), but it guarantees that `hoardy` won't lose data even if other internal functions are buggy; it will also usually, though not always, prevent data loss if files change while `hoardy` is running, see the "Quirks and Bugs" section of the `README.md` for discussion; default
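For instance, on a filesystem that is guaranteed to stay quiescent while `hoardy` runs, one might relax the default `--paranoid` checks (a sketch, not a recommendation):

```
# verify each new inode's contents against the source once,
# instead of re-checking every path pair as --paranoid does
hoardy deduplicate --careful /backup
```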
Verify that indexed files from under `INPUT`s that match specified criteria exist on the filesystem and that their metadata and hashes match filesystem contents.
- For each `INPUT`, walk it recursively (in the filesystem), for each walked `path`:
  - fetch its `DATABASE` record,
  - if `--checksum` is set or if file `type`, `size`, or `mtime` is different from the one in the `DATABASE` record,
    - re-index the file,
  - for each field:
    - if its value matches the one in the `DATABASE` record, do nothing;
    - otherwise, if the `--match-<field>` option is set, print an error;
    - otherwise, print a warning.
This command runs with an implicit `--match-sha256` option which cannot be disabled, so hash mismatches always produce errors.
- positional arguments:
  - `INPUT` : input files and/or directories to process

- options:
  - `-h, --help` : show this help message and exit
  - `--markdown` : show `--help` formatted in Markdown
  - `--stdin0` : read zero-terminated `INPUT`s from stdin, these will be processed after all `INPUT`s specified as command-line arguments
- output:
  - `-v, --verbose` : increase output verbosity; can be specified multiple times for progressively more verbose output
  - `-q, --quiet, --no-verbose` : decrease output verbosity; can be specified multiple times for progressively less verbose output
  - `-l, --lf-terminated` : print output lines terminated with `\n` (LF) newline characters; default
  - `-z, --zero-terminated, --print0` : print output lines terminated with `\0` (NUL) bytes, implies `--no-color` and zero verbosity
- content verification:
  - `--checksum` : verify all file hashes; i.e., assume that some files could have changed contents without changing `type`, `size`, or `mtime`; default
  - `--no-checksum` : skip hashing if file `type`, `size`, and `mtime` match the `DATABASE` record
- verification defaults:
  - `--match-meta` : set defaults to `--match-permissions`; default
  - `--ignore-meta` : set defaults to `--ignore-permissions`
  - `--match-extras` : set defaults to `--match-xattrs`; default
  - `--ignore-extras` : set defaults to `--ignore-xattrs`
  - `--match-times` : set defaults to `--match-last-modified`
  - `--ignore-times` : set defaults to `--ignore-last-modified`; default
- verification; consider a file to be `ok` when it and its `DATABASE` record...:
  - `--match-size` : ... have the same file size; default
  - `--ignore-size` : ... regardless of file size; only useful for debugging or discovering hash collisions
  - `--match-perms, --match-permissions` : ... have the same file modes/permissions; default
  - `--ignore-perms, --ignore-permissions` : ... regardless of file modes/permissions
  - `--match-last-modified, --match-mtime` : ... have the same `mtime`
  - `--ignore-last-modified, --ignore-mtime` : ... regardless of `mtime`; default
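A hedged sketch of typical runs of this command; the subcommand name `verify` is an assumption here, since it is not spelled out in the text above:

```
# quick check: re-hash only files whose type, size, or mtime changed
hoardy verify --no-checksum /backup

# thorough check (the default): re-hash everything and also require matching mtimes
hoardy verify --match-times /backup
```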
Back up the `DATABASE` and then upgrade it to the latest format.

This exists for development purposes.
You don't need to call this explicitly as, normally, database upgrades are completely automatic.

- options:
  - `-h, --help` : show this help message and exit
  - `--markdown` : show `--help` formatted in Markdown
- Index all files in `/backup`:

  ```
  hoardy index /backup
  ```

- Search paths of files present in `/backup`:

  ```
  hoardy find /backup | grep something
  ```

- List all duplicated files in `/backup`, i.e. list all files in `/backup` that have multiple on-disk copies with same contents but using different inodes:

  ```
  hoardy find-dupes /backup | tee dupes.txt
  ```

- Same as above, but also include groups consisting solely of hardlinks to the same inode:

  ```
  hoardy find-dupes --min-inodes 1 /backup | tee dupes.txt
  ```

- Produce exactly the same duplicate file groups as those the following `deduplicate` would use by default:

  ```
  hoardy find-dupes --match-meta /backup | tee dupes.txt
  ```

- Deduplicate `/backup` by replacing files that have exactly the same metadata and contents (but with any `mtime`) with hardlinks to a file with the earliest known `mtime` in each such group:

  ```
  hoardy deduplicate /backup
  ```

- Deduplicate `/backup` by replacing same-content files larger than 1 KiB with hardlinks to a file with the latest `mtime` in each such group:

  ```
  hoardy deduplicate --size-geq 1024 --reverse --ignore-meta /backup
  ```

  This plays well with directories produced by `rsync --link-dest` and `rsnapshot`.

- Similarly, but for each duplicate file group use a file with the largest absolute path (in lexicographic order) as the source for all generated hardlinks:

  ```
  hoardy deduplicate --size-geq 1024 --ignore-meta --reverse --order-inodes abspath /backup
  ```

- When you have enough indexed files that a run of `find-duplicates` or `deduplicate` stops fitting into RAM, you can process your database piecemeal by sharding by `SHA256` hash digests:

  ```
  # shard the database into 4 pieces and then process each piece separately
  hoardy find-dupes --shard 4 /backup
  hoardy deduplicate --shard 4 /backup

  # assuming the previous command was interrupted in the middle, continue from shard 2 of 4
  hoardy deduplicate --shard 2/4/4 /backup

  # shard the database into 4 pieces, but only process the first one of them
  hoardy deduplicate --shard 1/4 /backup

  # uncertain amounts of time later...
  # (possibly, after a reboot)

  # process piece 2
  hoardy deduplicate --shard 2/4 /backup
  # then piece 3
  hoardy deduplicate --shard 3/4 /backup

  # or, equivalently, process pieces 2 and 3 one after the other
  hoardy deduplicate --shard 2/3/4 /backup

  # uncertain amounts of time later...

  # process piece 4
  hoardy deduplicate --shard 4/4 /backup
  ```

  With `--shard SHARDS` set, `hoardy` takes about `1/SHARDS` of the amount of RAM, but produces exactly the same result as if you had enough RAM to run it with the default `--shard 1`, except it prints/deduplicates duplicate file groups in pseudo-randomly different order and trades RAM usage for longer total run time.

- Alternatively, you can shard the database manually with filters:

  ```
  # deduplicate files larger than 100 MiB
  hoardy deduplicate --size-geq 104857600 /backup

  # deduplicate files between 1 and 100 MiB
  hoardy deduplicate --size-geq 1048576 --size-leq 104857600 /backup

  # deduplicate files between 16 bytes and 1 MiB
  hoardy deduplicate --size-geq 16 --size-leq 1048576 /backup

  # deduplicate about half of the files
  hoardy deduplicate --sha256-leq 7f /backup

  # deduplicate the other half
  hoardy deduplicate --sha256-geq 80 /backup
  ```

  The `--shard` option does something very similar to the latter example.
Sanity check and test `hoardy` command-line interface.

- Run internal tests:

  ```
  ./test-hoardy.sh default
  ```

- Run fixed-output tests on a given directory:

  ```
  ./test-hoardy.sh ~/rarely-changing-path
  ```

  This will copy the whole contents of that path to `/tmp` first.