Presents files as deduplicated, content-addressed 1MB chunks with selectable hash algorithms.
DedupeFS is a FUSE filesystem built on top of my Crazy Deduper application. It is, in a sense, the logical successor to SCFS. While SCFS presented each file as chunks that were independent of each other, DedupeFS calculates a checksum for each chunk and collects all chunks in a single directory. That way, each unique chunk is presented only once, even if it is used by multiple files.
DedupeFS is mainly useful for creating efficient backups and uploading them to a cloud provider. The file chunks have the advantage that an upload does not have to be all-or-nothing: if your internet connection vanishes for a second, your 4GB file upload will not be cancelled completely; only the upload of the currently transferred chunk is aborted.
By keeping multiple cache files around, you can easily and efficiently create incremental backups that all share the same chunks.
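A typical round trip might look like the following sketch. The use of rclone and the remote name are assumptions here, not part of DedupeFS; any uploader that can retry individual files will do:

dedupefs --cache-file cache.json.zst source deduped   # mount the deduplicated view
rclone sync deduped remote:backup                     # an interrupted chunk is simply re-uploaded
fusermount -u deduped                                 # unmount when done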
This tool can be installed easily through Cargo via crates.io:
cargo install --locked dedupefs
Please note that the --locked flag is necessary here to get exactly the same dependencies as when the application was tagged and tested. Without it, you might get more up-to-date versions of the dependencies, but you risk undefined and unexpected behavior if those dependencies changed some functionality. The application might even fail to build if the public API of a dependency changed too much.
Alternatively, pre-built binaries can be downloaded from the GitHub releases page.
Usage: dedupefs [OPTIONS] <SOURCE> <MOUNTPOINT>
Arguments:
<SOURCE>
Source directory
<MOUNTPOINT>
Mount point
Options:
--cache-file <CACHE_FILE>
Path to cache file
Can be used multiple times. The files are read in reverse order, so they should be sorted with the most accurate ones in the beginning. The first given will be written.
--hashing-algorithm <HASHING_ALGORITHM>
Hashing algorithm to use for chunk filenames
[default: sha1]
[possible values: md5, sha1, sha256, sha512]
-f, --foreground
Stay in foreground, do not daemonize into the background
--declutter-levels <DECLUTTER_LEVELS>
Declutter files into this many subdirectory levels
[default: 3]
--reverse
Reverse mode, present chunks re-hydrated
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
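To give an intuition for --declutter-levels: instead of putting every chunk into one flat directory, chunks are spread over nested prefix directories. The path below is only an illustration, not a promise about the real naming scheme; with the default of 3 levels, a chunk whose hash starts with ab12cd might end up at a path like:

data/ab/12/cd/ab12cd…

This keeps individual directories small, which many filesystems and cloud storage browsers handle much better than a single directory with millions of entries.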
To mount a deduped version of source directory to deduped, you can use:
dedupefs --cache-file cache.json.zst source deduped
If the cache file name ends with .zst, it will be encoded (or decoded when reading, e.g. in the case of re-hydrating) using the ZSTD compression algorithm. For any other extension, plain JSON will be used.
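Since the cache is plain JSON (optionally ZSTD-compressed), you can peek at it with standard tools; for example, assuming zstd and jq are installed:

zstdcat cache.json.zst | jq . | less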
To mount a re-hydrated version of deduped directory to restored, you can use:
dedupefs --reverse --cache-file cache.json.zst deduped restored
Before mounting, DedupeFS checks whether all chunks referenced in the cache file are available in the deduped/data directory. If not, the mount will fail.
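A complete restore could then look like this sketch (the target path is just a placeholder):

dedupefs --reverse --cache-file cache.json.zst deduped restored
cp -a restored/. /path/to/restore-target/   # copy the re-hydrated files out of the mount
fusermount -u restored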
The cache file is necessary to keep track of all file chunks and their hashes; without it, you would not be able to restore your files.
The cache file can be re-used even if the source directory has changed. It keeps track of file sizes and modification times and only re-hashes new or changed files. Deleted files are removed from the cache.
You can also use older cache files in addition to a new one:
dedupefs --cache-file cache.json.zst --cache-file cache-from-yesterday.json.zst source deduped
The cache files are read in the reverse of the order in which they are given on the command line, so the content of earlier cache files is preferred over later ones. Hence, you should put your most accurate cache files first. Moreover, the first cache file given is the one that will be written to; it does not need to exist.
In the given example, if cache.json.zst does not exist, the internal cache is pre-filled from
cache-from-yesterday.json.zst so that only new and modified files need to be re-hashed. The result is then written
into cache.json.zst.
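One way to put this together is a dated rotation scheme; the file names here are only an example:

# day 1: full hash run, the cache is created from scratch
dedupefs --cache-file cache-2024-06-01.json.zst source deduped
# day 2: seed from yesterday's cache, write today's
dedupefs --cache-file cache-2024-06-02.json.zst --cache-file cache-2024-06-01.json.zst source deduped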
In the mounted deduped directory, the first cache file given on the command line will be presented under the same name directly below the mountpoint, next to the data directory. When uploading your chunks, always make sure to also upload this cache file, otherwise you will not be able to properly re-hydrate your files afterward!
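Put together, the mounted view looks roughly like this; the chunk paths are illustrative, not the exact scheme:

deduped/
├── cache.json.zst          (the first cache file given on the command line)
└── data/
    ├── ab/12/cd/ab12cd…    (one file per unique chunk)
    └── ...

Because the cache file lives inside the mountpoint, a recursive sync of the whole mountpoint automatically includes it.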
There are several helper commands available to work with the cache files and to inspect the internal state of the deduplicated data chunks:
Check if cache file is valid and all chunks exist.
Usage: dedupefs_check_cache [OPTIONS] <SOURCE>
Arguments:
<SOURCE>
Source directory to deduped files
Options:
--cache-file <CACHE_FILE>
Path to cache file
Can be used multiple times. The files are read in reverse order, so they should be sorted with the most accurate ones in the beginning. They will only be read, not written.
--declutter-levels <DECLUTTER_LEVELS>
Declutter files into this many subdirectory levels
[default: 3]
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
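For example, to verify a deduped directory against its cache before (or after) uploading it:

dedupefs_check_cache --cache-file cache.json.zst deduped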
Only create cache file without actually mounting.
Usage: dedupefs_create_cache [OPTIONS] <SOURCE>
Arguments:
<SOURCE>
Source directory
Options:
--cache-file <CACHE_FILE>
Path to cache file
Can be used multiple times. The files are read in reverse order, so they should be sorted with the most accurate ones in the beginning. The first given will be written.
--hashing-algorithm <HASHING_ALGORITHM>
Hashing algorithm to use for chunk filenames
[default: sha1]
[possible values: md5, sha1, sha256, sha512]
--declutter-levels <DECLUTTER_LEVELS>
Declutter files into this many subdirectory levels
[default: 3]
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
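This is useful to pre-compute the hashes, e.g. from a nightly job, without keeping a mount around:

dedupefs_create_cache --cache-file cache.json.zst source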
Delete files not present in any cache files.
Usage: dedupefs_delete_extra_files [OPTIONS] <SOURCE>
Arguments:
<SOURCE>
Source directory
Options:
--cache-file <CACHE_FILE>
Path to cache file
Can be used multiple times. The files are read in reverse order, so they should be sorted with the most accurate ones in the beginning. They will only be read, not written.
-v
List deleted files
-f
Force deletion
--declutter-levels <DECLUTTER_LEVELS>
Declutter files into this many subdirectory levels
[default: 3]
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
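For example, to prune chunks that are no longer referenced by any of the given cache files, listing every deleted file (-v) and forcing the deletion (-f); here deduped is assumed to be the directory holding the deduplicated chunks:

dedupefs_delete_extra_files -v -f --cache-file cache.json.zst deduped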
List files not present in any cache files.
Usage: dedupefs_list_extra_files [OPTIONS] <SOURCE>
Arguments:
<SOURCE>
Source directory
Options:
--cache-file <CACHE_FILE>
Path to cache file
Can be used multiple times. The files are read in reverse order, so they should be sorted with the most accurate ones in the beginning. They will only be read, not written.
-0
Separate file names with null character instead of newline
--declutter-levels <DECLUTTER_LEVELS>
Declutter files into this many subdirectory levels
[default: 3]
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
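The null-separated output (-0) is safe for arbitrary file names and combines well with xargs -0, for example to see how much space the extra files occupy (note that xargs may split very long lists into several du invocations):

dedupefs_list_extra_files -0 --cache-file cache.json.zst deduped | xargs -0 -r du -ch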
List chunks from cache files that are not present in the source directory.
Usage: dedupefs_list_missing_chunks [OPTIONS] <SOURCE>
Arguments:
<SOURCE>
Source directory
Options:
--cache-file <CACHE_FILE>
Path to cache file
Can be used multiple times. The files are read in reverse order, so they should be sorted with the most accurate ones in the beginning. They will only be read, not written.
--with-reason
Also display the reason for the missing or invalid chunk
-0
Separate file names with null character instead of newline
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
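For example, to audit a partially uploaded chunk store and see why each chunk is considered missing or invalid:

dedupefs_list_missing_chunks --with-reason --cache-file cache.json.zst deduped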
- Make the chunk size configurable (via Crazy Deduper; fixed at 1MB at the moment).
- Provide better documentation with examples and use case descriptions.