INTRODUCTION

There are many technologies related to file deduplication: filesystems like btrfs, lessfs, or s3ql, and version managers like git. But none of them is as simple and yet as versatile as "backunion".

SIMPLE AND VERSATILE

"backunion" relies on a union mount. If you understand what a union mount does, you will understand how "backunion" works. It simply dedupes files from the writable branch down to the lower branch of a classic two-layer union: "in-out=RW:store/tree=RO" (a mount sketch is given after the list below).

"backunion" weighs only ~200 instructions and combines well-known command-line tools like "find", "md5sum", and "ln". It doesn't even need a database. And yet "backunion" allows many clients to be backed up on a single server by performing asynchronous global deduping on both the client and server sides:

* One may run "backunion" on the server side and share the union mount via regular file protocols, which suits closed or thin clients.
* One may run "backunion" on each client so as to deduplicate files up to a central shared "store" folder (hard-link support is the minimal requirement for such a server).
* Client-side deduping (for open and powerful devices) and server-side deduping can be performed simultaneously (n clients + 1 server) on the same "store" folder.
* Transparent: the end user may access and modify deduped files without side effects, thanks to the copy-on-write feature of union mounts.
* And even more: postponable deduping, 2nd-level backup/sync awareness, ...
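For reference, the underlying mount is roughly the following (a sketch assuming the unionfs-fuse binary is named "unionfs" and using its "cow" option; the actual script hard-codes its own invocation):

    $ mkdir -p in-out store/tree union
    $ unionfs -o cow in-out=RW:store/tree=RO union    # writable branch over the read-only store

Writes land in "in-out", reads fall through to "store/tree", and modifying a read-only file triggers a copy-up into "in-out".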
Run "backunion" at least once and share the obtained "./union" folder via any suited file protocol (eg. CIFS). Run "backunion" again to deduplicate uploaded client files. This process can be postponed to low activity periods. How to allow client-side deduping? To allow bandwith-efficient client-side deduping, share the "./store" folder obtained by a previous invocation of "backunion" via a file protocol allowing hard links (eg. NFS). Then simply mount the shared folder on each client in place of the default "./store" folder. That's all. Running "backunion" on a client performs file deduping on client side and upload only new files to the store. SO, HOW DOES BACKUNION WORK? "backunion" relays on an union mount merging an "in-out" writable folder on top of a "store/tree" read-only folder. It leverages on the "copy-on-write" feature (aka "copy-up") of union mounts. Note there is several union mount implementation on Linux. At the moment writing, "backunion" relays on unionfs-fuse (hard-coded). "backunion" eliminates duplicates by: - moving all files from the "in-out" folder to "store/data" by renaming them according to their own checksums. - postponing costly crypto checksums as long as possible (à la FSlint). - hard-linking renamed files to their cloned locations within the "store/tree" folder. "backunion" is as simple as possible but not too much as it manages: * file deletion ("whiteouts" translated into link removals) * garbage collector (unreferred data removed from the store) * tricky updates (eg. when a single file takes place of a full folder) * opened files (not deduped) * concurrent deduping processes/clients thanks to (fine enough) locks "backunion" won't manage: * 2 deduping process within the same client. Should not be. Period. * when clients conflict about a file named like a folder. * dedup miss because of a concurrent GC wrongly removing a content from the store. Not critical. BENEFITS OF THE UNION MOUNT IDEA The union mount merges "in-out" and "store/tree" folders together, so the deduping process can be proceed in background while being transparent to the end user. The union mount performs "copy on write" when needed so the end user may still modify deduped files without messing the store (typical unwanted effects of hard-link deduping). Plus the "copy up" behavior separates modified files from the unmodified deduped files in the read-only branch so that invoking the script again will only dedup touched files. Please note that directly messing with branches of an union mount is known to be risky. However "backunion" does it in such a way that all is OK (but be careful with race conditions listed hereafter). PENDING ISSUES AND ROOM FOR FUTURE IMPROVEMENTS The GC may remove a file put by a concurrent client. It only leads to a dedup miss. Not critical. 2 deduping clients could conflict because one puts a file in place where the other puts a folder. I'm still thinking about the most simple way to handle this. Don't launch 2 processes deduping the same in-out.
HOW TO USE "BACKUNION"?

"backunion" is simply designed. No parameters. No surprises. If you want customization, well, edit the code.

How to use "backunion" on a single machine?

1) Put the script into a backup folder and run it once.
==> Three subfolders, "./in-out", "./store", and "./union", are created.
==> A union filesystem is mounted on "./union", merging "./in-out" with "./store/tree".

2) Fill up "./union" with your files.
==> Your data are actually stored in "./in-out".

3) Run the script again, and voilà! All your files are deduped into "./store/data" and hard-linked into "./store/tree".
==> The union mount is still up and ready. It offers a consistent, writable file tree.

*) You can change files within the union mount and run "backunion" again to deduplicate these changes. It works as expected: a change won't be propagated to the other instances of a deduped file.
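For instance, a first session could look like this (the "~/backup" location is arbitrary):

    $ mkdir ~/backup && cd ~/backup
    $ cp /path/to/backunion . && ./backunion    # creates in-out/, store/, union/ and mounts the union
    $ cp -r ~/Documents union/                  # new files actually land in in-out/
    $ ./backunion                               # dedupes them into store/data, hard-links into store/tree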
How to share a "backunion" deduping folder?

Run "backunion" at least once and share the obtained "./union" folder via any suited file protocol (e.g. CIFS). Run "backunion" again to deduplicate the files uploaded by clients. This process can be postponed to low-activity periods.

How to allow client-side deduping?

To allow bandwidth-efficient client-side deduping, share the "./store" folder obtained by a previous invocation of "backunion" via a file protocol that supports hard links (e.g. NFS). Then simply mount the shared folder on each client in place of the default "./store" folder. That's all. Running "backunion" on a client then performs file deduping on the client side and uploads only new files to the store.
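As a sketch (server name and export path are made up for the example), the NFS setup could be:

    # on the server, export the store (e.g. in /etc/exports):
    /srv/backup/store  *(rw)

    # on each client, mount it over the local "store" folder:
    $ cd ~/backup
    $ sudo mount -t nfs server:/srv/backup/store store
    $ ./backunion     # dedupes locally; only new content crosses the network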
SO, HOW DOES BACKUNION WORK?

"backunion" relies on a union mount merging an "in-out" writable folder on top of a "store/tree" read-only folder, and it leverages the "copy-on-write" feature (aka "copy-up") of union mounts. Note that there are several union mount implementations on Linux; at the time of writing, "backunion" relies on unionfs-fuse (hard-coded).

"backunion" eliminates duplicates by (see the sketch after the lists below):
- moving all files from the "in-out" folder to "store/data", renaming them according to their own checksums;
- postponing costly cryptographic checksums as long as possible (à la FSlint);
- hard-linking the renamed files back to their original locations within the "store/tree" folder.

"backunion" is as simple as possible, but not simpler, as it manages:
* file deletion ("whiteouts" translated into link removals)
* garbage collection (unreferenced data removed from the store)
* tricky updates (e.g. when a single file takes the place of a whole folder)
* open files (not deduped)
* concurrent deduping processes/clients, thanks to (fine-enough) locks

"backunion" won't manage:
* two deduping processes within the same client. Should not happen. Period.
* clients conflicting over a file named like a folder.
* a dedup miss caused by a concurrent GC wrongly removing content from the store. Not critical.
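Stripped of locking, whiteout handling, and the checksum-postponing trick, the core loop amounts to something like this (a simplified sketch, not the actual script):

    find in-out -type f | while read -r f; do
        rel=${f#in-out/}                           # path relative to the branch root
        sum=$(md5sum "$f" | cut -d' ' -f1)         # content checksum
        if [ -e "store/data/$sum" ]; then
            rm "$f"                                # duplicate: keep the stored copy
        else
            mv "$f" "store/data/$sum"              # first occurrence: move into the store
        fi
        mkdir -p "store/tree/$(dirname "$rel")"
        ln -f "store/data/$sum" "store/tree/$rel"  # hard-link it back at its place in the tree
    done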
BENEFITS OF THE UNION MOUNT IDEA

The union mount merges the "in-out" and "store/tree" folders together, so the deduping process can proceed in the background while staying transparent to the end user. The union mount performs copy-on-write when needed, so the end user may still modify deduped files without messing up the store (the typical unwanted side effect of hard-link deduping). Moreover, the copy-up behavior separates modified files from the unmodified deduped files in the read-only branch, so invoking the script again only dedupes the touched files.

Please note that directly messing with the branches of a union mount is known to be risky. However, "backunion" does it in such a way that everything stays consistent (but be careful with the race conditions listed hereafter).
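The copy-up behavior can be observed directly (hypothetical file name; run from the backup folder):

    $ stat -c %h store/tree/docs/report.txt   # link count >= 2 once deduped
    $ echo changed >> union/docs/report.txt   # writing through the union triggers a copy-up
    $ ls in-out/docs/                         # a private copy now sits in the writable branch
    report.txt

The other hard-linked instances of the file keep the old content, and the next "backunion" run only has this modified copy left to process.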
PENDING ISSUES AND ROOM FOR FUTURE IMPROVEMENTS

The GC may remove a file just put into the store by a concurrent client. This only leads to a dedup miss; not critical.

Two deduping clients could conflict when one puts a file where the other puts a folder. I'm still thinking about the simplest way to handle this.

Don't launch two processes deduping the same "in-out".