ZIM on IPFS: maximizing deduplication #71
@lidel sorry for not responding in time here. Yes, this is precisely what …
Using ZIM files directly still has the issue that the built-in clustering and compression likely reduce the dedupe power. Anything we can reap from running a rolling hash on an opaque binary is, to some extent, coincidental. Sure, cutting the bandwidth down by half is nice, but we could do better. It would be nice to get some sort of guarantee from Kiwix that they pack entries into clusters in some stable order, such as by article ID. But eh, the folks at libzim don't want to make that possible (openzim/libzim#83). They don't even want to tell us the cluster size for chunk tuning.
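As a rough illustration of why dedupe over compressed clusters is mostly coincidental, the sketch below (invented file names; plain `xz` stands in for ZIM's xz/zstd cluster compression) makes a one-line change near the start of a compressible file and counts how many bytes of the compressed output differ:

```sh
# Sketch: a tiny early change in the input perturbs the compressor state,
# so almost every compressed byte after that point changes too.
seq 1 1000000 > cluster-v1.raw                      # a compressible stand-in payload
sed '5s/^5$/five/' cluster-v1.raw > cluster-v2.raw  # change a single early line
xz -k -6 cluster-v1.raw cluster-v2.raw              # compress both, keep originals

# Count differing bytes between the two compressed files; expect nearly all of them.
cmp -l cluster-v1.raw.xz cluster-v2.raw.xz | wc -l
```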
@Artoria2e5 To phrase it for someone who is not familiar with IPFS: you would like to have a lot of clusters that are binary-identical between ZIM files (or at least between many ZIM versions of the same content over time)?
Exactly. (Rsync wants the same too.) IPFS's tree stores a file as a set of references to chunks that are hashed separately. The chunking can be purely size-based, or it can be based on a rolling hash, like rsync does. With rolling hashes we can make sure additions or insertions don't mess up the chunking of otherwise identical data. If we are to apply a rolling hash to ZIM, the target chunk size should approximately match ZIM's compressed cluster size, so that a change in one cluster only affects around two chunks. The stability requirement is mostly there to limit the number of changes. Having a chunk size different from the cluster size is totally okay: making it larger means some extra data has to be transferred per change, while making it smaller only helps when the data is uncompressed or uses "rsyncable" compression with a smaller block size.
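To make the rolling-hash argument concrete, here is a minimal sketch (invented file names, assuming a local go-ipfs 0.5+ repo) that inserts a few bytes at the front of a file and compares how many blocks the two versions still share under the fixed-size chunker versus buzhash:

```sh
# Sketch: compare block reuse between two near-identical files under two chunkers.
seq 1 2000000 > big-v1.txt
{ echo "a few inserted bytes"; cat big-v1.txt; } > big-v2.txt   # shift everything down

for chunker in size-262144 buzhash; do
  cid1=$(ipfs add -Q --cid-version=1 --chunker="$chunker" big-v1.txt)
  cid2=$(ipfs add -Q --cid-version=1 --chunker="$chunker" big-v2.txt)
  ipfs refs -r -u "$cid1" | sort > v1.refs
  ipfs refs -r -u "$cid2" | sort > v2.refs
  # blocks of v2 that already existed in v1
  echo "$chunker: $(comm -12 v1.refs v2.refs | wc -l) of $(wc -l < v2.refs) blocks shared"
done
```

With the fixed-size chunker the insertion shifts every block boundary, so almost nothing is shared; with buzhash the boundaries resynchronize shortly after the change.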
@Artoria2e5 Thank you for the clarification. To me the discussion about the format, or even just about how we (should) save things in the ZIM files, is a pretty complex one, and in any case an optimisation. Here I would like to do things in the proper order, which is:
Considering that IPFS offers many other advantages over HTTP+zsync, even if the deduplication were a bit lower, this would still be a serious incentive to set up a serious plan to provide all our ZIM files on IPFS and then have a project to provide incremental updates based on it. I have a meeting with @lidel tomorrow and will talk about that again.
I put two versions of zh.wikipedia all maxi onto ipfs as:
The two sizes are 17843989064 and 17633450554 bytes. My shell script thinks 17628509273 bytes got reused, which is 99.97%, a really huge amount. Hashed enwp to …
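For reference, a block-reuse measurement of this kind could look roughly like the sketch below; this is only an illustration with placeholder CIDs, not the script mentioned above:

```sh
# Sketch: estimate how many bytes of the new snapshot's DAG already exist in the old one.
OLD=bafy...old   # root CID of the previous snapshot (placeholder)
NEW=bafy...new   # root CID of the new snapshot (placeholder)

ipfs refs -r -u "$OLD" | sort > old.refs
ipfs refs -r -u "$NEW" | sort > new.refs
comm -12 old.refs new.refs > shared.refs   # blocks present in both DAGs

# Sum the sizes of the shared blocks (one `block stat` per block: slow but simple).
while read -r cid; do
  ipfs block stat "$cid" | awk '/^Size:/ {print $2}'
done < shared.refs | awk '{s+=$1} END {print "reused bytes:", s}'
```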
@Artoria2e5 Thank you for the benchmark, but I don't think I understand it properly. Please excuse my ignorance, but:
OK
Does "99.97% the size of the old file." means they share 99.97% of the chunks?
OK, so roughly how much data needs to be download to get the new file if we have already the old?
You have published it?
Looks like my daemon just crashed :/
Hi folks, seeing how you are trying to move this forward I pushed an incomplete version of 🗡️ that will do what you need (please be kind, the help texts are not entirely finished).
Should get you going. Replace … If you want to import the results into ipfs, you need to: … /cc @lidel
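As a hedged sanity check on the `--ipfs-add-compatible-command` parameters used in the results below (assuming go-ipfs 0.5+ and that the emulation is faithful), the per-file root CIDs should be reproducible with `ipfs add --only-hash`, which computes the DAG without writing anything to the repo:

```sh
# Sketch: hash one of the ZIM files with the same parameters the dagger run emulates.
ipfs add --only-hash -Q --cid-version=1 --chunker=buzhash --hash=blake2b-256 \
  wikipedia_zh_all_maxi_2020-07.zim
```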
@kelson42 I believe … @Artoria2e5 interesting! Got some questions:
```
$ bin/stream-repack-multipart wikipedia_zh_all_maxi_2020-06.zim wikipedia_zh_all_maxi_2020-07.zim | bin/stream-dagger --multipart --ipfs-add-compatible-command="--cid-version=1 --chunker=buzhash --hash=blake2b-256"
{"event": "root", "payload": 17628509273, "stream": 1, "cid":"bafykbzaceb3mzfxjiuizmh7yoz5sfajx7zil6xhck2onrl5sr7aolq65iwmhm", "wiresize": 17633450554 }
{"event": "root", "payload": 17838985507, "stream": 2, "cid":"bafykbzacea5d2c2lxcdbgja2vnswe5aembmwvzqh6an4wzix3wbxkbkkkevdq", "wiresize": 17843989064 }

Ran on 4-core/8-thread Intel(R) Core(TM) i7-4770S CPU @ 3.10GHz
Processing took 129.80 seconds using 0.89 vCPU and 180.59 MiB peak memory
Performing 879,752 system reads using 0.14 vCPU at about 260.60 MiB/s
Ingesting payload of: 35,467,494,780 bytes from 2 substreams
Forming DAG covering: 35,477,439,618 bytes of 190,994 logical nodes
Dataset deduped into: 35,169,131,081 bytes over 188,654 unique leaf nodes
Linked as streams by: 9,944,838 bytes over 1,103 unique DAG-PB nodes
Taking a grand-total: 35,179,075,919 bytes, 99.19% of original, 1.0x smaller
Roots\Counts\Sizes: 3% 10% 25% 50% 95% | Avg
{2} 2 L3: 234 234 | 234
8 L2: 1,144 9,407 9,407 9,407 | 7,388
1,093 L1: 9,058 9,058 9,058 9,058 9,058 | 9,044
188,654 DB: 131,613 132,956 136,219 145,597 400,991 | 186,421
```
Now that I checked my script again, my is-block-seen test is broken. Of course it is! I forgot a bit of … The buzhash + blake2b choice is mainly due to my poor CPU: I want something that is not awful but also not going to freeze an old Mac mini. This thing's VNC can go down from downloading a file too fast…
Context
📦 Each ZIM snapshot is a big file
ZIM is the binary file format in which the Wikipedia snapshots are packaged; they are produced by the Kiwix project and published at https://download.kiwix.org/zim/wikipedia/
⬇️ Import to IPFS can be customized
See relevant parameters (`ipfs add --help`). Low-hanging fruit: `--chunker`.
When we import a file to IPFS it is chunked into pieces, and this chunking algorithm can be customized.
The default one is just a static block of some fixed size (iirc 256KB), but in go-ipfs 0.5+ we support parameterized Rabin and experimental Buzhash chunkers.
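For example (the ZIM file name is a placeholder), the chunker is selected per `ipfs add` invocation:

```sh
ipfs add --chunker=size-262144 wikipedia_xx.zim                  # default: fixed 256 KiB blocks
ipfs add --chunker=rabin-262144-524288-1048576 wikipedia_xx.zim  # rabin with min/avg/max block sizes
ipfs add --chunker=buzhash wikipedia_xx.zim                      # buzhash rolling hash (go-ipfs 0.5+)
```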
💰 Goal: maximizing deduplication with `ipfs add`
The key problem with keeping ZIM files around is their size: Turkish Wikipedia is tens of gigabytes, English Wikipedia is 650 GB. Keeping multiple snapshots of the same language is expensive. Any level of deduplication that IPFS could produce would be highly beneficial.
The Kiwix project considered supporting something like zsync (kiwix/overview#12), however I believe `ipfs add` with a custom rabin or buzhash chunker could produce even better results. We need to do some benchmarks to validate that claim.

Questions
📈 Benchmarking plan (TBD)
TBD, below is a doc with my initial ideas – needs refinement before we invest time, as it will be time-consuming to run it.
I wonder if we could leverage ipfs/DAGger for automating benchmarks – thoughts @ribasushi?
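One possible shape for such an automated benchmark, sketched with placeholder file names and assuming go-ipfs 0.5+ (a DAGger-based variant could presumably compute the same numbers without writing blocks to disk):

```sh
# Sketch: for each chunker, add two consecutive snapshots into a throw-away repo
# and record how much the repo grows when the second one is added. The growth
# roughly tracks what an updating client would have to fetch (it also includes
# some datastore overhead, so treat it as an approximation).
OLD=wikipedia_zh_all_maxi_2020-06.zim
NEW=wikipedia_zh_all_maxi_2020-07.zim

for chunker in size-262144 rabin-262144-524288-1048576 buzhash; do
  export IPFS_PATH=$(mktemp -d)
  ipfs init --profile=test >/dev/null

  ipfs add -Q --cid-version=1 --chunker="$chunker" "$OLD" >/dev/null
  before=$(ipfs repo stat | awk '/RepoSize/ {print $2}')

  ipfs add -Q --cid-version=1 --chunker="$chunker" "$NEW" >/dev/null
  after=$(ipfs repo stat | awk '/RepoSize/ {print $2}')

  echo "$chunker: repo grew by $((after - before)) bytes when adding the new snapshot"
  rm -rf "$IPFS_PATH"
done
```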