Question: Caching of index files corresponding to a local chunk store #5
I think this feature is what you refer to as "sideload" in the code.
Since the chunks have variable size, isn't this only useful if your new version happens to have many chunk boundaries in common?
I'm not sure I fully understand what you mean. The main application for this tool is implementing some kind of A/B-partition-based update scheme. In such a scheme, when currently running from (for example) partition A, you write the new file system image into partition B. The idea is that if partition A is read-only (a common approach in such systems), you can safely index it as a source of chunks. This has a significant hit rate because the image in partition A, being an older version of the same firmware, usually shares many parts with the image you are currently writing into partition B. This is what reduces the amount of data you need to download from the internet, and what casync-nano calls a "local chunk store".

However, to know which chunks are present in partition A, you have to run the algorithm for determining chunk boundaries (plus hashing the chunks) over that whole partition, which is rather computationally expensive. But if partition A has been read-only the whole time, its contents are exactly equivalent to what a previous update wrote into partition A. If, during that update, you kept the corresponding caibx file, you already know where the chunks live and what their hashes are, because that is exactly what a caibx file contains. This is the "sideloading" mechanism.

casync-nano still verifies the hashes of chunks it fetches from a local store that was initialized in such a way, to combat bit-rot. You therefore still pay the computational cost of hashing a chunk retrieved from a local chunk store, but this is much cheaper than the chunking algorithm, and you only pay the hash cost for actual hits on that store.

This is a completely optional optimization. If you don't supply such a sideload file (or casync-nano detects a mismatch with the actual data, see here), it will happily reindex the partition. How, where, and whether to store these caibx files is deliberately not something casync-nano has an opinion on; that is something you have to decide when using it as a building block in some larger OTA system.

Does that answer your question?
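Concretely, the sideload lookup amounts to something like the following Python sketch. This is not casync-nano's actual code: the index representation, the helper names, and the file-like `partition` argument are all simplifications for illustration. The point is that a kept caibx gives you offsets and hashes for free, and every hit is still re-hashed to catch bit-rot:

```python
import hashlib

def build_sideload_index(entries):
    """Hypothetical stand-in for a sideloaded caibx: map each chunk hash to
    its (offset, size) in the read-only partition. The real caibx format is
    binary; only the lookup logic matters here.

    entries: iterable of (offset, size, sha256_hex) from a saved index."""
    return {digest: (offset, size) for offset, size, digest in entries}

def fetch_local_chunk(partition, index, digest):
    """Try to serve a chunk from the local store.

    On a hit, the bytes actually read are re-hashed so that bit-rot in the
    supposedly read-only partition is detected instead of being silently
    copied into the new image. A miss (or a hash mismatch) means the caller
    falls back to the remote chunk store.
    """
    entry = index.get(digest)
    if entry is None:
        return None
    offset, size = entry
    partition.seek(offset)
    data = partition.read(size)
    if hashlib.sha256(data).hexdigest() != digest:
        return None  # stale index or bit-rot: treat as a miss
    return data
```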
Yes, I understand that (I am using RAUC), but my question is a different one. This would make a lot of sense if, for example, you always used fixed-size chunks. But in fact the chunk sizes and offsets are somewhat random, so at the local-store phase you calculate the SHA256 not as in the index that you previously installed, but at the sizes and offsets of the incoming version!

Let's imagine I have version 50 installed in partition A (of 16 MiB), and it was delivered using fixed chunks of 4 KiB, so the index had 4096 SHA256 entries (and offsets). Now I want to update to version 51, and the index I get instead has 2048 entries (each of 8 KiB). I now need to hash partition A according to these 2048 entries so I can compare them! The previous hashes are completely irrelevant.

Casync uses variable-size chunks, so this was of course a simplification. So I wonder if you found that casync would still often choose the same offsets, so that the hashes can be reused.
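To make the mismatch in this scenario concrete, here is a toy sketch in plain Python (nothing casync-specific; the sizes mirror the example above). With fixed-size chunking, changing the chunk size makes every previously recorded hash useless even though the underlying data is byte-for-byte identical:

```python
import hashlib
import os

def fixed_chunks(data, size):
    """SHA256 of every fixed-size chunk, as in the simplified scheme above."""
    return {hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

image = os.urandom(16 * 1024 * 1024)   # stand-in for the 16 MiB partition

v50 = fixed_chunks(image, 4096)        # version 50: 4096 entries of 4 KiB
v51 = fixed_chunks(image, 8192)        # version 51: 2048 entries of 8 KiB

print(len(v50), len(v51), len(v50 & v51))  # -> 4096 2048 0
```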
I'm not sure I understand what you mean. The idea behind casync is that chunks are generated by an algorithm that determines chunk boundaries based on patterns in the data being chunked. This means that a sufficiently long run of equivalent data in different images will produce the same chunks for that run (excluding its beginning and end, which might differ), regardless of differing offsets. See this blog post for an explanation of the idea.

However, casync-nano itself is mostly agnostic to how chunks are cut. The only place where it matters is when you are not sideloading a caibx file for a local store (or bit-rot has been detected and the sideloaded index is discarded). In that case, casync-nano recalculates the chunk boundaries using the exact same algorithm casync uses, taking the algorithm parameters from the new caibx file (i.e. the one that describes the image to be installed). The resulting chunks will only be useful if:
1. the chunking algorithm and its parameters match those that produced the new caibx (so the recalculated boundaries agree with it), and
2. the old and new images actually share sufficiently long runs of identical data.

But this is pretty much a given, since 1 will be the case if you used casync (or a compatible tool) to produce the index in the first place.
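To see that resynchronization in action, here is a toy content-defined chunker in Python. It uses a Rabin-Karp style rolling hash rather than casync's actual buzhash, and the window size, cut condition, and size limits are made-up parameters, so treat it as a sketch of the principle rather than of casync's implementation:

```python
import hashlib
import os

def cdc_chunks(data, window=48, avg_size=4096, min_size=1024, max_size=16384):
    """Toy content-defined chunker: declare a boundary wherever the rolling
    hash of the last `window` bytes hits a fixed pattern, so that boundaries
    depend only on local data, not on absolute offsets in the image."""
    B, M = 257, (1 << 61) - 1        # hash base and a large prime modulus
    Bw = pow(B, window, M)           # B**window, for expiring the oldest byte
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * B + byte) % M
        if i - start >= window:
            h = (h - data[i - window] * Bw) % M   # keep only the last window
        size = i - start + 1
        if (size >= min_size and h % avg_size == avg_size - 1) or size >= max_size:
            chunks.append(hashlib.sha256(data[start:i + 1]).hexdigest())
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(hashlib.sha256(data[start:]).hexdigest())
    return chunks

# The same payload at different offsets, as if a new firmware version had
# shifted data around by an arbitrary amount.
old = os.urandom(256 * 1024)
new = os.urandom(777) + old

old_hashes, new_hashes = set(cdc_chunks(old)), set(cdc_chunks(new))
print(f"{len(old_hashes & new_hashes)}/{len(new_hashes)} chunks shared")
```

Apart from the first chunk or two after the shifted region, the chunker locks back onto the same boundaries, so almost all hashes are shared. This is why reindexing partition A is productive even when nothing lines up at fixed offsets.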
Yes, you now understood my question, and yes, I am wrong :D