Decoding performance regression from 3.6.12 to 3.7.4 (Windows) #729


Open
nazar-pc opened this issue May 6, 2025 · 12 comments · May be fixed by #731


nazar-pc commented May 6, 2025

I just upgraded from 3.6.12 to 3.7.4 (all the versions in between are yanked) and CI on Windows started to take an absurd amount of time.

Here is the commit that upgrades parity-scale-codec and parity-scale-codec-derive: nazar-pc/abundance@9e10736

Look at the root Cargo.toml and Cargo.lock specifically: nothing changed except the version there, and no code changes were made either.

Here is the CI run right before that commit (63.127s total):
https://github.com/nazar-pc/abundance/actions/runs/14859021046/job/41719016206#step:5:314

With that commit (887.445s total):
https://github.com/nazar-pc/abundance/actions/runs/14858533562/job/41718379958?pr=233#step:5:314

The slow part appears to be the decoding of the Segment data structure:
https://github.com/nazar-pc/abundance/blob/d714ed30986ad5dd3c63c47d7cb97eda946076eb/crates/shared/ab-archiving/src/archiver.rs#L42-L191

It is a custom implementation and it is very simple.

I wasn't able to reproduce it locally on x86-64 Linux (both debug and optimized builds), but it happens reproducibly in CI on Windows.


nazar-pc commented May 6, 2025

I tried using an optimized version of parity-scale-codec in tests, but it didn't make a difference (it is supposed to be a negligible component of the total runtime of the test anyway). It must be hitting some really pathological case specifically on Windows. Very strange.


serban300 commented May 6, 2025

How big is the Segment that you are decoding? Does it contain big Vec<u8> fields? I'm wondering if the issue is caused by this:

pub(crate) const MAX_PREALLOCATION: usize = 16 * 1024;

When decoding a vec, we now allocate at most 16 KiB each time we need to increase its size. The default Rust implementation was doubling the capacity each time.

Could you try increasing this from 16 KiB to, for example, 512 KiB or 1024 KiB and see if there is any improvement?
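
For illustration, a minimal sketch of the capped strategy described above (this is not the actual parity-scale-codec code; the function name and the use of std::io::Read are simplified for the example):

    use std::io::Read;

    // Sketch only: grow the buffer by at most MAX_PREALLOCATION bytes per step,
    // so an attacker-controlled length prefix cannot force a huge allocation
    // before any data has actually been read from the input.
    const MAX_PREALLOCATION: usize = 16 * 1024;

    fn decode_bytes_capped(input: &mut impl Read, claimed_len: usize) -> std::io::Result<Vec<u8>> {
        let mut bytes = Vec::new();
        while bytes.len() < claimed_len {
            // Reserve at most MAX_PREALLOCATION additional bytes per iteration.
            let chunk = MAX_PREALLOCATION.min(claimed_len - bytes.len());
            let old_len = bytes.len();
            bytes.resize(old_len + chunk, 0);
            input.read_exact(&mut bytes[old_len..])?;
        }
        Ok(bytes)
    }

With a 128 MiB payload this means roughly 8192 resize steps, which is where the extra reallocation and copying cost comes from.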


nazar-pc commented May 6, 2025

Segment should be exactly 128 MiB in encoded form, and yes, it may include large vectors. Growing by 16 KiB at a time is certainly extremely undesirable, especially when we already know the exact size we'll need upfront.


serban300 commented May 6, 2025

Yes, but it's safer to have a limit. Otherwise we can be tricked into allocating a lot of memory beforehand even if the data is invalid.

Could you try some bigger values for MAX_PREALLOCATION and let us know what the resulting CI run time is, please? I think even a couple of MiB would be acceptable.


nazar-pc commented May 6, 2025

Right. In my case I know the upper bound: the input is 128 MiB, and no vector inside it can individually be larger than that (encoding is not compression).

I can try later. You should also be able to fork it and patch the version yourself; there isn't anything special in CI, it uses the official GitHub runners.


nazar-pc commented May 6, 2025

Changing that constant to 1M helped: https://github.com/nazar-pc/abundance/actions/runs/14863405031/job/41733981682

I think the logic should be the following:

  • initial allocation is encoding_request.min(MAX_PREALLOCATION)
  • for subsequent allocations use encoding_request.min(vec.capacity() * 2) as the target

While allocating a huge amount of memory right away is dangerous, being 2x off from the current value isn't actually that bad (see the sketch below).

Having some way to specify the limit would be nice too (maybe another ::decode_something() method).
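
As a rough sketch of what I mean (names are illustrative; encoded_len stands in for encoding_request above, and this is not actual crate code):

    // Sketch of the proposed growth policy, not actual parity-scale-codec code.
    const MAX_PREALLOCATION: usize = 16 * 1024;

    fn next_capacity(encoded_len: usize, current_capacity: usize) -> usize {
        if current_capacity == 0 {
            // Initial allocation: never trust the claimed length beyond the cap.
            encoded_len.min(MAX_PREALLOCATION)
        } else {
            // Subsequent growth: double, but never past the claimed encoded
            // length, so the final allocation is exact rather than 2x too big.
            encoded_len.min(current_capacity * 2)
        }
    }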


koute commented May 6, 2025

Are you using Windows' default system allocator? If so, can you try switching to a better memory allocator (e.g. jemalloc) and see if that changes anything?
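
For reference, swapping the global allocator is a one-time change per binary or test crate; a sketch assuming the mimalloc crate is added as a dependency (jemalloc via tikv-jemallocator works the same way):

    // One-time setup replacing the default system allocator for the whole
    // binary; assumes the `mimalloc` crate is listed in Cargo.toml.
    use mimalloc::MiMalloc;

    #[global_allocator]
    static GLOBAL: MiMalloc = MiMalloc;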

serban300 commented:

> I think the logic should be the following:
>
>   • initial allocation is `encoding_request.min(MAX_PREALLOCATION)`
>   • for subsequent allocations use `encoding_request.min(vec.capacity() * 2)` as the target
>
> While allocating a huge amount of memory right away is dangerous, being 2x off from the current value isn't actually that bad.

It can be dangerous in some situations. For example:

  • let's say the maximum amount of memory the system has available is 256 MB
  • you want to decode 129 MB
  • while decoding up to 128 MB all is well
  • when starting to decode the last MB you need to take all the system memory => exception

This is a bit of a stretch, but it should still be safer to work with small allocations. I think 1 MB, maybe even 4-8 MB chunks, could be OK. But doubling could mean a lot at some point.


nazar-pc commented May 7, 2025

> Are you using Windows' default system allocator? If so, can you try switching to a better memory allocator (e.g. jemalloc) and see if that changes anything?

I use mimalloc in apps, but these are tests; I'm not going to set a custom allocator in every test file. That said, the performance of the default allocator on Windows is atrocious; neither Linux nor macOS behaves this badly.

> when starting to decode the last MB you need to take all the system memory => exception

This example is doomed from the start. Assuming you actually decode 129 MB, it is likely that you'll end up with a final 129 MB allocation plus another one of 129 MB - 16 kB (assuming the OS is unable to simply extend the existing allocation and creates a separate one, which is probably what Windows is doing here; otherwise I can't imagine what else it is doing that is this expensive).

The only proper solution here is to decode a vector with an explicit upper bound provided by the user. For example, I know I'm decoding from a vector (meaning it is all in RAM already) and I know the max allowed size (it can't be larger even in theory), so I replaced the call to decode_vec_with_len() with this:

        // `length` is already known and bounded by the caller, so allocate the
        // exact buffer upfront and fill it directly from the input.
        let mut bytes = vec![0; length as usize];
        input.read(&mut bytes)?;

It'd be nice to have an official API that does something equivalent.
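
For illustration, the shape such an API could take; the name decode_vec_with_max_len, its signature and the use of std::io::Read are hypothetical, not an existing parity-scale-codec API:

    use std::io::{Error, ErrorKind, Read};

    // Hypothetical helper: decode a byte vector whose claimed length must not
    // exceed a caller-provided upper bound, allocating the exact size upfront
    // since the bound makes that safe.
    fn decode_vec_with_max_len(
        input: &mut impl Read,
        claimed_len: usize,
        max_len: usize,
    ) -> std::io::Result<Vec<u8>> {
        if claimed_len > max_len {
            return Err(Error::new(ErrorKind::InvalidData, "length exceeds caller-provided bound"));
        }
        let mut bytes = vec![0; claimed_len];
        input.read_exact(&mut bytes)?;
        Ok(bytes)
    }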


koute commented May 7, 2025

> This is a bit of a stretch, but it should still be safer to work with small allocations. I think 1 MB, maybe even 4-8 MB chunks, could be OK. But doubling could mean a lot at some point.

Note that depending on the chunk size this might open you up to DoS attacks.

Let's say you start with a 1 MB vector and append to it until it grows to 128 MB. If you double its size each time you run out of space, then you only need to copy 127 MB of data (assuming that each time you increase the size of the vector you can't do it in place and need to copy it).

Now if, instead of doubling it, you grow it only by 1 MB each time, then you'll have to copy 8128 MB of data (if I didn't mess up the calculations), which is ~64 times more.

And if we grow it only by 16 kB at a time, we'll have to copy ~524 GB in the worst case. So @serban300, the current code might actually be broken security-wise if it just always grows by 16 kB.
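
A back-of-the-envelope check of those numbers (using binary units, and assuming every growth step copies the entire existing buffer into a new allocation):

    // Total data copied while growing a buffer to `target` bytes under a given
    // growth policy, assuming every step copies the whole existing buffer.
    fn copied_mib(target: usize, grow: impl Fn(usize) -> usize) -> usize {
        let (mut capacity, mut copied) = (grow(0), 0usize);
        while capacity < target {
            copied += capacity; // old buffer copied into the new allocation
            capacity = grow(capacity).min(target);
        }
        copied / (1024 * 1024)
    }

    fn main() {
        const MIB: usize = 1024 * 1024;
        let target = 128 * MIB;
        // Doubling from 1 MiB: 1 + 2 + ... + 64 = 127 MiB copied in total.
        println!("{}", copied_mib(target, |c| if c == 0 { MIB } else { c * 2 }));
        // Growing by a fixed 1 MiB: 1 + 2 + ... + 127 = 8128 MiB copied.
        println!("{}", copied_mib(target, |c| c + MIB));
        // Growing by 16 KiB at a time: ~524,000 MiB (~0.5 TiB) copied.
        println!("{}", copied_mib(target, |c| c + 16 * 1024));
    }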


In the end we are more or less just patching up the symptoms of the fact that we're using a crappy allocator for runtimes; the proper fix would be to replace it so that overallocating memory is not a problem. Alas, it's not that simple to do.

serban300 commented:

Yes, I agree, this is not ideal at all. It only works OK if we don't have big vecs, which is generally the case for Polkadot.

> This example is doomed from the start. Assuming you actually decode 129 MB, it is likely that you'll end up with a final 129 MB allocation

Actually, sorry, you're right. In my example above, if we do encoding_request.min(vec.capacity() * 2) at each iteration, we won't get an OOM exception. We will end up allocating min(129, 256) = 129 MB at the last iteration, which is OK. The problem is only if we do vec.capacity() * 2 blindly without an upper bound. So doing encoding_request.min(vec.capacity() * 2) at each subsequent iteration sounds good to me.

serban300 linked a pull request (#731) on May 8, 2025 that will close this issue
serban300 commented:

Opened PR #731
