Decoding performance regression from 3.6.12 to 3.7.4 (Windows) #729


Open
nazar-pc opened this issue May 6, 2025 · 12 comments · May be fixed by #731


nazar-pc commented May 6, 2025

I just upgraded from 3.6.12 to 3.7.4 (all the versions in between are yanked) and CI on Windows started to take an absurd amount of time.

Here is the commit that upgrades parity-scale-codec and parity-scale-codec-derive: nazar-pc/abundance@9e10736

Look at the root Cargo.toml and Cargo.lock specifically: nothing changed except the version there, and no code changes were made either.

Here is the CI run right before that commit (63.127s total):
https://github.com/nazar-pc/abundance/actions/runs/14859021046/job/41719016206#step:5:314

With that commit (887.445s total):
https://github.com/nazar-pc/abundance/actions/runs/14858533562/job/41718379958?pr=233#step:5:314

The slow part appears to be the decoding of the Segment data structure:
https://github.com/nazar-pc/abundance/blob/d714ed30986ad5dd3c63c47d7cb97eda946076eb/crates/shared/ab-archiving/src/archiver.rs#L42-L191

It is a custom implementation and it is very simple.

I wasn't able to reproduce it locally on x86-64 Linux (both debug and optimized builds), but it happens reproducibly in CI on Windows.


nazar-pc commented May 6, 2025

I tried using an optimized version of parity-scale-codec in tests, but it didn't make a difference (it is supposed to be a negligible component of the total runtime of the test anyway). It must be hitting some really pathological case specifically on Windows. Very strange.


serban300 commented May 6, 2025

How big is the Segment that you are decoding? Does it contain big Vec<u8> fields? I'm wondering if the issue is caused by this:

pub(crate) const MAX_PREALLOCATION: usize = 16 * 1024;

When decoding a vec, we now allocate at most 16 KiB each time we need to increase its size. The default Rust implementation was doubling the capacity each time.

Could you try increasing this from 16 KiB to, for example, 512 KiB or 1024 KiB and see if there is any improvement?
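
For illustration, a minimal sketch of the capped strategy described above (this is not the actual parity-scale-codec code; the function name and the use of std::io::Read are simplified for the example):

    use std::io::Read;

    // Sketch only: grow the buffer by at most MAX_PREALLOCATION bytes per step,
    // so an attacker-controlled length prefix cannot force a huge allocation
    // before any data has actually been read from the input.
    const MAX_PREALLOCATION: usize = 16 * 1024;

    fn decode_bytes_capped(input: &mut impl Read, claimed_len: usize) -> std::io::Result<Vec<u8>> {
        let mut bytes = Vec::new();
        while bytes.len() < claimed_len {
            // Reserve at most MAX_PREALLOCATION additional bytes per iteration.
            let chunk = MAX_PREALLOCATION.min(claimed_len - bytes.len());
            let old_len = bytes.len();
            bytes.resize(old_len + chunk, 0);
            input.read_exact(&mut bytes[old_len..])?;
        }
        Ok(bytes)
    }

With a 128 MiB payload this means roughly 8192 resize steps, which is where the extra reallocation and copying cost comes from.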


nazar-pc commented May 6, 2025

Segment should be exactly 128 MiB in encoded form, and yes, it may include large vectors. Growing by 16 KiB at a time is certainly extremely undesirable, especially when we already know the exact size we'll need upfront.


serban300 commented May 6, 2025

Yes, but it's safer to have a limit. Otherwise we can be tricked into allocating a lot of memory beforehand even if the data is invalid.

Could you try some bigger values for MAX_PREALLOCATION and let us know what the resulting CI run time is, please? I think even a couple of MiB would be acceptable.


nazar-pc commented May 6, 2025

Right. In my case I know the upper bound: the input is 128 MiB, and no vector inside it can individually be larger than that (encoding is not compression).

I can try later. You should also be able to fork it and patch the version yourself; there isn't anything special in CI, it uses the official GitHub runners.


nazar-pc commented May 6, 2025

Changing that constant to 1M helped: https://github.com/nazar-pc/abundance/actions/runs/14863405031/job/41733981682

I think the logic should be the following:

  • initial allocation is encoding_request.min(MAX_PREALLOCATION)
  • for subsequent allocations use encoding_request.min(vec.capacity() * 2) as the target

While allocating a huge amount of memory right away is dangerous, being 2x off from the current value isn't actually that bad (see the sketch below).

Having some way to specify the limit would be nice too (maybe another ::decode_something() method).
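
As a rough sketch of what I mean (names are illustrative; encoded_len stands in for encoding_request above, and this is not actual crate code):

    // Sketch of the proposed growth policy, not actual parity-scale-codec code.
    const MAX_PREALLOCATION: usize = 16 * 1024;

    fn next_capacity(encoded_len: usize, current_capacity: usize) -> usize {
        if current_capacity == 0 {
            // Initial allocation: never trust the claimed length beyond the cap.
            encoded_len.min(MAX_PREALLOCATION)
        } else {
            // Subsequent growth: double, but never past the claimed encoded
            // length, so the final allocation is exact rather than 2x too big.
            encoded_len.min(current_capacity * 2)
        }
    }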


koute commented May 6, 2025

Are you using Windows' default system allocator? If so, can you try switching to a better memory allocator (e.g. jemalloc) and see if that changes anything?
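
For reference, swapping the global allocator is a one-time change per binary or test crate; a sketch assuming the mimalloc crate is added as a dependency (jemalloc via tikv-jemallocator works the same way):

    // One-time setup replacing the default system allocator for the whole
    // binary; assumes the `mimalloc` crate is listed in Cargo.toml.
    use mimalloc::MiMalloc;

    #[global_allocator]
    static GLOBAL: MiMalloc = MiMalloc;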

serban300 commented:

> I think the logic should be the following:
>
>   • initial allocation is `encoding_request.min(MAX_PREALLOCATION)`
>   • for subsequent allocations use `encoding_request.min(vec.capacity() * 2)` as the target
>
> While allocating a huge amount of memory right away is dangerous, being 2x off from the current value isn't actually that bad.

It can be dangerous in some situations. For example:

  • let's say the maximum amount of memory the system has available is 256 MB
  • you want to decode 129 MB
  • while decoding up to 128 MB all is well
  • when starting to decode the last MB you need to take all the system memory => exception

This is a bit of a stretch, but it should still be safer to work with small allocations. I think 1 MB, maybe even 4-8 MB chunks, could be OK. But doubling could mean a lot at some point.


nazar-pc commented May 7, 2025

> Are you using Windows' default system allocator? If so, can you try switching to a better memory allocator (e.g. jemalloc) and see if that changes anything?

I use mimalloc in apps, but these are tests; I'm not going to set a custom allocator in every test file. That said, the performance of the default allocator on Windows is atrocious; neither Linux nor macOS behaves this badly.

> when starting to decode the last MB you need to take all the system memory => exception

This example is doomed from the start. Assuming you actually decode 129 MB, it is likely that you'll end up with a final 129 MB allocation plus another one of 129 MB - 16 kB (assuming the OS is unable to simply extend the existing allocation and creates a separate one, which is probably what Windows is doing here; otherwise I can't imagine what else it is doing that is this expensive).

The only proper solution here is to decode a vector with an explicit upper bound provided by the user. For example, I know I'm decoding from a vector (meaning it is all in RAM already) and I know the max allowed size (it can't be larger even in theory), so I replaced the call to decode_vec_with_len() with this:

        // `length` is already known and bounded by the caller, so allocate the
        // exact buffer upfront and fill it directly from the input.
        let mut bytes = vec![0; length as usize];
        input.read(&mut bytes)?;

It'd be nice to have an official API that does something equivalent.
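
For illustration, the shape such an API could take; the name decode_vec_with_max_len, its signature and the use of std::io::Read are hypothetical, not an existing parity-scale-codec API:

    use std::io::{Error, ErrorKind, Read};

    // Hypothetical helper: decode a byte vector whose claimed length must not
    // exceed a caller-provided upper bound, allocating the exact size upfront
    // since the bound makes that safe.
    fn decode_vec_with_max_len(
        input: &mut impl Read,
        claimed_len: usize,
        max_len: usize,
    ) -> std::io::Result<Vec<u8>> {
        if claimed_len > max_len {
            return Err(Error::new(ErrorKind::InvalidData, "length exceeds caller-provided bound"));
        }
        let mut bytes = vec![0; claimed_len];
        input.read_exact(&mut bytes)?;
        Ok(bytes)
    }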


koute commented May 7, 2025

> This is a bit of a stretch, but it should still be safer to work with small allocations. I think 1 MB, maybe even 4-8 MB chunks, could be OK. But doubling could mean a lot at some point.

Note that depending on the chunk size this might open you up to DoS attacks.

Let's say you start with a 1 MB vector and append to it until it grows to 128 MB. If you double its size each time you run out of space, then you only need to copy 127 MB of data (assuming that each time you increase the size of the vector you can't do it in place and need to copy it).

Now if, instead of doubling it, you grow it only by 1 MB each time, then you'll have to copy 8128 MB of data (if I didn't mess up the calculations), which is ~64 times more.

And if we grow it only by 16 kB at a time, we'll have to copy ~524 GB in the worst case. So @serban300, the current code might actually be broken security-wise if it just always grows by 16 kB.
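
A back-of-the-envelope check of those numbers (using binary units, and assuming every growth step copies the entire existing buffer into a new allocation):

    // Total data copied while growing a buffer to `target` bytes under a given
    // growth policy, assuming every step copies the whole existing buffer.
    fn copied_mib(target: usize, grow: impl Fn(usize) -> usize) -> usize {
        let (mut capacity, mut copied) = (grow(0), 0usize);
        while capacity < target {
            copied += capacity; // old buffer copied into the new allocation
            capacity = grow(capacity).min(target);
        }
        copied / (1024 * 1024)
    }

    fn main() {
        const MIB: usize = 1024 * 1024;
        let target = 128 * MIB;
        // Doubling from 1 MiB: 1 + 2 + ... + 64 = 127 MiB copied in total.
        println!("{}", copied_mib(target, |c| if c == 0 { MIB } else { c * 2 }));
        // Growing by a fixed 1 MiB: 1 + 2 + ... + 127 = 8128 MiB copied.
        println!("{}", copied_mib(target, |c| c + MIB));
        // Growing by 16 KiB at a time: ~524,000 MiB (~0.5 TiB) copied.
        println!("{}", copied_mib(target, |c| c + 16 * 1024));
    }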


In the end we are more or less just patching up the symptoms of the fact that we're using a crappy allocator for runtimes; the proper fix would be to replace it so that overallocating memory is not a problem. Alas, it's not that simple to do.

serban300 commented:

Yes, I agree, this is not ideal at all. It only works OK if we don't have big vecs, which is generally the case for Polkadot.

> This example is doomed from the start. Assuming you actually decode 129 MB, it is likely that you'll end up with a final 129 MB allocation

Actually, sorry, you're right. In my example above, if we do encoding_request.min(vec.capacity() * 2) at each iteration, we won't get an OOM exception. We will end up allocating min(129, 256) = 129 MB at the last iteration, which is OK. The problem is only if we do vec.capacity() * 2 blindly without an upper bound. So doing encoding_request.min(vec.capacity() * 2) at each subsequent iteration sounds good to me.

serban300 linked a pull request (#731) on May 8, 2025 that will close this issue
serban300 commented:

Opened PR #731
