Decoding performance regression from 3.6.12 to 3.7.4 (Windows) #729
I just upgraded from 3.6.12 to 3.7.4 (all the versions in between are yanked) and CI on Windows started to take an absurd amount of time.

Here is the commit that upgrades parity-scale-codec and parity-scale-codec-derive: nazar-pc/abundance@9e10736

Look at the root Cargo.toml and Cargo.lock specifically; nothing changed except the version there, and no code changes were made either.

A CI run right before that commit (63.127s total):
https://github.com/nazar-pc/abundance/actions/runs/14859021046/job/41719016206#step:5:314

With that commit (887.445s total):
https://github.com/nazar-pc/abundance/actions/runs/14858533562/job/41718379958?pr=233#step:5:314

The slow part appears to be the decoding of the Segment data structure: https://github.com/nazar-pc/abundance/blob/d714ed30986ad5dd3c63c47d7cb97eda946076eb/crates/shared/ab-archiving/src/archiver.rs#L42-L191

It is a custom implementation and it is very simple. I wasn't able to reproduce it locally on x86-64 Linux (neither in a debug build nor optimized), but it happens reproducibly in CI on Windows.
Comments
Tried using an optimized version of … How big is the …?

parity-scale-codec/src/codec.rs, line 51 (at 0a0295a)

When decoding a …, could you try increasing this from 16 KiB to 512 KiB or 1024 KiB, for example, and see if there is any improvement?
Yes, but it's safer to have a limit. Otherwise we can be tricked into allocating a lot of memory up front even if the data is invalid. Could you try some bigger values for the constant?
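To make the concern concrete, here is a hypothetical sketch (not parity-scale-codec's actual code, and the function name is made up) of how a decoder that blindly trusts a length prefix can be tricked into a huge allocation by a few bytes of invalid input:

```rust
// Hypothetical naive decoder: trusts the length prefix before seeing the payload.
fn decode_vec_naive(input: &[u8]) -> Option<Vec<u8>> {
    let len = u32::from_le_bytes(input.get(..4)?.try_into().ok()?) as usize;
    // A 4-byte prefix claiming e.g. 1 GiB forces the allocation here...
    let mut out = Vec::with_capacity(len);
    // ...and only here do we notice that the payload isn't actually there.
    out.extend_from_slice(input.get(4..4 + len)?);
    Some(out)
}
```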
Right. In my case I know the upper bound: the input is 128 MiB, and none of the vectors can individually be larger than that (encoding is not compression). I can try later. You should be able to fork it and patch the version too; there isn't anything special in CI, it uses official GitHub runners.
Changing that constant to 1M helped: https://github.com/nazar-pc/abundance/actions/runs/14863405031/job/41733981682

I think the logic should be the following: while allocating a huge amount of memory right away is dangerous, being 2x off from the current value isn't actually that bad. Having some way to specify the limit would be nice too (maybe another …).
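For illustration, a minimal sketch of roughly that strategy; the 1 MiB cap and the function are assumptions for the example, not the crate's actual implementation:

```rust
use std::io::Read;

// Assumed cap for this sketch, matching the 1M experiment above.
const MAX_PREALLOCATION: usize = 1024 * 1024;

// Preallocate at most the cap, then rely on Vec's geometric growth
// (roughly doubling) while reading fixed-size chunks.
fn read_vec(input: &mut impl Read, claimed_len: usize) -> std::io::Result<Vec<u8>> {
    let mut out = Vec::with_capacity(claimed_len.min(MAX_PREALLOCATION));
    let mut chunk = [0u8; 16 * 1024];
    let mut remaining = claimed_len;
    while remaining > 0 {
        let n = chunk.len().min(remaining);
        input.read_exact(&mut chunk[..n])?;
        out.extend_from_slice(&chunk[..n]);
        remaining -= n;
    }
    Ok(out)
}
```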
Are you using Windows' default system allocator? If so, can you try switching to a better memory allocator (e.g. jemalloc) and see if that changes anything?
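For reference, swapping the global allocator in Rust is a one-liner, assuming the mimalloc crate (or tikv-jemallocator for jemalloc) is added as a dependency:

```rust
use mimalloc::MiMalloc;

// Every heap allocation in this binary (including tests linked into it)
// now goes through mimalloc instead of the system allocator.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    let v: Vec<u8> = Vec::with_capacity(128 * 1024 * 1024);
    drop(v);
}
```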
It can be dangerous in some situations. For example: …

This is a bit of a stretch, but it should anyway be safer to work with small allocations. I think 1 MB, maybe even 4-8 MB, chunks could be OK. But doubling could mean a lot of extra memory at some point.
I use mimalloc in apps, but these are tests; I'm not going to set a custom allocator in every test file. That said, the performance of the default allocator on Windows is atrocious; neither Linux nor macOS behaves this badly.
This example is doomed from the start. Assuming you actually decode 129 MB, it is likely that you'll end up with the final 129 MB allocation plus a previous one of 129 MB - 16 kB (assuming the OS is unable to simply extend the existing allocation and creates a separate one, which is probably what Windows is doing here; otherwise I can't imagine what else it is doing that is this expensive). The only proper solution here is to decode a vector with an explicit upper bound provided by the user. For example, I know I'm decoding from a vector (meaning it is all in RAM already) and I know the max allowed size (it can't be larger even in theory), so I replaced the decode call with:

```rust
let mut bytes = vec![0; length as usize];
input.read(&mut bytes)?;
```

It'd be nice to have an official API that does something equivalent.
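One hypothetical shape such an API could take (the function is made up for illustration; only Compact, Decode, Input, and Error are real parity-scale-codec items):

```rust
use parity_scale_codec::{Compact, Decode, Error, Input};

// Decode a length-prefixed Vec<u8>, but let the caller bound the length,
// so preallocating the full size up front is safe.
fn decode_vec_with_max_len<I: Input>(input: &mut I, max_len: usize) -> Result<Vec<u8>, Error> {
    let len = Compact::<u32>::decode(input)?.0 as usize;
    if len > max_len {
        return Err("length exceeds caller-provided upper bound".into());
    }
    let mut bytes = vec![0u8; len];
    input.read(&mut bytes)?;
    Ok(bytes)
}
```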
Note that depending on the chunk size this might open you up to DoS attacks. Let's say you start with a 1 MB vector and append to it until it grows to 128 MB. If you double its size each time you run out of space, then you only need to copy 127 MB of data (assuming that each time you increase the size of the vector you can't do it in place and need to copy it). If instead of doubling you grow it by only 1 MB each time, then you'll have to copy 8128 MB of data (if I didn't mess up the calculations), which is ~64 times more. And if we grow it by only 16 kB at a time, we'll have to copy ~524 GB in the worst case. So @serban300 the current code might actually be broken security-wise if it just always grows by 16 kB.

In the end we are more or less just patching up the symptoms of the fact that we're using a crappy allocator for runtimes; the proper fix would be to replace it so that overallocating memory is not a problem. Alas, it's not that simple to do.
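A quick back-of-the-envelope check of those numbers; this standalone sketch assumes the worst case where every growth step reallocates and copies the whole buffer:

```rust
// Worst-case bytes copied while growing a buffer from `start` to `target`,
// assuming every growth step reallocates and copies the old contents.
fn copied(start: u64, target: u64, grow: impl Fn(u64) -> u64) -> u64 {
    let (mut size, mut total) = (start, 0);
    while size < target {
        total += size; // old contents copied on reallocation
        size = grow(size).min(target);
    }
    total
}

fn main() {
    const MIB: u64 = 1024 * 1024;
    let target = 128 * MIB;
    // Doubling from 1 MiB: prints 127 (MiB copied in total).
    println!("{} MiB", copied(MIB, target, |s| s * 2) / MIB);
    // Fixed 1 MiB steps: prints 8128 (MiB copied, ~64x more).
    println!("{} MiB", copied(MIB, target, |s| s + MIB) / MIB);
    // Fixed 16 kiB steps: prints ~511 (GiB copied, roughly the ~524 GB above).
    println!("{} GiB", copied(16 * 1024, target, |s| s + 16 * 1024) / (1024 * MIB));
}
```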
Yes, I agree, this is not ideal at all. It only works OK if we don't have big …
Actually, sorry, you're right. In my example above, if we do …
Opened PR #731