Add optional features that can potentially improve performance #7
Conversation
This is really cool work! Let me work my way through the diff and comment as I go, and then I'll leave some concluding remarks at the end!
Okay, so overall I'd say:
Running some new benchmarks now. I've limited the affinity to 8 real CPU cores (no SMT), and set it to high priority to hopefully reduce some noise.
I think what you want to do is something similar to the script in
In that case, I'll bump it to 12 real cores and go overboard on the granularity of the benchmarks. This may take a while, but we should get some pretty graphs!

I've realized by now that these benchmarks are really just for people glancing at the library. Now that I understand how it works, the reader benchmarks are basically just benchmarking the hashmap implementation, plus a few atomic loads.

Speaking of... I just noticed an issue that I will fix after benchmarking. In the reader path we have:

```rust
let rs = unsafe { Box::from_raw(r_handle) };
res = Some(f(&*rs));
mem::forget(rs); // don't free the Box!
```

However, if `f` panics, `mem::forget` is never reached, and unwinding drops the `Box`, freeing memory this code doesn't own.
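The hazard can be demonstrated with a small, self-contained sketch. The `Inner` type and the handle setup below are hypothetical stand-ins for the real reader-side state, purely for illustration:

```rust
use std::mem;

// Hypothetical stand-in for the state behind the reader's raw pointer.
struct Inner(u32);

/// The problematic pattern: temporarily re-Box the raw pointer, read
/// through it, then forget the Box so ownership stays with the real owner.
/// If the read (the `f(&*rs)` call in the real code) panicked before
/// `mem::forget`, unwinding would drop `rs` and free memory we don't own.
fn read_via_box(r_handle: *mut Inner) -> u32 {
    let rs = unsafe { Box::from_raw(r_handle) };
    let v = rs.0; // stand-in for `f(&*rs)`
    mem::forget(rs); // don't free the Box!
    v
}

/// The panic-safe alternative: never construct a Box at all, just
/// reborrow the pointer for the duration of the read.
fn read_via_ref(r_handle: *mut Inner) -> u32 {
    unsafe { (*r_handle).0 } // stand-in for `f(&*r_handle)`
}

fn main() {
    // This demo owns its allocation, unlike the real call site.
    let r_handle = Box::into_raw(Box::new(Inner(7)));
    assert_eq!(read_via_box(r_handle), 7);
    assert_eq!(read_via_ref(r_handle), 7);
    // Reclaim the allocation the demo actually owns.
    unsafe { drop(Box::from_raw(r_handle)); }
    println!("ok");
}
```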
Yup, the benchmarker is essentially trying to generate as much contention as it can, because that's where we want to understand the behavior characteristics. If there's no contention, the performance is mostly just the performance of the underlying hashmap.

As for the snippet you gave, I think the real solution is to make that code simply:

```rust
res = Some(f(unsafe { &*r_handle }));
```

That way there's no issue with panics. I believe that should work just fine.
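For context, a contention-heavy read microbenchmark of the kind described here can be sketched as follows. This is a minimal stand-in, not the actual evmap benchmark harness; it hammers a plain `Mutex<HashMap>` from several threads purely to illustrate the shape of such a workload:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Instant;

/// Spawn `threads` readers that all hammer one shared map, returning the
/// sum of everything they read. Under contention, lock hand-off (or, for
/// evmap, epoch bookkeeping) dominates; without it, this mostly measures
/// the underlying hashmap.
fn contended_reads(threads: usize, ops_per_thread: usize) -> u64 {
    let map: HashMap<u64, u64> = (0..1024).map(|i| (i, i * 2)).collect();
    let shared = Arc::new(Mutex::new(map));
    let mut handles = Vec::new();
    for t in 0..threads {
        let m = Arc::clone(&shared);
        handles.push(thread::spawn(move || {
            let mut sum = 0u64;
            for i in 0..ops_per_thread {
                // Spread keys so threads don't all hit the same bucket.
                let key = ((t * 31 + i) % 1024) as u64;
                sum += *m.lock().unwrap().get(&key).unwrap();
            }
            sum
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let start = Instant::now();
    let total = contended_reads(4, 10_000);
    println!("sum={} elapsed={:?}", total, start.elapsed());
    assert!(total > 0);
}
```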
I've been experiencing some odd results that don't reflect what I saw when first adding these features, to the point that I'm beginning to think I'm running into memory bandwidth limitations on my system.

When I first benchmarked these, they ran at normal priority (or dynamically lower), and I had Netflix on in the background. Typical stuff. It was a not-insignificant load, but it was consistent. I tried not to do anything drastic, but I didn't want to just stare at a blank screen for 20 minutes. That's when I observed the speedups. However, when I increased the process priority and basically dedicated my system to the benchmarks, the results across all new features were equal: no change.

We already agreed that the simple reader tests are basically just benchmarking the hashmap implementations, and I'm not even seeing the boosts that should be expected of hashbrown, given that hashbrown's benchmarks destroy std's. So, what I think is going on is specific to me. Bad luck or bad planning, I suppose.

As stated previously, I have a Threadripper 1950X, and I also have 128 GB of RAM in a quad-channel config. Unfortunately for first-gen Threadripper, quad-channel configs effectively halve the memory speed under certain workloads. Randomly accessing super-optimized hashmaps at very high OS priorities appears to hit that limit.

What must have happened in the casual-use benchmarks is that the new code really is faster, and the OS was distributing the work in such a way that the presumed lower per-operation latency of the new features gave them an edge, even while still hitting that limit. The slower original config, meanwhile, had to share the system with the background tasks and never got a chance to hit the memory bandwidth limit.

This has certainly been an interesting adventure for so few lines of code changed. What are your thoughts?
Yeah, I've observed similar behavior in the past too. It's likely in part because the benchmark is so simple. One good strategy might be to run evmap through YCSB or something else that's more realistic, and then measure the difference. There's also this page that measures hash tables directly, which we may be able to adapt to a concurrent workload. I also have a larger integration benchmark I could run with noria (cc @ms705), and that might help shine some light on the improvements.

Overall, I'd be inclined to merge this regardless; the additions are both sensible and optional. I do think, as discussed in one of the comments, that both features should be on by default though. I'd also appreciate a fix to the issue discussed above.
Perhaps we could use a slow CPU with fast memory for a fair benchmark? Or I may be able to underclock my CPU to compensate. 32 hardware threads at around 1 GHz (vs. 3.71 GHz now) would probably give enough contention. Anyway, I'll go ahead and make those optional features default, and then it should be ready to merge.
Everything should be ready.
I don't see whatever last pending review there is, but I think I covered everything.
Awesome, thank you so much! And thanks for your patience in landing the PR too :D I'm excited to try this out in noria and see if we notice any speedup!
Oh... you already released it? I still need to fix those. I'll do them tonight anyway, and they can be added in a patch version.
This started out as an experiment to see if I could make `evmap` work better with the libraries I'm already using in my projects, but there really isn't a reason not to provide them as features.

The big ones:

- `hashbrown`: Fast and space-efficient hash table replacement.
- `parking_lot`: Fast and space-efficient synchronization primitives.

These two are obvious, as they provide generally better and potentially safer functionality (`parking_lot` doesn't poison threads).

Additionally:

- `smallvec`: Mixed stack/inline/heap-allocated vector.

Although the multi-value map is useful for multiple insertions, if a map never uses that and only ever replaces a single value with `update`, we can cheat by effectively using `[T; 1]` instead, which saves us an extra level of indirection. `smallvec` does grow and allocate if necessary.

Combined, this seems to net about a 50% performance increase on my system, but I need to do a clean run of benchmarks before this is merged. It's certainly better. Plus, `hashbrown` is supposed to be more space efficient.

I switched to log10 because the others were getting lost. Also, I have a 16C/32T CPU, so I bumped up the reader counts to compensate. Although for new benchmarks I may try to change the CPU affinity.
Write speeds are about the same.
What are your thoughts?
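The single-value trick described above can be sketched in plain std Rust. This is a hypothetical, simplified illustration of the idea behind `SmallVec<[T; 1]>`, not smallvec's actual implementation: keep one value inline with no heap allocation, and spill to a `Vec` only when a second value is pushed.

```rust
/// One value stored inline; spills to a heap Vec only when needed.
enum OneOrMany<T> {
    Empty,
    One(T),       // inline: no extra indirection for the common case
    Many(Vec<T>), // spilled: behaves like a normal growable vector
}

impl<T> OneOrMany<T> {
    fn new() -> Self {
        OneOrMany::Empty
    }

    fn push(&mut self, v: T) {
        // Temporarily swap self out so we can move the old contents.
        let old = std::mem::replace(self, OneOrMany::Empty);
        *self = match old {
            OneOrMany::Empty => OneOrMany::One(v),
            OneOrMany::One(first) => OneOrMany::Many(vec![first, v]),
            OneOrMany::Many(mut vs) => {
                vs.push(v);
                OneOrMany::Many(vs)
            }
        };
    }

    fn len(&self) -> usize {
        match self {
            OneOrMany::Empty => 0,
            OneOrMany::One(_) => 1,
            OneOrMany::Many(vs) => vs.len(),
        }
    }
}

fn main() {
    let mut m = OneOrMany::new();
    m.push(10);
    assert_eq!(m.len(), 1); // single-value case stays inline
    m.push(20);
    assert_eq!(m.len(), 2); // grows to a Vec only when needed
    println!("ok");
}
```

A map that only ever calls `update` stays in the `One` variant forever, which is the level-of-indirection saving the description refers to.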