ACP: efficient runtime checking of multiple target features #585
Comments
Am I missing something? Your example alternative code never checks for |
ah right (I've been running a bunch of benchmarks with different configurations). Fixed now, thanks! |
Prior relevant discussion: https://internals.rust-lang.org/t/better-codegen-for-cpu-feature-detection/22083 |
even that code is inefficient, it could just be:
    and eax, 2
    shr eax, 1
or if it was immediately used for branching, just:
    test eax, 2
    jnz has_feature
I know in vtables we add some special LLVM things to tell it that re-reading will always give the same value, even if there's other stuff between. Obviously we can't do that for the lazy-init part, but maybe in the normal case after that there'd be a way?
Does it have to be a bitmap? Could we return some kind of x86-specific library type with (Or even let that type exist on all platforms, just trivially returns false for everything.) |
Sure, it doesn't have to be literally a bitmap; there is a lot of freedom in exactly how to implement it. One tricky thing is that the features are stored as 3 atomics, so depending on what features you ask for, one load might have all the bits you need, or you might need all 3 loads. So we don't want to repeat work when 2 features are stored in the same atomic value, but also don't want to pessimistically load all three values.
static CACHE: [Cache; 3] = [
    Cache::uninitialized(),
    Cache::uninitialized(),
    Cache::uninitialized(),
];
struct Cache(AtomicUsize);
Also looking at this now, that static might benefit from |
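To make the "one load or three" point concrete, here is a minimal sketch that groups the requested features into one mask per cache word and only loads the words whose mask is non-zero. It is not the std_detect implementation: `with_bits`, `masks_for`, `all_detected`, the feature indices, and the assumption that each word holds `usize::BITS` feature bits are all hypothetical, and the cache is pre-filled instead of lazily initialized.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct Cache(AtomicUsize);

impl Cache {
    // Hypothetical constructor: pre-fills the word as if detection already ran.
    const fn with_bits(bits: usize) -> Self {
        Cache(AtomicUsize::new(bits))
    }
}

// Assumed contents: bits 0 and 1 of word 0 are set, bit 0 of word 2 is set.
static CACHE: [Cache; 3] = [
    Cache::with_bits(0b11),
    Cache::with_bits(0),
    Cache::with_bits(0b1),
];

// Assumption: every bit of a word is a feature bit.
const BITS_PER_WORD: u32 = usize::BITS;

// Hypothetical global feature indices.
const AVX2: u32 = 0;
const BMI2: u32 = 1;
const MID: u32 = usize::BITS; // lives in word 1, not detected here
const FAR: u32 = 2 * usize::BITS; // lives in word 2, detected here

/// Folds the requested feature indices into one mask per cache word.
const fn masks_for(features: &[u32]) -> [usize; 3] {
    let mut masks = [0usize; 3];
    let mut i = 0;
    while i < features.len() {
        let f = features[i];
        masks[(f / BITS_PER_WORD) as usize] |= 1usize << (f % BITS_PER_WORD);
        i += 1;
    }
    masks
}

/// Returns true if every requested feature bit is set.
fn all_detected(masks: [usize; 3]) -> bool {
    masks.iter().zip(CACHE.iter()).all(|(&mask, cache)| {
        // Skip the load when no requested feature lives in this word;
        // otherwise do a single load and test all of its bits at once.
        mask == 0 || (cache.0.load(Ordering::Relaxed) & mask) == mask
    })
}
```

With this shape, `all_detected(masks_for(&[AVX2, BMI2]))` touches only the first word, while a request that spans words loads each needed word once. Whether the optimizer removes the skipped loads when the masks are constants would need to be checked on real output.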
Note that this has some interaction with the accepted RFC for splitting these macros into |
Proposal
Problem statement
Currently, checking whether two target features are enabled is inefficient. In zlib-rs we see a 3% slowdown in one test case from checking for an additional target feature.
Performing a runtime check for 2 target features requires roughly double the number of instructions versus checking for just one feature.
Motivating examples or use cases
In zlib-rs, we want to check for both the avx2 and bmi2 features, but that check is slower than just checking for avx2.
Looking at just the happy path (where the features are already cached and both are available):
https://godbolt.org/z/f935sP6dr
So checking for 2 features roughly doubles the number of instructions, and performs 2 (atomic) loads.
This all makes sense, given that the cache is stored in an atomic, so the read value cannot be reused, and the expansion looks like this:
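A simplified model of that happy path, written as plain Rust rather than the real std_detect internals, might look like the following. The names `CACHE`, `AVX2_BIT`, `BMI2_BIT`, and `detected` are hypothetical, and the bit positions are assumptions chosen for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical bit positions; the real cache layout may differ.
const AVX2_BIT: u32 = 0;
const BMI2_BIT: u32 = 1;

// Stand-in for the feature cache, pre-filled as if detection had already
// run and found both features.
static CACHE: AtomicUsize = AtomicUsize::new((1 << AVX2_BIT) | (1 << BMI2_BIT));

// Models a single feature check on the happy path:
// one atomic load followed by one bit test.
#[inline]
fn detected(bit: u32) -> bool {
    (CACHE.load(Ordering::Relaxed) & (1 << bit)) != 0
}

// Models checking two features with two separate macro calls joined by `&&`.
// Each call performs its own load, so the cached value is read twice.
pub fn has_avx2_and_bmi2() -> bool {
    detected(AVX2_BIT) && detected(BMI2_BIT)
}
```

Each call to `detected` is independent, which matches the observation above that two checks mean two atomic loads and about twice the instructions.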
Solution sketch
I'd like the macro to expand to something like this instead, where __is_feature_detected() returns a bitmap of enabled features:
For that to work, a single call to an is_*_feature_detected macro must be able to accept multiple target features. I can see two ways to do that:
1. is_x86_feature_detected("avx2", "bmi")
2. is_x86_feature_detected("avx2,bmi")
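As a rough illustration of the comma-separated-arguments form, the sketch below uses `macro_rules!` to fold several feature names into one constant mask and then performs a single load. Everything here is hypothetical: `feature_bitmap` stands in for `__is_feature_detected()`, `feature_mask!` and `all_features_detected!` are invented names, the bit assignments are made up, and the cache is a single pre-filled word, not the real lazily initialized one.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in cache: bits 0 (avx2) and 1 (bmi2) set, bit 2 (avx512f) clear.
static CACHE: AtomicUsize = AtomicUsize::new(0b011);

// Hypothetical stand-in for `__is_feature_detected()`:
// one atomic load that returns the bitmap of enabled features.
#[inline]
fn feature_bitmap() -> usize {
    CACHE.load(Ordering::Relaxed)
}

// Maps a feature name to its (made-up) bit in the bitmap.
macro_rules! feature_mask {
    ("avx2") => { 1usize << 0 };
    ("bmi2") => { 1usize << 1 };
    ("avx512f") => { 1usize << 2 };
}

// Accepts one or more feature names, ORs their masks together at compile
// time, and checks them all against a single load of the bitmap.
macro_rules! all_features_detected {
    ($($feature:tt),+ $(,)?) => {{
        const MASK: usize = 0 $(| feature_mask!($feature))+;
        (feature_bitmap() & MASK) == MASK
    }};
}

pub fn has_avx2_and_bmi2() -> bool {
    all_features_detected!("avx2", "bmi2")
}
```

The names are captured as `tt` fragments so that they can still be matched as literal tokens when forwarded to `feature_mask!`. An unknown feature name fails to compile because no arm matches it.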
Option 2 has precedent in e.g. #[target_feature(enable = "avx2,bmi2")], but option 1 can (I believe) be implemented with macro_rules! and also works better with e.g. #[cfg(...)]. I personally prefer option 1.
Alternatives
There is a workaround:
Links and related work
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
Second, if there's a concrete solution: