-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redo the swap code for better tail & padding handling #134954
base: master
Are you sure you want to change the base?
Conversation
This comment has been minimized.
This comment has been minimized.
468babb
to
932f981
Compare
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
Redo the swap code for better tail & padding handling A couple of parts here ## Fixes rust-lang#134713 Actually using the type passed to `ptr::swap_nonoverlapping` for anything other than its size + align turns out to not work, so this redo goes back to *always* swapping via not-`!noundef` integers. (Except in `const`, which keeps doing the same thing as before to preserve `@RalfJung's` fix from rust-lang#134689) ## Fixes rust-lang#134946 I'd previously moved the swapping to use auto-vectorization, but someone pointed out on Discord that the tail loop handling from that left a whole bunch of byte-by-byte swapping around. This PR goes back to a manual chunk, with at most logarithmic more instructions for the tail. (There are other ways that could potentially handle the tail even better, but this seems to be pretty good, since it's how LLVM ends up lowering operations on types like `i56`.) ## Polymorphization Since it's silly to have separate copies of swapping -- especially *untyped* swapping! -- for `u32`, `i32`, `f32`, `[u16; 2]`, etc, this sends everything to byte versions, but still mono'd by alignment. That should make it more ok that the code is somewhat more complex, since we only get 7 monomorphizations of the complicated bit. (One day we'll be able to remove some of the hacks by being able to just call `foo::<{align_of::<T>()}>`, but since alignments are only powers of two, the manual dispatch out isn't a big deal.)
The job Click to see the possible cause of the failure (guessed by this bot)
|
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (407c8de): comparison URL. Overall result: ❌✅ regressions and improvements - please read the text belowBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.
Max RSS (memory usage)Results (primary 0.1%, secondary 2.3%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResults (primary 1.8%, secondary 1.5%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResults (primary 0.7%, secondary 0.1%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 761.022s -> 763.29s (0.30%) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our swap code is slowly turning into a huge pile of spaghetti, which is concerning... do we really need two intrinsics?
// and our fresh local is always disjoint from anything otherwise readable. | ||
unsafe { | ||
(&raw mut temp).copy_from_nonoverlapping(x, 1); | ||
x.copy_from_nonoverlapping(y, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x.copy_from_nonoverlapping(y, 1); | |
// Important for miri::intrinsic_fallback_is_spec: here we get language UB | |
// if x and y overlap. | |
x.copy_from_nonoverlapping(y, 1); |
//#[rustc_allow_const_fn_unstable(const_eval_select)] | ||
#[rustc_const_unstable(feature = "const_swap_nonoverlapping", issue = "133668")] | ||
#[inline] | ||
pub(crate) const unsafe fn swap_nonoverlapping<T>(x: *mut T, y: *mut T, count: usize) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment clarifying whether this is typed or untyped.
MAX_ALIGN.. => MAX_ALIGN, | ||
64.. => 64, | ||
32.. => 32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On 32bit machines, the 64..
and 32..
arms are dead code. Is that deliberate? If so, that should be explained in comments.
x: &mut mem::MaybeUninit<C>, | ||
y: &mut mem::MaybeUninit<C>, | ||
) { | ||
assert!(size_of::<C>() % ALIGN == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit odd to see a runtime assertion in code like this. Does this work in a const { ... }
block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's how I first had it, but no, it doesn't because calls to this are guarded with "runtime" checks on the specific ALIGN
used -- aka it would need a const-eval-impacting const if
for the caller.
I could change it to debug_assert or just remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. Just adding a comment about this could already help.
hint::assert_unchecked(x.is_aligned_to(ALIGN)); | ||
hint::assert_unchecked(y.is_aligned_to(ALIGN)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these actually help? Seems worth commenting on that since experience shows that assert_unchecked
can often hurt codegen.
@@ -498,6 +498,23 @@ impl<'ll, 'tcx> IntrinsicCallBuilderMethods<'tcx> for Builder<'_, 'll, 'tcx> { | |||
} | |||
} | |||
|
|||
sym::untyped_swap_nonoverlapping => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a good reason for why this is handled in a different way than typed_swap_nonoverlapping
, and in a completely different place as well?
I certainly didn't want to add another one :P From a myopic perspective, they're for two different things:
I'd be happy to change either if you have good ideas. For example, I might be able to make a |
I'm afraid I don't have the overview to suggest a different architecture here... I haven't quite gotten to the bottom of the spaghetti yet. ;) |
A couple of parts here
Fixes #134713
Actually using the type passed to
ptr::swap_nonoverlapping
for anything other than its size + align turns out to not work, so this redo goes back to always swapping via not-!noundef
integers.(Except in
const
, which keeps doing the same thing as before to preserve @RalfJung's fix from #134689)Fixes #134946
I'd previously moved the swapping to use auto-vectorization, but someone pointed out on Discord that the tail loop handling from that left a whole bunch of byte-by-byte swapping around. This PR goes back to a manual chunk, with at most logarithmic more instructions for the tail.
(There are other ways that could potentially handle the tail even better, but this seems to be pretty good, since it's how LLVM ends up lowering operations on types like
i56
.)Polymorphization
Since it's silly to have separate copies of swapping -- especially untyped swapping! -- for
u32
,i32
,f32
,[u16; 2]
, etc, this sends everything to byte versions, but still mono'd by alignment. That should make it more ok that the code is somewhat more complex, since we only get 7 monomorphizations of the complicated bit.(One day we'll be able to remove some of the hacks by being able to just call
foo::<{align_of::<T>()}>
, but since alignments are only powers of two, the manual dispatch out isn't a big deal.)