Skip to content
Merged
56 changes: 43 additions & 13 deletions src/libcore/mem.rs
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,7 @@ pub use intrinsics::transmute;
/// [`Clone`][clone]. You need the value's destructor to run only once,
/// because a double `free` is undefined behavior.
///
/// An example is the definition of [`mem::swap`][swap] in this module:
/// An example is a possible implementation of [`mem::swap`][swap]:
///
/// ```
/// use std::mem;
Expand Down Expand Up @@ -447,18 +447,48 @@ pub unsafe fn uninitialized<T>() -> T {
#[stable(feature = "rust1", since = "1.0.0")]
pub fn swap<T>(x: &mut T, y: &mut T) {
unsafe {
// Give ourselves some scratch space to work with
let mut t: T = uninitialized();

// Perform the swap, `&mut` pointers never alias
ptr::copy_nonoverlapping(&*x, &mut t, 1);
ptr::copy_nonoverlapping(&*y, x, 1);
ptr::copy_nonoverlapping(&t, y, 1);

// y and t now point to the same thing, but we need to completely
// forget `t` because we do not want to run the destructor for `T`
// on its value, which is still owned somewhere outside this function.
forget(t);
// The approach here is to utilize simd to swap x & y efficiently. Testing reveals
// that swapping either 32 bytes or 64 bytes at a time is most efficient for intel
// Haswell E processors. LLVM is more able to optimize if we give a struct a
// #[repr(simd)], even if we don't actually use this struct directly.
#[repr(simd)]
struct Block(u64, u64, u64, u64);
struct UnalignedBlock(u64, u64, u64, u64);

let block_size = size_of::<Block>();

// Get raw pointers to the bytes of x & y for easier manipulation
let x = x as *mut T as *mut u8;
let y = y as *mut T as *mut u8;

// Loop through x & y, copying them `Block` at a time
// The optimizer should unroll the loop fully for most types
// N.B. We can't use a for loop as the `range` impl calls `mem::swap` recursively
let len = size_of::<T>() as isize;
let mut i = 0;
while i + block_size as isize <= len {
// Create some uninitialized memory as scratch space
// Decaring `t` here avoids aligning the stack when this loop is unused
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Declaring

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gah!

let mut t: Block = uninitialized();
let t = &mut t as *mut _ as *mut u8;

// Swap a block of bytes of x & y, using t as a temporary buffer
// This should be optimized into efficient SIMD operations where available
ptr::copy_nonoverlapping(x.offset(i), t, block_size);
ptr::copy_nonoverlapping(y.offset(i), x.offset(i), block_size);
ptr::copy_nonoverlapping(t, y.offset(i), block_size);
i += block_size as isize;
}
if i < len {
// Swap any remaining bytes
let mut t: UnalignedBlock = uninitialized();
let t = &mut t as *mut _ as *mut u8;

let rem = (len - i) as usize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it ever matters, but it occurs to me that calculating rem as len % 16 instead of len - i might be more obvious to an optimizer, as knowing what len - i is requires reasoning about a loop. Similarly, it would be better to test whether rem == 0 than to test i < len, as it is more trivially determined at compile time.

I doubt this matters in most cases, but I could see the optimizer failing when the size is large, since the loop might not be fully unrolled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested it; looks like the generated assembly is the same either way. len here is a constant known at compile time so I guess it just gets completely eliminated

ptr::copy_nonoverlapping(x.offset(i), t, rem);
ptr::copy_nonoverlapping(y.offset(i), x.offset(i), rem);
ptr::copy_nonoverlapping(t, y.offset(i), rem);
}
}
}

Expand Down