Rewrite SIMD intrinsic with Rust? #9
Comments
@AdamNiederer that's a good challenge for your faster library! |
This is pretty much a word-for-word translation into Rust SSSE3 intrinsics. It compiles on stable and passes all the accumulate tests. I can make a pull request, or @rtsuk || @raphlinus can add this to make it quicker.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;
#[cfg(target_arch = "x86")]
use std::arch::x86::*;
macro_rules! _mm_shuffle {
    ($z:expr, $y:expr, $x:expr, $w:expr) => {
        ($z << 6) | ($y << 4) | ($x << 2) | $w
    };
}
#[inline]
#[target_feature(enable = "ssse3")] // _mm_shuffle_epi8 is an SSSE3 instruction
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[allow(unsafe_code)]
pub unsafe fn accumulate_sse(input: &[f32], out: &mut Vec<u8>, n: usize) {
    let mut offset = _mm_setzero_ps(); // running sum carried between groups of 4
    let sign_mask = _mm_set1_ps(-0.);
    let mask = _mm_set1_epi32(0x0c080400); // gathers the low byte of each 32-bit lane
    for i in (0..n).step_by(4) {
        // Loads 4 floats starting at i, so `input` must be readable up to index n.
        let mut x = _mm_loadu_ps(&input[i]);
        // The next two adds form an in-register prefix sum of the 4 lanes.
        x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
        x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
        x = _mm_add_ps(x, offset);
        let mut y = _mm_andnot_ps(sign_mask, x); // fabs(x)
        y = _mm_min_ps(y, _mm_set1_ps(1.0));
        y = _mm_mul_ps(y, _mm_set1_ps(255.0));
        let mut z = _mm_cvttps_epi32(y);
        z = _mm_shuffle_epi8(z, mask);
        // Write the 4 packed bytes through a mutable pointer rather than
        // transmuting a shared reference.
        _mm_store_ss(out.as_mut_ptr().add(i) as *mut f32, _mm_castsi128_ps(z));
        offset = _mm_shuffle_ps(x, x, _mm_shuffle!(3, 3, 3, 3)); // broadcast the last lane
    }
}
fn accumulate(src: &[f32]) -> Vec<u8> {
    let len = src.len();
    let n = (len + 3) & !3; // round up to a multiple of 4
    let mut dst: Vec<u8> = vec![0; n]; // Vec::with_capacity(n) won't work here: the length must be n
    unsafe {
        // accumulate_sse loads 4 floats per step, so src must be readable up to index n
        accumulate_sse(src, &mut dst, n);
    }
    dst.truncate(len); // we must return a vec of the same length as src.len()
    dst
}
or merge the SIMD function with the top-level one:
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;
#[cfg(target_arch = "x86")]
use std::arch::x86::*;
macro_rules! _mm_shuffle {
    ($z:expr, $y:expr, $x:expr, $w:expr) => {
        ($z << 6) | ($y << 4) | ($x << 2) | $w
    };
}
#[inline]
#[cfg(feature = "sse")]
#[allow(unsafe_code)]
pub unsafe fn accumulate(src: &[f32]) -> Vec<u8> {
    // The loop handles 4 elements per iteration, so the length is rounded up
    // to the next multiple of 4:
    // len 0 => 0
    // len 1, 2, 3 or 4 => 4
    // len 5 => 8
    // and so on
    let len = src.len();
    let n = (len + 3) & !3; // round up to a multiple of 4
    let mut dst: Vec<u8> = vec![0; n];
    let mut offset = _mm_setzero_ps(); // running sum carried between groups of 4
    let sign_mask = _mm_set1_ps(-0.);
    let mask = _mm_set1_epi32(0x0c080400); // gathers the low byte of each 32-bit lane
    for i in (0..n).step_by(4) {
        // Loads 4 floats starting at i, so `src` must be readable up to index n.
        let mut x = _mm_loadu_ps(&src[i]);
        // The next two adds form an in-register prefix sum of the 4 lanes.
        x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
        x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
        x = _mm_add_ps(x, offset);
        let mut y = _mm_andnot_ps(sign_mask, x); // fabs(x)
        y = _mm_min_ps(y, _mm_set1_ps(1.0));
        y = _mm_mul_ps(y, _mm_set1_ps(255.0));
        let mut z = _mm_cvttps_epi32(y);
        z = _mm_shuffle_epi8(z, mask);
        // Write the 4 packed bytes through a mutable pointer rather than
        // transmuting a shared reference.
        _mm_store_ss(dst.as_mut_ptr().add(i) as *mut f32, _mm_castsi128_ps(z));
        offset = _mm_shuffle_ps(x, x, _mm_shuffle!(3, 3, 3, 3)); // broadcast the last lane
    }
    dst.truncate(len); // we must return a vec of the same length as src.len()
    dst
} |
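For readers following the intrinsics, here is a minimal scalar sketch of what the loop above appears to compute: a running sum of the input, with each partial sum passed through abs, clamped to 1.0, scaled by 255 and truncated to a byte. The names `accumulate_scalar` and `round_up_to_4` are hypothetical and not part of the proposed patch; they are only meant as a reference to compare the SIMD output against. The SIMD version adds the lanes in a different order, so results could differ in the last bit for inputs that are not exactly representable.

```rust
/// Scalar reference for the SIMD loop (hypothetical helper, not from the patch).
fn accumulate_scalar(src: &[f32]) -> Vec<u8> {
    let mut acc = 0.0f32; // running sum, the role `offset` plays in the SIMD code
    src.iter()
        .map(|&v| {
            acc += v;
            // fabs, clamp to 1.0, scale to 0..=255, truncate toward zero
            (acc.abs().min(1.0) * 255.0) as u8
        })
        .collect()
}

/// Round `len` up to the next multiple of 4, as `(len + 3) & !3` does.
fn round_up_to_4(len: usize) -> usize {
    (len + 3) & !3
}
```

For example, `accumulate_scalar(&[0.5, 0.5, -1.0, -0.25])` yields `[127, 255, 0, 63]`, and `round_up_to_4(5)` is `8`.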
I defer to @raphlinus on this one. |
@6D65 On a quick skim, that looks good. I would definitely prefer a PR rather than trying to adapt it from this issue. Maybe the 128-bit one first, with a followup for the AVX (including benchmarks); the latter will certainly require more sophisticated run-time capability testing. |
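As a rough illustration of the run-time capability testing mentioned here, the standard library's `is_x86_feature_detected!` macro can pick a code path at run time. In this sketch `accumulate_fallback`, `accumulate_ssse3` and `accumulate_dispatch` are hypothetical names, and the SSSE3 function has a placeholder body standing in for the real intrinsics; only the dispatch pattern is the point.

```rust
/// Portable fallback (hypothetical placeholder for a scalar routine).
fn accumulate_fallback(src: &[f32]) -> Vec<u8> {
    let mut acc = 0.0f32;
    src.iter()
        .map(|&v| {
            acc += v;
            (acc.abs().min(1.0) * 255.0) as u8
        })
        .collect()
}

/// Stand-in for the SSSE3 routine; a real version would use the intrinsics.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "ssse3")]
unsafe fn accumulate_ssse3(src: &[f32]) -> Vec<u8> {
    accumulate_fallback(src) // placeholder body, an assumption for illustration
}

/// Picks an implementation at run time based on what the CPU reports.
pub fn accumulate_dispatch(src: &[f32]) -> Vec<u8> {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("ssse3") {
            // Sound only because the feature check above succeeded.
            return unsafe { accumulate_ssse3(src) };
        }
    }
    accumulate_fallback(src)
}
```

An AVX path could be added the same way, with its own feature check placed ahead of the SSSE3 one.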
@raphlinus I submitted a pull request from my other account. I have left the feature Also, I modified the render example to dump BMP files, as PGM is not well supported on Windows. I can make a pull request for that as well. |
Issue #9 : Switched to native SIMD instructions in Rust. Deleted build.rs and the C code
May I ask why this issue is still open? Are we waiting for AVX support before closing? Changing the render example from PGM to BMP should probably be a separate issue. |
Sorry, didn't realize this is fixed. Closing. |
Ref: rust-lang/rfcs#1639