Rewrite SIMD intrinsic with Rust? #9

Closed
louy2 opened this issue Aug 5, 2016 · 7 comments

louy2 commented Aug 5, 2016

Ref: rust-lang/rfcs#1639

kpp commented Jan 27, 2018

@AdamNiederer that's a good challenge for your `faster` library!

6D65 commented Sep 1, 2018

This is pretty much a word-for-word translation into Rust SSE3 intrinsics. It compiles on stable and passes all the accumulate tests.

I can make a pull request, or @rtsuk / @raphlinus can add this directly if that's quicker.
I'd also like to make an AVX2 version of this, to accumulate 8 values at once; I don't see any reason it wouldn't work.

    use std::mem;

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    #[cfg(target_arch = "x86")]
    use std::arch::x86::*;

    macro_rules! _mm_shuffle {
        ($z:expr, $y:expr, $x:expr, $w:expr) => {
            ($z << 6) | ($y << 4) | ($x << 2) | $w
        };
    }

    #[inline]
    #[target_feature(enable = "sse3")]
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    #[allow(unsafe_code)]
    pub unsafe fn accumulate_sse(input: &[f32], out: &mut Vec<u8>, n: usize) {
        let mut offset = _mm_setzero_ps();
        let sign_mask = _mm_set1_ps(-0.);
        let mask = _mm_set1_epi32(0x0c080400);

        for i in (0..n).step_by(4) {
            let mut x = _mm_loadu_ps(&input[i]);
            // shift-and-add twice to form the inclusive prefix sum across the four lanes
            x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
            x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
            x = _mm_add_ps(x, offset);

            let mut y = _mm_andnot_ps(sign_mask, x); // fabs(x)
            y = _mm_min_ps(y, _mm_set1_ps(1.0));
            y = _mm_mul_ps(y, _mm_set1_ps(255.0));

            // truncate to i32 and gather the low byte of each lane into the low four bytes
            let mut z = _mm_cvttps_epi32(y);
            z = _mm_shuffle_epi8(z, mask);

            // store the four packed bytes to out[i..i + 4], then carry the last lane's
            // running sum into the next iteration
            _mm_store_ss(mem::transmute(&out[i]), _mm_castsi128_ps(z));
            offset = _mm_shuffle_ps(x, x, _mm_shuffle!(3, 3, 3, 3));
        }
    }

    fn accumulate(src: &[f32]) -> Vec<u8> {
        let len = src.len();
        let n = (len + 3) & !3; // align data
        let mut dst: Vec<u8> = vec![0; n]; // Vec::with_capacity(n) won't work here
        unsafe {
            accumulate_sse(src, &mut dst, n);
            dst.set_len(len); // we must return vec of the same length as src.len()
        }
        dst
    }

Or merge the SIMD function with the top-level one:

    use std::mem;

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    #[cfg(target_arch = "x86")]
    use std::arch::x86::*;

    macro_rules! _mm_shuffle {
        ($z:expr, $y:expr, $x:expr, $w:expr) => {
            ($z << 6) | ($y << 4) | ($x << 2) | $w
        };
    }

    #[inline]
    #[cfg(feature = "sse")]
    #[allow(unsafe_code)]
    pub unsafe fn accumulate(src: &[f32]) -> Vec<u8> {
        // SIMD instructions force us to pad the data since we iterate over 4 elements at a time
        // So:
        // n (0) => 0
        // n (1 or 2 or 3 or 4) => 4,
        // n (5) => 8
        // and so on
        let len = src.len();
        let n = (len + 3) & !3; // align data
        let mut dst: Vec<u8> = vec![0; n];
        let mut offset = _mm_setzero_ps();
        let sign_mask = _mm_set1_ps(-0.);
        let mask = _mm_set1_epi32(0x0c080400);

        for i in (0..n).step_by(4) {
            let mut x = _mm_loadu_ps(&src[i]);
            x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
            x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
            x = _mm_add_ps(x, offset);

            let mut y = _mm_andnot_ps(sign_mask, x); // fabs(x)
            y = _mm_min_ps(y, _mm_set1_ps(1.0));
            y = _mm_mul_ps(y, _mm_set1_ps(255.0));

            let mut z = _mm_cvttps_epi32(y);
            z = _mm_shuffle_epi8(z, mask);

            _mm_store_ss(mem::transmute(&dst[i]), _mm_castsi128_ps(z));
            offset = _mm_shuffle_ps(x, x, _mm_shuffle!(3, 3, 3, 3));
        }

        dst.set_len(len); // we must return vec of the same length as src.len()

        dst
    }
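
For reference, here is a minimal scalar sketch of what the loop above computes: a running sum of the input deltas, with the absolute value clamped to [0, 1] and scaled into a byte. The `accumulate_scalar` name is just for illustration; it is not part of the proposed change:

    fn accumulate_scalar(src: &[f32]) -> Vec<u8> {
        let mut acc = 0.0f32;
        src.iter()
            .map(|&delta| {
                // running coverage sum, then |acc| clamped to [0, 1] and scaled to 0..=255
                acc += delta;
                (acc.abs().min(1.0) * 255.0) as u8
            })
            .collect()
    }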

rtsuk commented Sep 1, 2018

I defer to @raphlinus on this one.

raphlinus commented

@6D65 On a quick skim, that looks good. I would definitely prefer a PR rather than trying to adapt it from this issue. Maybe the 128-bit one first, with a followup for the AVX (including benchmarks); the latter will certainly require more sophisticated run-time capability testing.
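
For the run-time capability testing, the standard `is_x86_feature_detected!` macro is the usual tool. A rough dispatch sketch (the `accumulate_avx2` and `accumulate_fallback` names below are placeholders, not functions that exist in this crate):

    fn accumulate_dispatch(src: &[f32]) -> Vec<u8> {
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        {
            if is_x86_feature_detected!("avx2") {
                // hypothetical 8-wide AVX2 path
                return unsafe { accumulate_avx2(src) };
            }
        }
        // portable path for everything else
        accumulate_fallback(src)
    }

The SSE3 path could be selected the same way with `is_x86_feature_detected!("sse3")`.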

6D65 commented Sep 2, 2018

@raphlinus submitted a pull request from my other account. I have left the `sse` feature in place; I assume it's handy for benchmarking SIMD vs non-SIMD, but other than that I can't think of a reason to keep it.
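
In case it helps, this is roughly how that feature would gate the two paths at compile time (the `accumulate_impl` and `accumulate_scalar` names here are just illustrative, not the crate's actual API):

    #[cfg(feature = "sse")]
    pub fn accumulate_impl(src: &[f32]) -> Vec<u8> {
        unsafe { accumulate(src) } // the SIMD version above
    }

    #[cfg(not(feature = "sse"))]
    pub fn accumulate_impl(src: &[f32]) -> Vec<u8> {
        accumulate_scalar(src) // hypothetical scalar fallback
    }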

Also, I modified the render example to dump BMP files, as PGM isn't really supported on Windows. I can make a pull request for that as well.

raphlinus added a commit that referenced this issue Sep 18, 2018
Issue #9 : Switched to native SIMD instructions in Rust. Deleted build.rs and the C code

johannesvollmer commented

May I ask why this issue is still open? Are we waiting for AVX support before closing? Changing the render example from PGM to BMP should probably be a separate issue.

louy2 commented Mar 6, 2019

Sorry, didn't realize this was fixed. Closing.

louy2 closed this as completed Mar 6, 2019