Rewrite SIMD intrinsic with Rust? #9

Closed
louy2 opened this issue Aug 5, 2016 · 7 comments

louy2 commented Aug 5, 2016

Ref: rust-lang/rfcs#1639

kpp commented Jan 27, 2018

@AdamNiederer that's a good challenge for your `faster` library!

6D65 commented Sep 1, 2018

This is pretty much a word-for-word translation into Rust SSE3 intrinsics. It compiles on stable and passes all the accumulate tests.

I can make a pull request, or @rtsuk / @raphlinus can add this directly if that's quicker.
I'd also like to make an AVX2 version of this, to accumulate 8 values at once; I don't see any reason it wouldn't work.

    use std::mem;

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    #[cfg(target_arch = "x86")]
    use std::arch::x86::*;

    macro_rules! _mm_shuffle {
        ($z:expr, $y:expr, $x:expr, $w:expr) => {
            ($z << 6) | ($y << 4) | ($x << 2) | $w
        };
    }

    #[inline]
    #[target_feature(enable = "sse3")]
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    #[allow(unsafe_code)]
    pub unsafe fn accumulate_sse(input: &[f32], out: &mut Vec<u8>, n: usize) {
        let mut offset = _mm_setzero_ps();
        let sign_mask = _mm_set1_ps(-0.);
        let mask = _mm_set1_epi32(0x0c080400);

        for i in (0..n).step_by(4) {
            let mut x = _mm_loadu_ps(&input[i]);
            // shift-and-add twice to form the inclusive prefix sum across the four lanes
            x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
            x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
            x = _mm_add_ps(x, offset);

            let mut y = _mm_andnot_ps(sign_mask, x); // fabs(x)
            y = _mm_min_ps(y, _mm_set1_ps(1.0));
            y = _mm_mul_ps(y, _mm_set1_ps(255.0));

            // truncate to i32 and gather the low byte of each lane into the low four bytes
            let mut z = _mm_cvttps_epi32(y);
            z = _mm_shuffle_epi8(z, mask);

            // store the four packed bytes to out[i..i + 4], then carry the last lane's
            // running sum into the next iteration
            _mm_store_ss(mem::transmute(&out[i]), _mm_castsi128_ps(z));
            offset = _mm_shuffle_ps(x, x, _mm_shuffle!(3, 3, 3, 3));
        }
    }

    fn accumulate(src: &[f32]) -> Vec<u8> {
        let len = src.len();
        let n = (len + 3) & !3; // align data
        let mut dst: Vec<u8> = vec![0; n]; // Vec::with_capacity(n) won't work here
        unsafe {
            accumulate_sse(src, &mut dst, n);
            dst.set_len(len); // we must return vec of the same length as src.len()
        }
        dst
    }

Or merge the SIMD function with the top-level one:

    use std::mem;

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    #[cfg(target_arch = "x86")]
    use std::arch::x86::*;

    macro_rules! _mm_shuffle {
        ($z:expr, $y:expr, $x:expr, $w:expr) => {
            ($z << 6) | ($y << 4) | ($x << 2) | $w
        };
    }

    #[inline]
    #[cfg(feature = "sse")]
    #[allow(unsafe_code)]
    pub unsafe fn accumulate(src: &[f32]) -> Vec<u8> {
        // SIMD instructions force us to pad the data since we iterate over 4 elements at a time
        // So:
        // n (0) => 0
        // n (1 or 2 or 3 or 4) => 4,
        // n (5) => 8
        // and so on
        let len = src.len();
        let n = (len + 3) & !3; // align data
        let mut dst: Vec<u8> = vec![0; n];
        let mut offset = _mm_setzero_ps();
        let sign_mask = _mm_set1_ps(-0.);
        let mask = _mm_set1_epi32(0x0c080400);

        for i in (0..n).step_by(4) {
            let mut x = _mm_loadu_ps(&src[i]);
            x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
            x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
            x = _mm_add_ps(x, offset);

            let mut y = _mm_andnot_ps(sign_mask, x); // fabs(x)
            y = _mm_min_ps(y, _mm_set1_ps(1.0));
            y = _mm_mul_ps(y, _mm_set1_ps(255.0));

            let mut z = _mm_cvttps_epi32(y);
            z = _mm_shuffle_epi8(z, mask);

            _mm_store_ss(mem::transmute(&dst[i]), _mm_castsi128_ps(z));
            offset = _mm_shuffle_ps(x, x, _mm_shuffle!(3, 3, 3, 3));
        }

        dst.set_len(len); // we must return vec of the same length as src.len()

        dst
    }
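
For reference, here is a minimal scalar sketch of what the loop above computes: a running sum of the input deltas, with the absolute value clamped to [0, 1] and scaled into a byte. The `accumulate_scalar` name is just for illustration; it is not part of the proposed change:

    fn accumulate_scalar(src: &[f32]) -> Vec<u8> {
        let mut acc = 0.0f32;
        src.iter()
            .map(|&delta| {
                // running coverage sum, then |acc| clamped to [0, 1] and scaled to 0..=255
                acc += delta;
                (acc.abs().min(1.0) * 255.0) as u8
            })
            .collect()
    }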

rtsuk commented Sep 1, 2018

I defer to @raphlinus on this one.

raphlinus commented

@6D65 On a quick skim, that looks good. I would definitely prefer a PR rather than trying to adapt it from this issue. Maybe the 128-bit one first, with a followup for the AVX (including benchmarks); the latter will certainly require more sophisticated run-time capability testing.
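
For the run-time capability testing, the standard `is_x86_feature_detected!` macro is the usual tool. A rough dispatch sketch (the `accumulate_avx2` and `accumulate_fallback` names below are placeholders, not functions that exist in this crate):

    fn accumulate_dispatch(src: &[f32]) -> Vec<u8> {
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        {
            if is_x86_feature_detected!("avx2") {
                // hypothetical 8-wide AVX2 path
                return unsafe { accumulate_avx2(src) };
            }
        }
        // portable path for everything else
        accumulate_fallback(src)
    }

The SSE3 path could be selected the same way with `is_x86_feature_detected!("sse3")`.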

6D65 commented Sep 2, 2018

@raphlinus submitted a pull request from my other account. I have left the `sse` feature in place; I assume it's handy for benchmarking SIMD vs non-SIMD, but other than that I can't think of a reason to keep it.
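
In case it helps, this is roughly how that feature would gate the two paths at compile time (the `accumulate_impl` and `accumulate_scalar` names here are just illustrative, not the crate's actual API):

    #[cfg(feature = "sse")]
    pub fn accumulate_impl(src: &[f32]) -> Vec<u8> {
        unsafe { accumulate(src) } // the SIMD version above
    }

    #[cfg(not(feature = "sse"))]
    pub fn accumulate_impl(src: &[f32]) -> Vec<u8> {
        accumulate_scalar(src) // hypothetical scalar fallback
    }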

Also, I modified the render example to dump BMP files, as PGM isn't really supported on Windows. I can make a pull request for that as well.

raphlinus added a commit that referenced this issue Sep 18, 2018
Issue #9 : Switched to native SIMD instructions in Rust. Deleted build.rs and the C code

johannesvollmer commented

May I ask why this issue is still open? Are we waiting for AVX support before closing? Changing the render example from PGM to BMP should probably be a separate issue.

louy2 commented Mar 6, 2019

Sorry, didn't realize this was fixed. Closing.

louy2 closed this as completed Mar 6, 2019