@calebzulawski
Member

It's a little confusing for the mask element type to be generic over "equivalently sized" integers, rather than the actual element type. This PR changes e.g. Mask<isize, N> to Mask<*const u8, N>, or Mask<usize, N>, or whatever the element type actually is. Another PR in the future could probably remove the MaskElement trait entirely, but I think this is a good start.

Also, I removed the From implementation for converting masks, because with this many valid element types it's not really reasonable to implement.

@thomcc
Member

thomcc commented Dec 14, 2022

Also, I removed the From implementation for converting masks, because with this many valid element types it's not really reasonable to implement.

This seems like a pretty big downside to this approach...

@calebzulawski
Member Author

calebzulawski commented Dec 14, 2022

There is still the cast function, which is probably more ergonomic because you can do something like mask.cast::<u8>(). This corresponds to how Simd is converted.

Take a look at how From was previously implemented--it's manually implemented over every combination of types, so it explodes quadratically. This is only a problem because we can't write the implementation impl<T, U, const N: usize> From<Mask<T, N>> for Mask<U, N> that conflicts with the blanket implementation.
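The coherence problem can be sketched on stable Rust with a toy type. This is only an illustration under stated assumptions: the `Mask` below is a hypothetical stand-in (an array of `bool` plus a type tag), not the real `std::simd` type. The generic `From` impl is left in a comment because it does not compile, while a single generic `cast` method covers every pair of element types.

```rust
use std::marker::PhantomData;

/// Toy stand-in for `Mask<T, N>` (hypothetical; the real type lives in `std::simd`).
pub struct Mask<T, const N: usize> {
    lanes: [bool; N],
    _element: PhantomData<T>,
}

impl<T, const N: usize> Mask<T, N> {
    pub fn from_array(lanes: [bool; N]) -> Self {
        Mask { lanes, _element: PhantomData }
    }

    pub fn to_array(&self) -> [bool; N] {
        self.lanes
    }

    /// One generic method handles every (T, U) pair, including T == U.
    pub fn cast<U>(self) -> Mask<U, N> {
        Mask { lanes: self.lanes, _element: PhantomData }
    }
}

// The equivalent generic `From` impl is rejected (error E0119), because for
// T == U it overlaps with the standard library's `impl<T> From<T> for T`:
//
// impl<T, U, const N: usize> From<Mask<T, N>> for Mask<U, N> {
//     fn from(mask: Mask<T, N>) -> Self { mask.cast() }
// }
```

With this toy type, `Mask::<f32, 4>::from_array(..).cast::<*const u8>()` plays the role that `into()` would have played.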

@programmerjake
Member

This is only a problem because we can't write the implementation impl<T, U, const N: usize> From<Mask<T, N>> for Mask<U, N> that conflicts with the blanket implementation.

this makes me think Rust needs where T != U bounds, so we can implement From for everything not covered by From<T> for T

Member

@programmerjake programmerjake left a comment

the changes look mostly good, though imho these changes will make working with masks much more verbose, i'm not sure if this is a good idea...

 pub fn gather_select(
     slice: &[T],
-    enable: Mask<isize, LANES>,
+    enable: Mask<usize, LANES>,
Member

note x86 has mask sizes match the data size, not the index/address size...we should probably match that:
https://www.felixcloutier.com/x86/vgatherdps:vgatherqps#vgatherqps--vex-128-version-

Member

opened #323 to track this.

Member Author

Good point. With this change it would be easy to make this take Mask<T, LANES> instead, but I'll leave that to a separate PR.
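As a rough scalar model of what a data-typed gather mask could look like, here is a sketch on stable Rust. The `Mask` type is a hypothetical toy and the function only mirrors the shape of `gather_select` (enabled, in-bounds lanes are read from the slice, all others come from `or`); it is not the real implementation.

```rust
use std::marker::PhantomData;

/// Toy mask tagged with an element type (hypothetical stand-in for `Mask<T, N>`).
pub struct Mask<T, const N: usize> {
    lanes: [bool; N],
    _element: PhantomData<T>,
}

impl<T, const N: usize> Mask<T, N> {
    pub fn from_array(lanes: [bool; N]) -> Self {
        Mask { lanes, _element: PhantomData }
    }
}

/// Scalar model of a masked gather: a lane is read from `slice` only when it is
/// enabled and its index is in bounds; otherwise it keeps the value from `or`.
/// The mask is typed by the data element `T`, not by the `usize` index.
pub fn gather_select<T: Copy, const N: usize>(
    slice: &[T],
    enable: Mask<T, N>,
    idxs: [usize; N],
    or: [T; N],
) -> [T; N] {
    std::array::from_fn(|i| {
        if enable.lanes[i] && idxs[i] < slice.len() {
            slice[idxs[i]]
        } else {
            or[i]
        }
    })
}
```

Here the caller builds a `Mask<f32, N>` to gather `f32` values, which matches the x86 convention mentioned above where the mask width follows the data width.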

@calebzulawski
Member Author

I don't think it will be "much" more verbose--in most situations the mask matches the vector type. In the cases where it doesn't, I think you're still typically going to do all of your operations with a single mask type, and have some casts before or after (much like into).

It could definitely make some code more verbose, but I think this is an acceptable tradeoff because it doesn't have any performance implications, and I think it's significantly easier to explain Mask<*const T, N> vs Mask<isize, N>, not to mention it carries much more semantic meaning. We've already seen some confusion as to what the mask element type is and why.

@workingjubilee
Member

Hmm. Having to drop From really hurts, still, and I'm a little worried about all the extra instances of .cast() that have to be thrown around. What are some examples of code that this makes simpler? Or is there any code that would be compiled better with this suite of implementations?

@calebzulawski
Member Author

calebzulawski commented Dec 19, 2022

In retrospect, I wouldn't even implement From with our current masks (the number of implementations doesn't sit right with me, and they just call cast anyway). If that's the dealbreaker I can still implement it for this PR.

This change doesn't affect compilation, the layouts etc are still identical. As far as simpler I'm not aiming for "less verbose" but "less cognitive load". Imagine this as a new std::simd user who might not even be particularly well versed in SIMD:

fn foo(x: Simd<f32, 4>, p: Simd<*const u8, 4>) -> Simd<*const u8, 4> {
    let mask: Mask<f32, 4> = x.is_nan(); // would this make more sense as Mask<i32, 4> when there's no i32 anywhere?
    mask.cast::<*const u8>().select(Simd::splat(std::ptr::null()), p) // what about isize here?
}

IMO using signed integers also implies that the masks are just vectors (I know we document otherwise, but it's not helping). I think sometimes requiring a cast for select etc, but not always, cements in the API that cast is expensive, rather than target-specific. Considering all of the newer instruction sets seem to use bitmasks (where cast is always completely free), I don't think that's the right implication.
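To illustrate why a cast costs nothing when masks are bitmasks, here is a minimal sketch with a hypothetical `BitMask` type that stores one bit per lane. It is an assumption-laden toy, not the layout `std::simd` uses: the element type is only a tag, so `cast` hands the same bits back under a different type.

```rust
use std::marker::PhantomData;

/// Toy bitmask-backed mask: one bit per lane, as on bitmask-based ISAs.
/// Hypothetical; not the real `std::simd::Mask` layout.
pub struct BitMask<T, const N: usize> {
    bits: u64,
    _element: PhantomData<T>,
}

impl<T, const N: usize> BitMask<T, N> {
    pub fn from_array(lanes: [bool; N]) -> Self {
        assert!(N <= 64, "this sketch stores at most 64 lanes");
        let mut bits = 0u64;
        for (i, &lane) in lanes.iter().enumerate() {
            if lane {
                bits |= 1u64 << i;
            }
        }
        BitMask { bits, _element: PhantomData }
    }

    /// Returns whether the given lane is set; out-of-range lanes read as false.
    pub fn test(&self, lane: usize) -> bool {
        lane < N && (self.bits >> lane) & 1 == 1
    }

    /// Changing the element type only changes the type tag; the bits are untouched.
    pub fn cast<U>(self) -> BitMask<U, N> {
        BitMask { bits: self.bits, _element: PhantomData }
    }

    pub fn to_bits(&self) -> u64 {
        self.bits
    }
}
```

With full-width vector masks the same `cast` may need a widening or narrowing step, which is the target-specific cost discussed in this comment.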

@calebzulawski
Member Author

calebzulawski commented Dec 19, 2022

Just a silly example of this being a flawed hint in std::simd today, this is not so great on AVX (not AVX2). The f32 section drops to SSE (despite no cast hinting that something funny might happen). With more complex code it might still use AVX and require an extra move (cheap, but not free like an AVX-512 cast) to create two SSE masks:

fn foo(x: f32x8, y: i32x8, z: i32x8) -> i32x8 {
    x.is_sign_positive().select(y, y + z)
}

@programmerjake
Member

programmerjake commented Dec 19, 2022

Just a silly example of this being a flawed hint in std::simd today, this is not so great on AVX (not AVX2).

i wouldn't blame masks for that, i'd instead blame 2 out of 3 of the operations you're trying to do not being supported by AVX, requiring AVX2:

  • is_sign_positive -- not actually a fp operation, is really transmute(v) >> 31 -- requires AVX2
  • i32x8::add -- not a fp operation -- requires AVX2

LLVM has therefore reasonably decided using SSE operations throughout is faster than using AVX load, conversion to SSE, SSE shift, conversion to AVX, AVX select, conversion to SSE, SSE int add, conversion to AVX, and finally AVX store.

if you change it to the following, it uses AVX operations throughout because fp comparisons and fp/int bitwise logic are fully supported by AVX:
https://rust.godbolt.org/z/nr484jdhv

pub fn foo(x: f32x8, y: i32x8, z: i32x8) -> i32x8 {
    x.simd_gt(Simd::splat(0.0f32)).select(y, y ^ z)
}
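A scalar sketch of the two predicates may help here. The function names are hypothetical and arrays stand in for vectors. The first function tests the sign purely on the integer bit pattern, which is why it counts as an integer operation; the second is an ordinary floating-point comparison.

```rust
/// Lane-wise sign test done purely on the bit pattern: the sign bit of an
/// IEEE 754 binary32 value is bit 31, and "sign positive" means it is clear.
pub fn is_sign_positive_bits<const N: usize>(x: [f32; N]) -> [bool; N] {
    x.map(|lane| (lane.to_bits() >> 31) == 0)
}

/// Lane-wise floating-point comparison against zero, as in the rewritten `foo`.
pub fn gt_zero<const N: usize>(x: [f32; N]) -> [bool; N] {
    x.map(|lane| lane > 0.0)
}
```

Note that the two are not interchangeable in general: they disagree on `+0.0`, which has a clear sign bit but is not greater than zero.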

The f32 section drops to SSE (despite no cast hinting that something funny might happen). With more complex code it might still use AVX and require an extra move (cheap, but not free like an AVX-512 cast) to create two SSE masks:

note that the SSE <-> AVX moves are because of moving data from the high 128 bits to a separate register so it can be used for SSE ops since SSE instructions can only read/write the lower 128 bits (they technically can write zeros to the high 128-bits if encoded using the AVX encoding), it has nothing to do with that data being a Mask or not.

This mess is all caused by Intel deciding the first AVX and SSE extensions only need fp and bitwise ops, no integer ops until AVX2/SSE2. IMHO it's completely reasonable to not use AVX at all unless AVX2 is available.

@calebzulawski
Member Author

I don't disagree with any of that--my point is that "f32 and i32 should use the same mask type because they are always compatible" isn't quite true. Any particular architecture will have varying support for different element types (altivec and v7 neon not supporting f64 is another example). I just don't think the API should be so opinionated to particularly accommodate some architectures.

Member

@programmerjake programmerjake left a comment

ah, your point that optimal mask types can vary based on element type, not just element size is a good one...

@programmerjake
Member

we'll probably want to wait and see what others think of your point before merging

@workingjubilee
Member

I'm still kinda mulling this over and I agree that what you cite is a bit of an oddity.

I don't think the From implementations are a dealbreaker. Maybe Thom would?

I agree that ideally we would have something that encourages either not caring much about the architecture's specifics or being aware that the architecture's specifics are... well, specific. Hm.

@thomcc
Member

thomcc commented Dec 28, 2022

Maybe Thom would?

No, I don't think it's a dealbreaker. Their absence will be badly missed, though...

@calebzulawski
Member Author

calebzulawski commented Jan 21, 2023

I found a trick with macros to implement all of the scalars, but still ran into an issue with:

impl<T, U, const LANES: usize> From<Mask<*const T, LANES>> for Mask<*const U, LANES>

because this still overlaps with From<T> for T.

How does everyone feel about merging this as-is, and hopefully getting From in the future?

@programmerjake
Member

How does everyone feel about merging this as-is, and hopefully getting From in the future?

sounds ok to me, as long as others are fine with it.

@workingjubilee
Member

Apologies for the delay in response. I have thought it over and I think this points to a need to revise our approach to Masks on a more fundamental level (sadly) (again!) but I have no objection to this change as-is.

Member

@workingjubilee workingjubilee left a comment

If Thom approves I will go ahead and merge this.

@thomcc
Member

thomcc commented Mar 11, 2023

I'm not a fan. Don't get me wrong, I don't love the old design either... but I think this might be the wrong direction. Here are some specific concerns around it:

  • This change is already causing us friction in our APIs (e.g. From can't be implemented the way we would like), and I would be unsurprised if user code hit issues in their own traits after the change.

  • From the patch diff, it seems likely users of std::simd will end up needing to use complicated types and projections in their functions and/or signatures like Simd<<T as SimdElement>::Mask, LANES> or <Simd<T, LANES> as SimdPartialEq>::Mask in order to express the operations they want to perform. This is very complex, and IMO needs much stronger motivation than given.

  • Several cases where Mask::foo used to work without a turbofish can no longer be inferred without additional turbofishing. This will probably result in confusion and bad error messages.

A less specific (more vibes-based) concern is that each additional level of genericity and type-level computation we perform here complicates error messages, slows compiles, and adds monomorphizations users will have to suffer through.

My feeling and experience is that highly generic APIs like where this is headed are very divisive in the community. Things like rand, diesel, etc are great pieces of engineering. This isn't at those levels yet, for sure, but this is pushing us further beyond anything currently in the stdlib (with the possible exception of the std::ops machinery for ?, which is a smaller surface and has some tricky compatibility constraints).


As for the argument that perhaps an ISA will have different mask types based on vector element type, I think it is uncompelling. IMO it's a bad pattern to try to design for every possible hypothetical, as it's an unbounded set. A new SIMD ISA could be released that behaves in any possible way, and even if it's narrowed down to APIs that seem plausible, you still risk making the code of every user of the API more complicated for the benefit of something which may never exist. IOW, at a certain point, I think it's fine to say "std::arch is still available" [1].

That said, for this specific case I'm not convinced the change would make a difference even if such an ISA comes to pass. Concretely, compare this with mask representations that vary based on operation [2], rather than the more common things like elem size, lane count, and so on. We don't worry about this so much because we assume that most of the time the API usage should be inlined, and if things are inlined we're hoping that LLVM can handle it [3]. ISTM like the same logic should apply.

P.S. Very tired, apologies for any typos. Comment made anyway to avoid this landing without me saying something.

Footnotes

  1. Hell, I already have to use std::arch just to get good performance out of std::simd in many cases...

  2. That is, instruction used; some may use bitmasks, others full size masks, etc.

  3. Whether it can remains to be seen in both cases, admittedly...

@programmerjake
Member

  • From the patch diff, it seems likely users of std::simd will end up needing to use complicated types and projections in their functions and/or signatures like Simd<<T as SimdElement>::Mask, LANES> or <Simd<T, LANES> as SimdPartialEq>::Mask in order to express the operations they want to perform. This is very complex, and IMO needs much stronger motivation than given.

imho it's the other way around, currently we tend to require Mask<T::Mask, N> to be able to mask Simd<T, N> operations, with this we can just use Mask<T, N> for Simd<T, N>, no complex associated types needed by the users.

@workingjubilee
Member

workingjubilee commented Mar 11, 2023

Mmk.

I agree that we should go back to the drawing board on this part, honestly. I just was fine with trying to implement a redesign on top of a HEAD with this commit.

I've been experimenting with bits of redesigns that use associated types, but in ways that more strongly bind the operations we are allowing to the types of Simd without losing genericity over lanes, and with, yes, less imports and projections needed by users (the signature might look slightly jank in std but... uh, nothing Diesel-tier). We've gained some powerful tools recently, we should start using them and trying to retackle the problems we kind of fudged earlier.

I think we probably need more code examples in this repo of using our own API so that the consequences of our diffs will be more immediately obvious on user code to help settle this kind of concern in the future, rather than feeling like we have to hem and haw for weeks. It should be more transparent whether something feels like a good or bad idea based on those examples.

@calebzulawski
Member Author

Regarding error messages etc, I can't see anything more straightforward than "the mask matches the vector element". We have already seen examples of people misunderstanding the masks and assuming the mask matches the vector and seeing very strange messages like "expected i64, got u64" and the thought process isn't "maybe masks have different element types" but instead "why does the compiler think my u64x4 is actually i64x4?".

It would be nice if masks didn't depend on the vector element, but there's nothing we can do about that. If they must depend on the element type, this actually seems the least abstracted to me.

I agree with @programmerjake regarding generics, I'm in the process of implementing num-traits for vectors and currently need Mask<<T as SimdElement>::Mask, N>, but with this change would be able to do simply Mask<T, N>.
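The difference in generic signatures can be sketched on stable Rust with toy types. Everything here is hypothetical: `Element` stands in for `SimdElement`, `Mask` is an array of `bool` with a type tag, and `select_old` / `select_new` are made-up names, with arrays standing in for vectors. The sketch only shows the shape of the bounds a generic caller has to write in each design.

```rust
use std::marker::PhantomData;

/// Stand-in for `SimdElement` with the associated mask element type used
/// before this change (hypothetical, simplified).
pub trait Element: Copy {
    type Mask;
}
impl Element for f32 {
    type Mask = i32;
}
impl Element for u64 {
    type Mask = i64;
}

/// Toy mask tagged with an element type.
pub struct Mask<T, const N: usize> {
    lanes: [bool; N],
    _element: PhantomData<T>,
}

impl<T, const N: usize> Mask<T, N> {
    pub fn from_array(lanes: [bool; N]) -> Self {
        Mask { lanes, _element: PhantomData }
    }
}

/// Old style: the mask is named through a projection on the element type.
pub fn select_old<T: Element, const N: usize>(
    mask: Mask<T::Mask, N>,
    if_true: [T; N],
    if_false: [T; N],
) -> [T; N] {
    std::array::from_fn(|i| if mask.lanes[i] { if_true[i] } else { if_false[i] })
}

/// New style: the mask simply shares the element type of the data.
pub fn select_new<T: Copy, const N: usize>(
    mask: Mask<T, N>,
    if_true: [T; N],
    if_false: [T; N],
) -> [T; N] {
    std::array::from_fn(|i| if mask.lanes[i] { if_true[i] } else { if_false[i] })
}
```

In the old style a caller working with `u64` data has to know that the matching mask is `Mask<i64, N>`; in the new style it is `Mask<u64, N>`.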

@workingjubilee
Member

If the real splitting point is on error messages, then maybe we need some ui test examples, then? We should do what we need to do to feel confident shipping stuff.

@jhorstmann

As for the argument that perhaps an ISA will have different mask types based on vector element type, I think it is uncompelling.

I wonder how difficult it would be to add support for #[repr(simd)] for [bool; N] in rustc, and whether that would sidestep any issues with the mask element type. The array would have no layout guarantees, and llvm codegen should treat that as <N x i1>.

AFAIK, in llvm codegen any masks are truncated anyway to i1 elements before they can be used by llvm intrinsics. If the mask is generated programmatically, llvm should be free to use best mask type depending on target and usage.
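As a scalar model of that truncation, assuming full-width masks whose lanes are either 0 or -1, the conversion to and from one-bit lanes might be sketched like this (function names are hypothetical and arrays stand in for vectors):

```rust
/// Scalar model of truncating each integer mask lane to a single bit,
/// which is how `<N x i32>` becomes `<N x i1>`: only the lowest bit is kept.
pub fn truncate_to_i1<const N: usize>(mask: [i32; N]) -> [bool; N] {
    mask.map(|lane| lane & 1 == 1)
}

/// Widening back: `true` becomes all ones (-1), `false` becomes 0.
pub fn sign_extend_from_i1<const N: usize>(mask: [bool; N]) -> [i32; N] {
    mask.map(|lane| if lane { -1 } else { 0 })
}
```

For well-formed masks the round trip is lossless, which is why the element type of the wide form carries no extra information.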

@workingjubilee
Member

I wonder how difficult it would be to add support for #[repr(simd)] for [bool; N] in rustc, and whether that would sidestep any issues with the mask element type. The array would have no layout guarantees, and llvm codegen should treat that as <N x i1>.

@jhorstmann I have been working on an RFC and implementation of generic integers directly into rustc and the language so that we can simply use that (but I have been frying a lot of fish, lately).
