
generic @llvm.ssub.sat optimizes less well than target-specific @llvm.aarch64.neon.sqsub #94463

Open
folkertdev opened this issue Jun 5, 2024 · 10 comments

@folkertdev

The generic @llvm.ssub.sat.v2i64 intrinsic optimizes less well than the target-specific @llvm.aarch64.neon.sqsub.v2i64 intrinsic.

This godbolt shows the issue: https://godbolt.org/z/4qEe3xM9v

We see two functions that use these two saturating subtractions but are otherwise the same. They generate very similar initial LLVM IR, except that the two subtractions are reached in slightly different ways (the target-specific one goes through a wrapper call, which requires some extra allocas).

define void @specific(ptr dead_on_unwind noalias nocapture noundef writable sret([16 x i8]) align 16 dereferenceable(16) %_0, ptr noalias nocapture noundef align 16 dereferenceable(16) %a, ptr noalias nocapture noundef align 16 dereferenceable(16) %b, ptr noalias nocapture noundef align 16 dereferenceable(16) %c) unnamed_addr {
start:
  %0 = alloca [16 x i8], align 16
  %1 = alloca [16 x i8], align 16
  %2 = alloca [16 x i8], align 16
  %3 = alloca [16 x i8], align 16
  %4 = alloca [16 x i8], align 16
  call void @llvm.lifetime.start.p0(i64 16, ptr %4)
  %5 = load <4 x i32>, ptr %b, align 16
  store <4 x i32> %5, ptr %3, align 16
  %6 = load <4 x i32>, ptr %c, align 16
  store <4 x i32> %6, ptr %2, align 16
  call void @core::core_arch::aarch64::neon::generated::vqdmull_high_laneq_s32::h66aa645ca0aefe90(ptr noalias nocapture noundef sret([16 x i8]) align 16 dereferenceable(16) %4, ptr noalias nocapture noundef align 16 dereferenceable(16) %3, ptr noalias nocapture noundef align 16 dereferenceable(16) %2)
  %_4 = load <2 x i64>, ptr %4, align 16
  call void @llvm.lifetime.end.p0(i64 16, ptr %4)
  %7 = load <2 x i64>, ptr %a, align 16
  store <2 x i64> %7, ptr %1, align 16
  store <2 x i64> %_4, ptr %0, align 16
  ; after InlinerPass this call becomes a call to `@llvm.aarch64.neon.sqsub.v2i64` 
  call void @core::core_arch::arm_shared::neon::generated::vqsubq_s64::h1887dd6c0650937c(ptr noalias nocapture noundef sret([16 x i8]) align 16 dereferenceable(16) %_0, ptr noalias nocapture noundef align 16 dereferenceable(16) %1, ptr noalias nocapture noundef align 16 dereferenceable(16) %0)
  ret void
}

define void @generic(ptr dead_on_unwind noalias nocapture noundef writable sret([16 x i8]) align 16 dereferenceable(16) %_0, ptr noalias nocapture noundef align 16 dereferenceable(16) %a, ptr noalias nocapture noundef align 16 dereferenceable(16) %b, ptr noalias nocapture noundef align 16 dereferenceable(16) %c) unnamed_addr {
start:
  %0 = alloca [16 x i8], align 16
  %1 = alloca [16 x i8], align 16
  %2 = alloca [16 x i8], align 16
  call void @llvm.lifetime.start.p0(i64 16, ptr %2)
  %3 = load <4 x i32>, ptr %b, align 16
  store <4 x i32> %3, ptr %1, align 16
  %4 = load <4 x i32>, ptr %c, align 16
  store <4 x i32> %4, ptr %0, align 16
  call void @core::core_arch::aarch64::neon::generated::vqdmull_high_laneq_s32::h66aa645ca0aefe90(ptr noalias nocapture noundef sret([16 x i8]) align 16 dereferenceable(16) %2, ptr noalias nocapture noundef align 16 dereferenceable(16) %1, ptr noalias nocapture noundef align 16 dereferenceable(16) %0)
  %_4 = load <2 x i64>, ptr %2, align 16
  call void @llvm.lifetime.end.p0(i64 16, ptr %2)
  %5 = load <2 x i64>, ptr %a, align 16
  %6 = call <2 x i64> @llvm.ssub.sat.v2i64(<2 x i64> %5, <2 x i64> %_4)
  store <2 x i64> %6, ptr %_0, align 16
  ret void
}

As a user, I expect the generic variant to eventually be lowered to the specific one.

When the intrinsics are used on their own, this is in fact the case: https://godbolt.org/z/ErjETo3bh. Both functions emit the sqsub instruction. This logic appears to be implemented here.
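
For reference, a reduced standalone pair that shows this (my own sketch, not copied from that godbolt link); both of these select to a single sqsub:

declare <2 x i64> @llvm.ssub.sat.v2i64(<2 x i64>, <2 x i64>)
declare <2 x i64> @llvm.aarch64.neon.sqsub.v2i64(<2 x i64>, <2 x i64>)

; generic saturating subtract: selects to a single sqsub on its own
define <2 x i64> @standalone_generic(<2 x i64> %a, <2 x i64> %b) {
  %r = call <2 x i64> @llvm.ssub.sat.v2i64(<2 x i64> %a, <2 x i64> %b)
  ret <2 x i64> %r
}

; target-specific saturating subtract: also selects to a single sqsub
define <2 x i64> @standalone_specific(<2 x i64> %a, <2 x i64> %b) {
  %r = call <2 x i64> @llvm.aarch64.neon.sqsub.v2i64(<2 x i64> %a, <2 x i64> %b)
  ret <2 x i64> %r
}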

But in my example, there is an optimization that @llvm.aarch64.neon.sqsub.v2i64 participates in, but the generic @llvm.ssub.sat.v2i64 does not.

specific:
        ldr     q0, [x1]
        ldr     q1, [x2]
        ldr     q2, [x0]
        sqdmlsl2        v2.2d, v0.4s, v1.s[1]
        str     q2, [x8]
        ret

generic:
        ldr     q0, [x1]
        ldr     q1, [x2]
        sqdmull2        v0.2d, v0.4s, v1.s[1]
        ldr     q1, [x0]
        sqsub   v0.2d, v1.2d, v0.2d
        str     q0, [x8]
        ret

This is unexpected, and seems to imply that many optimizations are missed when using the generic SIMD intrinsics (at least for Neon). My intuition is that the lowering of the generic to the specific instruction occurs too late. It should occur earlier so that it can participate in backend-specific optimizations.

@llvmbot
Member

llvmbot commented Jun 5, 2024

@llvm/issue-subscribers-backend-aarch64

Author: Folkert de Vries (folkertdev)


@folkertdev
Author

This is still a problem; here is a pure LLVM reproduction: https://godbolt.org/z/c8eshf1a1

I believe the issue is here:

defm SQDMULL : SIMDThreeScalarMixedHS<0, 0b11010, "sqdmull",
                                      int_aarch64_neon_sqdmulls_scalar>;
defm SQDMLAL : SIMDThreeScalarMixedTiedHS<0, 0b10010, "sqdmlal">;
defm SQDMLSL : SIMDThreeScalarMixedTiedHS<0, 0b10110, "sqdmlsl">;
def : Pat<(i64 (int_aarch64_neon_sqadd (i64 FPR64:$Rd),
                   (i64 (int_aarch64_neon_sqdmulls_scalar (i32 FPR32:$Rn),
                                                          (i32 FPR32:$Rm))))),
          (SQDMLALi32 FPR64:$Rd, FPR32:$Rn, FPR32:$Rm)>;
def : Pat<(i64 (int_aarch64_neon_sqsub (i64 FPR64:$Rd),
                   (i64 (int_aarch64_neon_sqdmulls_scalar (i32 FPR32:$Rn),
                                                          (i32 FPR32:$Rm))))),
          (SQDMLSLi32 FPR64:$Rd, FPR32:$Rn, FPR32:$Rm)>;

which does not account for the platform-independent intrinsics. However, the actual fix might be to lower to the platform-specific intrinsic earlier?

@SpencerAbson
Contributor

The interesting patterns here are part of SIMDIndexedLongSQDMLXSDTied (notably the use of Accum).

We could lower aarch64_neon_{u,s}q{add,sub} to their equivalent ISD::{U,S}{ADD,SUB}SAT target-independent nodes; this might be worthwhile in its own right (any thoughts, @davemgreen?).

That said, AArch64 only supports vector types with these target-independent nodes [1], so we can't remove the intrinsics entirely, since they are still needed to support scalar arguments. This also wouldn't be GISel-friendly.

A simpler approach would be to PatFrag the intrinsic and the target-independent nodes together, in a similar way to AArch64uabd, though the former suggestion would be a more valuable change.

[1] https://reviews.llvm.org/D69374

@davemgreen
Collaborator

Yeah, we should be able to convert the vector types over:

case Intrinsic::aarch64_neon_sqadd:
  if (Op.getValueType().isVector())
    return DAG.getNode(ISD::SADDSAT, dl, Op.getValueType(), Op.getOperand(1),
                       Op.getOperand(2));
  return SDValue();

@folkertdev
Author

@davemgreen that lowering already happens I believe. It is specifically the fusing with the llvm.aarch64.neon.sqdmull that does not happen when using the generic saturating add/sub.

The snippet you posted actually breaks that fusing for @llvm.aarch64.neon.sqadd.*.
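
Reduced to plain IR, the two shapes look roughly like this (an illustrative sketch of mine, not the exact IR from the godbolt links): the target-specific combination matches the existing sqdmlsl patterns, while the generic one is left as two instructions.

declare <2 x i64> @llvm.aarch64.neon.sqdmull.v2i64(<2 x i32>, <2 x i32>)
declare <2 x i64> @llvm.aarch64.neon.sqsub.v2i64(<2 x i64>, <2 x i64>)
declare <2 x i64> @llvm.ssub.sat.v2i64(<2 x i64>, <2 x i64>)

; sqsub of an sqdmull result: the existing patterns fuse this into sqdmlsl
define <2 x i64> @fuses(<2 x i64> %a, <2 x i32> %b, <2 x i32> %c) {
  %mul = call <2 x i64> @llvm.aarch64.neon.sqdmull.v2i64(<2 x i32> %b, <2 x i32> %c)
  %res = call <2 x i64> @llvm.aarch64.neon.sqsub.v2i64(<2 x i64> %a, <2 x i64> %mul)
  ret <2 x i64> %res
}

; ssub.sat of the same sqdmull result: currently stays as separate sqdmull + sqsub
define <2 x i64> @does_not_fuse(<2 x i64> %a, <2 x i32> %b, <2 x i32> %c) {
  %mul = call <2 x i64> @llvm.aarch64.neon.sqdmull.v2i64(<2 x i32> %b, <2 x i32> %c)
  %res = call <2 x i64> @llvm.ssub.sat.v2i64(<2 x i64> %a, <2 x i64> %mul)
  ret <2 x i64> %res
}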

@folkertdev
Author

I also tried adding this additional rule

def : Pat<(i64 (int_sadd_sat (i64 FPR64:$Rd),
                   (i64 (int_aarch64_neon_sqdmulls_scalar (i32 FPR32:$Rn),
                                                        (i32 FPR32:$Rm))))),
          (SQDMLALi32 FPR64:$Rd, FPR32:$Rn, FPR32:$Rm)>;

but that does not actually trigger from what I can see by running this test

define <2 x i64> @test_vqdmlal_lane_s32_foo(<2 x i64> %a, <2 x i32> %b, <2 x i32> %v) {
; CHECK-LABEL: test_vqdmlal_lane_s32_foo:
; CHECK:       // %bb.0: // %entry
; CHECK-NEXT:    // kill: def $d2 killed $d2 def $q2
; CHECK-NEXT:    sqdmlal v0.2d, v1.2s, v2.s[1]
; CHECK-NEXT:    ret
entry:
  %shuffle = shufflevector <2 x i32> %v, <2 x i32> undef, <2 x i32> <i32 1, i32 1>
  %vqdmlal2.i = tail call <2 x i64> @llvm.aarch64.neon.sqdmull.v2i64(<2 x i32> %b, <2 x i32> %shuffle)
  %vqdmlal4.i = tail call <2 x i64> @llvm.sadd.sat.v2i64(<2 x i64> %a, <2 x i64> %vqdmlal2.i)
  ret <2 x i64> %vqdmlal4.i
}

This still does not fuse:

          1290:  .p2align 2 
          1291:  .type test_vqdmlal_lane_s32_foo,@function 
          1292: test_vqdmlal_lane_s32_foo: // @test_vqdmlal_lane_s32_foo 
          1293:  .cfi_startproc 
          1294: // %bb.0: // %entry 
          1295:  // kill: def $d2 killed $d2 def $q2 
next:1688'0                                         X error: no match found
          1296:  sqdmull v1.2d, v1.2s, v2.s[1] 
next:1688'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
next:1688'1      ?                              possible intended match
          1297:  sqadd v0.2d, v0.2d, v1.2d 
next:1688'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~

@davemgreen
Collaborator

davemgreen commented Apr 30, 2025

> @davemgreen that lowering already happens I believe. It is specifically the fusing with the llvm.aarch64.neon.sqdmull that does not happen when using the generic saturating add/sub.
>
> The snippet you posted actually breaks that fusing for @llvm.aarch64.neon.sqadd.*.

The conversion would need to update the tablegen patterns too, but the inputs should then all reliably be sadd.sat intrinsics, so we would no longer need patterns for both. The idea is to make sure that when we get to selection, llvm.sadd.sat and aarch64.neon.sqadd look the same so we only need one set of patterns (some of the duplicates can be removed, and the existing patterns for sqadd(sqmull) will apply to everything). Doing it for vectors should be simple enough; doing it for scalars looks more difficult.

I would probably recommend getting the vector code working first; the scalar patterns we can try to handle later if needed. They are more difficult if the scalar types are not legal or the operation should happen on an FPR register.
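
To make that concrete, a minimal sketch of the scalar pair in question (my example; the function names are made up):

declare i64 @llvm.aarch64.neon.sqadd.i64(i64, i64)
declare i64 @llvm.sadd.sat.i64(i64, i64)

; scalar saturating add via the NEON intrinsic: this is the form the existing
; SQDMLALi32-style patterns above match, with the value kept in an FPR64 register
define i64 @scalar_specific(i64 %a, i64 %b) {
  %r = call i64 @llvm.aarch64.neon.sqadd.i64(i64 %a, i64 %b)
  ret i64 %r
}

; the generic scalar form: per the earlier comment, the target-independent
; saturating nodes only support vector types on AArch64, so this cannot simply
; replace the intrinsic in the scalar case
define i64 @scalar_generic(i64 %a, i64 %b) {
  %r = call i64 @llvm.sadd.sat.i64(i64 %a, i64 %b)
  ret i64 %r
}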

@folkertdev
Author

> The idea is to make sure that when we get to selection, llvm.sadd.sat and aarch64.neon.sqadd look the same so we only need one set of patterns

Right, I misunderstood that at first, but specifically you want to make aarch64.neon.sqadd look like llvm.sadd.sat, right? (Is there a reason to prefer that over the opposite?)

When I try that, the existing pattern no longer matches, and I'm not sure what pattern to use to make it match again. Also, just generally, I have no idea what I'm doing here (I would just like this to work so Rust's stdarch can be a little simpler), so if this is a quick fix for you, please make it :)

@davemgreen
Collaborator

Going sqadd->sadd.sat has the benefit that it might lead to some optimizations (like constant folding) that would not otherwise be performed on the target-specific intrinsics. (Ideally they could be removed even earlier, in the front-end, but there is a consideration, which we don't support at the moment, around saturating intrinsics setting the QC flags, so that is more contentious than the rest.)
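
For illustration, a small example of that constant-folding difference (my own sketch, assuming the generic folders behave as usual):

declare <2 x i64> @llvm.sadd.sat.v2i64(<2 x i64>, <2 x i64>)
declare <2 x i64> @llvm.aarch64.neon.sqadd.v2i64(<2 x i64>, <2 x i64>)

; the generic intrinsic constant-folds away: lane 0 saturates to 9223372036854775807
; and lane 1 becomes 3, so no sqadd is emitted at all
define <2 x i64> @folds() {
  %r = call <2 x i64> @llvm.sadd.sat.v2i64(<2 x i64> <i64 9223372036854775807, i64 1>,
                                           <2 x i64> <i64 1, i64 2>)
  ret <2 x i64> %r
}

; the target-specific intrinsic is opaque to the generic folders, so the call survives
define <2 x i64> @does_not_fold() {
  %r = call <2 x i64> @llvm.aarch64.neon.sqadd.v2i64(<2 x i64> <i64 9223372036854775807, i64 1>,
                                                     <2 x i64> <i64 1, i64 2>)
  ret <2 x i64> %r
}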

Can you take a look at SpencerAbson@330fcff and update it, and hopefully add some basic GISel support so they don't fall back? It sounds like you are already on the same path, but that will hopefully get you further along it and seems to have some useful tests in it.

@folkertdev
Author

> Going sqadd->sadd.sat has the benefit that it might lead to some optimizations (like constant folding) that would not otherwise be performed on the target-specific intrinsics.

Neat!

> Can you take a look at SpencerAbson/llvm-project@330fcff and update it

Okay yes that is very helpful.

> add some basic GISel support so they don't fall back?

I'm not sure exactly what you mean here. I think you want to prevent llvm.sadd.sat from being rewritten to aarch64.neon.sqadd?

However, the string sqadd does not occur in llvm/lib/Target/AArch64/GISel/ at all, so I'm not sure what that would look like.

SADDSAT occurs just once, here:

getActionDefinitionsBuilder({G_UADDSAT, G_SADDSAT, G_USUBSAT, G_SSUBSAT})
.legalFor({v8s8, v16s8, v4s16, v8s16, v2s32, v4s32, v2s64})
.legalFor(HasSVE, {nxv16s8, nxv8s16, nxv4s32, nxv2s64})
.clampNumElements(0, v8s8, v16s8)
.clampNumElements(0, v4s16, v8s16)
.clampNumElements(0, v2s32, v4s32)
.clampMaxNumElements(0, s64, 2)
.scalarizeIf(scalarOrEltWiderThan(0, 64), 0)
.moreElementsToNextPow2(0)
.lower();
