[AArch64] Improve lowering of scalar abs(sub(a, b)). #151180


Open · wants to merge 6 commits into main
Conversation

@rj-jesus (Contributor) commented Jul 29, 2025

This patch avoids a comparison against zero when lowering abs(sub(a, b)) patterns, instead reusing the condition codes generated by a subs of the operands directly.

For example, currently:

  sxtb w8, w0
  sub w8, w8, w1, sxtb
  cmp w8, #0
  cneg w0, w8, mi

becomes:

  sxtb w8, w0
  subs w8, w8, w1, sxtb
  cneg w0, w8, mi

Together with #151177, this should handle the remaining patterns in #118413 (although I believe some of the tgt cases in the issue will now generate slightly worse code than the src cases, presumably because the former are not combined into abd[us] nodes; these can be taken on separately).
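
For reference, the kind of IR this lowering targets looks like the following (a minimal sketch in the spirit of the abds.ll tests; the function name and the simplified i8 form are illustrative, not taken from the test files):

```llvm
declare i8 @llvm.abs.i8(i8, i1)

define i8 @abd_sub_i8(i8 %a, i8 %b) {
  %sub = sub i8 %a, %b
  ; is_int_min_poison = false: abs of the minimum value wraps rather
  ; than producing poison.
  %abs = call i8 @llvm.abs.i8(i8 %sub, i1 false)
  ret i8 %abs
}
```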

@llvmbot (Member) commented Jul 29, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Ricardo Jesus (rj-jesus)

Changes

This patch avoids a comparison against zero when lowering abs(sub(a, b)) patterns, instead reusing the condition codes generated by a subs of the operands directly.

For example, currently:

  sxtb w8, w0
  sub w8, w8, w1, sxtb
  cmp w8, #0
  cneg w0, w8, pl

becomes:

  sxtb w8, w0
  sxtb w9, w1
  subs w8, w9, w8
  cneg w0, w8, gt

Whilst this doesn't decrease the number of instructions, the new version should expose more ILP and use "cheaper" instructions with lower latency and/or higher throughput.

This patch also includes a somewhat orthogonal change in performNegCSelCombine, added here to avoid a code generation regression in CodeGen/AArch64/abd[su]-neg.ll caused by the combine negating the operands of the csel, which led to an extra sub instruction. If preferable, I can open a separate PR for this.

Together with #151177, this should handle the remaining patterns in #118413 (although I believe some of the tgt cases in the issue will now generate slightly worse code than the src cases, presumably because the former are not combined into abd[us] nodes; these can be taken on separately).


Full diff: https://github.com/llvm/llvm-project/pull/151180.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+24-9)
  • (modified) llvm/test/CodeGen/AArch64/abds-neg.ll (+15-15)
  • (modified) llvm/test/CodeGen/AArch64/abds.ll (+37-37)
  • (modified) llvm/test/CodeGen/AArch64/abdu-neg.ll (+15-15)
  • (modified) llvm/test/CodeGen/AArch64/abdu.ll (+37-37)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 7b49754ee7e1f..8b515fb649ee7 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -7118,12 +7118,21 @@ SDValue AArch64TargetLowering::LowerABS(SDValue Op, SelectionDAG &DAG) const {
     return LowerToPredicatedOp(Op, DAG, AArch64ISD::ABS_MERGE_PASSTHRU);
 
   SDLoc DL(Op);
-  SDValue Neg = DAG.getNode(ISD::SUB, DL, VT, DAG.getConstant(0, DL, VT),
-                            Op.getOperand(0));
-  // Generate SUBS & CSEL.
-  SDValue Cmp = DAG.getNode(AArch64ISD::SUBS, DL, DAG.getVTList(VT, FlagsVT),
-                            Op.getOperand(0), DAG.getConstant(0, DL, VT));
-  return DAG.getNode(AArch64ISD::CSEL, DL, VT, Op.getOperand(0), Neg,
+  SDValue Val = Op.getOperand(0);
+  SDValue Neg = DAG.getNegative(Val, DL, VT);
+  SDValue Cmp;
+
+  // For abs(sub(lhs, rhs)), we can compare lhs and rhs directly. This allows
+  // reusing the subs operation for the calculation and comparison.
+  if (Val.getOpcode() == ISD::SUB)
+    Cmp = DAG.getNode(AArch64ISD::SUBS, DL, DAG.getVTList(VT, FlagsVT),
+                      Val.getOperand(0), Val.getOperand(1));
+  else
+    // Otherwise, compare with zero.
+    Cmp = DAG.getNode(AArch64ISD::SUBS, DL, DAG.getVTList(VT, FlagsVT), Val,
+                      DAG.getConstant(0, DL, VT));
+
+  return DAG.getNode(AArch64ISD::CSEL, DL, VT, Val, Neg,
                      DAG.getConstant(AArch64CC::PL, DL, MVT::i32),
                      Cmp.getValue(1));
 }
@@ -20835,11 +20844,17 @@ static SDValue performNegCSelCombine(SDNode *N, SelectionDAG &DAG) {
   if (!isNegatedInteger(N0) && !isNegatedInteger(N1))
     return SDValue();
 
-  SDValue N0N = getNegatedInteger(N0, DAG);
-  SDValue N1N = getNegatedInteger(N1, DAG);
-
   SDLoc DL(N);
   EVT VT = CSel.getValueType();
+
+  // If the operands are negations of each other, reverse them.
+  if ((isNegatedInteger(N0) && N0.getOperand(1) == N1) ||
+      (isNegatedInteger(N1) && N1.getOperand(1) == N0))
+    return DAG.getNode(AArch64ISD::CSEL, DL, VT, N1, N0, CSel.getOperand(2),
+                       CSel.getOperand(3));
+
+  SDValue N0N = getNegatedInteger(N0, DAG);
+  SDValue N1N = getNegatedInteger(N1, DAG);
   return DAG.getNode(AArch64ISD::CSEL, DL, VT, N0N, N1N, CSel.getOperand(2),
                      CSel.getOperand(3));
 }
diff --git a/llvm/test/CodeGen/AArch64/abds-neg.ll b/llvm/test/CodeGen/AArch64/abds-neg.ll
index 432ffc30eec5e..a7c27894e6125 100644
--- a/llvm/test/CodeGen/AArch64/abds-neg.ll
+++ b/llvm/test/CodeGen/AArch64/abds-neg.ll
@@ -8,9 +8,9 @@
 define i8 @abd_ext_i8(i8 %a, i8 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxtb w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxtb
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxtb w8, w1
+; CHECK-NEXT:    sxtb w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, pl
 ; CHECK-NEXT:    ret
   %aext = sext i8 %a to i64
@@ -25,9 +25,9 @@ define i8 @abd_ext_i8(i8 %a, i8 %b) nounwind {
 define i8 @abd_ext_i8_i16(i8 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i8_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxtb w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxth w8, w1
+; CHECK-NEXT:    sxtb w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, pl
 ; CHECK-NEXT:    ret
   %aext = sext i8 %a to i64
@@ -42,9 +42,9 @@ define i8 @abd_ext_i8_i16(i8 %a, i16 %b) nounwind {
 define i8 @abd_ext_i8_undef(i8 %a, i8 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i8_undef:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxtb w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxtb
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxtb w8, w1
+; CHECK-NEXT:    sxtb w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, pl
 ; CHECK-NEXT:    ret
   %aext = sext i8 %a to i64
@@ -59,9 +59,9 @@ define i8 @abd_ext_i8_undef(i8 %a, i8 %b) nounwind {
 define i16 @abd_ext_i16(i16 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxth w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxth w8, w1
+; CHECK-NEXT:    sxth w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, pl
 ; CHECK-NEXT:    ret
   %aext = sext i16 %a to i64
@@ -94,9 +94,9 @@ define i16 @abd_ext_i16_i32(i16 %a, i32 %b) nounwind {
 define i16 @abd_ext_i16_undef(i16 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i16_undef:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxth w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxth w8, w1
+; CHECK-NEXT:    sxth w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, pl
 ; CHECK-NEXT:    ret
   %aext = sext i16 %a to i64
diff --git a/llvm/test/CodeGen/AArch64/abds.ll b/llvm/test/CodeGen/AArch64/abds.ll
index ed1e6077948ee..3e1d8522aa9d8 100644
--- a/llvm/test/CodeGen/AArch64/abds.ll
+++ b/llvm/test/CodeGen/AArch64/abds.ll
@@ -8,9 +8,9 @@
 define i8 @abd_ext_i8(i8 %a, i8 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxtb w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxtb
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxtb w8, w1
+; CHECK-NEXT:    sxtb w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %aext = sext i8 %a to i64
@@ -24,9 +24,9 @@ define i8 @abd_ext_i8(i8 %a, i8 %b) nounwind {
 define i8 @abd_ext_i8_i16(i8 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i8_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxtb w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxth w8, w1
+; CHECK-NEXT:    sxtb w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %aext = sext i8 %a to i64
@@ -40,9 +40,9 @@ define i8 @abd_ext_i8_i16(i8 %a, i16 %b) nounwind {
 define i8 @abd_ext_i8_undef(i8 %a, i8 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i8_undef:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxtb w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxtb
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxtb w8, w1
+; CHECK-NEXT:    sxtb w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %aext = sext i8 %a to i64
@@ -56,9 +56,9 @@ define i8 @abd_ext_i8_undef(i8 %a, i8 %b) nounwind {
 define i16 @abd_ext_i16(i16 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxth w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxth w8, w1
+; CHECK-NEXT:    sxth w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %aext = sext i16 %a to i64
@@ -88,9 +88,9 @@ define i16 @abd_ext_i16_i32(i16 %a, i32 %b) nounwind {
 define i16 @abd_ext_i16_undef(i16 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i16_undef:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxth w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxth w8, w1
+; CHECK-NEXT:    sxth w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %aext = sext i16 %a to i64
@@ -220,9 +220,9 @@ define i128 @abd_ext_i128_undef(i128 %a, i128 %b) nounwind {
 define i8 @abd_minmax_i8(i8 %a, i8 %b) nounwind {
 ; CHECK-LABEL: abd_minmax_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxtb w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxtb
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxtb w8, w1
+; CHECK-NEXT:    sxtb w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %min = call i8 @llvm.smin.i8(i8 %a, i8 %b)
@@ -234,9 +234,9 @@ define i8 @abd_minmax_i8(i8 %a, i8 %b) nounwind {
 define i16 @abd_minmax_i16(i16 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_minmax_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxth w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxth w8, w1
+; CHECK-NEXT:    sxth w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %min = call i16 @llvm.smin.i16(i16 %a, i16 %b)
@@ -294,9 +294,9 @@ define i128 @abd_minmax_i128(i128 %a, i128 %b) nounwind {
 define i8 @abd_cmp_i8(i8 %a, i8 %b) nounwind {
 ; CHECK-LABEL: abd_cmp_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxtb w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxtb
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxtb w8, w1
+; CHECK-NEXT:    sxtb w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %cmp = icmp sgt i8 %a, %b
@@ -309,9 +309,9 @@ define i8 @abd_cmp_i8(i8 %a, i8 %b) nounwind {
 define i16 @abd_cmp_i16(i16 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_cmp_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxth w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxth w8, w1
+; CHECK-NEXT:    sxth w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %cmp = icmp sge i16 %a, %b
@@ -517,11 +517,11 @@ define i64 @vector_legalized(i16 %a, i16 %b) {
 ; CHECK-LABEL: vector_legalized:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    movi v0.2d, #0000000000000000
-; CHECK-NEXT:    sxth w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxth
-; CHECK-NEXT:    addp d0, v0.2d
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxth w8, w1
+; CHECK-NEXT:    sxth w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w8, w8, mi
+; CHECK-NEXT:    addp d0, v0.2d
 ; CHECK-NEXT:    fmov x9, d0
 ; CHECK-NEXT:    add x0, x9, x8
 ; CHECK-NEXT:    ret
@@ -542,9 +542,9 @@ define i64 @vector_legalized(i16 %a, i16 %b) {
 define i8 @abd_select_i8(i8 %a, i8 %b) nounwind {
 ; CHECK-LABEL: abd_select_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxtb w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxtb
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxtb w8, w1
+; CHECK-NEXT:    sxtb w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %cmp = icmp slt i8 %a, %b
@@ -557,9 +557,9 @@ define i8 @abd_select_i8(i8 %a, i8 %b) nounwind {
 define i16 @abd_select_i16(i16 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_select_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sxth w8, w0
-; CHECK-NEXT:    sub w8, w8, w1, sxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    sxth w8, w1
+; CHECK-NEXT:    sxth w9, w0
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %cmp = icmp sle i16 %a, %b
diff --git a/llvm/test/CodeGen/AArch64/abdu-neg.ll b/llvm/test/CodeGen/AArch64/abdu-neg.ll
index 8fb106e92866e..a83323650330b 100644
--- a/llvm/test/CodeGen/AArch64/abdu-neg.ll
+++ b/llvm/test/CodeGen/AArch64/abdu-neg.ll
@@ -8,9 +8,9 @@
 define i8 @abd_ext_i8(i8 %a, i8 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xff
-; CHECK-NEXT:    sub w8, w8, w1, uxtb
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xff
+; CHECK-NEXT:    and w9, w0, #0xff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, pl
 ; CHECK-NEXT:    ret
   %aext = zext i8 %a to i64
@@ -25,9 +25,9 @@ define i8 @abd_ext_i8(i8 %a, i8 %b) nounwind {
 define i8 @abd_ext_i8_i16(i8 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i8_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xff
-; CHECK-NEXT:    sub w8, w8, w1, uxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xffff
+; CHECK-NEXT:    and w9, w0, #0xff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, pl
 ; CHECK-NEXT:    ret
   %aext = zext i8 %a to i64
@@ -42,9 +42,9 @@ define i8 @abd_ext_i8_i16(i8 %a, i16 %b) nounwind {
 define i8 @abd_ext_i8_undef(i8 %a, i8 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i8_undef:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xff
-; CHECK-NEXT:    sub w8, w8, w1, uxtb
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xff
+; CHECK-NEXT:    and w9, w0, #0xff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, pl
 ; CHECK-NEXT:    ret
   %aext = zext i8 %a to i64
@@ -59,9 +59,9 @@ define i8 @abd_ext_i8_undef(i8 %a, i8 %b) nounwind {
 define i16 @abd_ext_i16(i16 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xffff
-; CHECK-NEXT:    sub w8, w8, w1, uxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xffff
+; CHECK-NEXT:    and w9, w0, #0xffff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, pl
 ; CHECK-NEXT:    ret
   %aext = zext i16 %a to i64
@@ -94,9 +94,9 @@ define i16 @abd_ext_i16_i32(i16 %a, i32 %b) nounwind {
 define i16 @abd_ext_i16_undef(i16 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i16_undef:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xffff
-; CHECK-NEXT:    sub w8, w8, w1, uxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xffff
+; CHECK-NEXT:    and w9, w0, #0xffff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, pl
 ; CHECK-NEXT:    ret
   %aext = zext i16 %a to i64
diff --git a/llvm/test/CodeGen/AArch64/abdu.ll b/llvm/test/CodeGen/AArch64/abdu.ll
index 4585de96c848f..a058121236a5b 100644
--- a/llvm/test/CodeGen/AArch64/abdu.ll
+++ b/llvm/test/CodeGen/AArch64/abdu.ll
@@ -8,9 +8,9 @@
 define i8 @abd_ext_i8(i8 %a, i8 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xff
-; CHECK-NEXT:    sub w8, w8, w1, uxtb
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xff
+; CHECK-NEXT:    and w9, w0, #0xff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %aext = zext i8 %a to i64
@@ -24,9 +24,9 @@ define i8 @abd_ext_i8(i8 %a, i8 %b) nounwind {
 define i8 @abd_ext_i8_i16(i8 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i8_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xff
-; CHECK-NEXT:    sub w8, w8, w1, uxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xffff
+; CHECK-NEXT:    and w9, w0, #0xff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %aext = zext i8 %a to i64
@@ -40,9 +40,9 @@ define i8 @abd_ext_i8_i16(i8 %a, i16 %b) nounwind {
 define i8 @abd_ext_i8_undef(i8 %a, i8 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i8_undef:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xff
-; CHECK-NEXT:    sub w8, w8, w1, uxtb
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xff
+; CHECK-NEXT:    and w9, w0, #0xff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %aext = zext i8 %a to i64
@@ -56,9 +56,9 @@ define i8 @abd_ext_i8_undef(i8 %a, i8 %b) nounwind {
 define i16 @abd_ext_i16(i16 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xffff
-; CHECK-NEXT:    sub w8, w8, w1, uxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xffff
+; CHECK-NEXT:    and w9, w0, #0xffff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %aext = zext i16 %a to i64
@@ -88,9 +88,9 @@ define i16 @abd_ext_i16_i32(i16 %a, i32 %b) nounwind {
 define i16 @abd_ext_i16_undef(i16 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_ext_i16_undef:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xffff
-; CHECK-NEXT:    sub w8, w8, w1, uxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xffff
+; CHECK-NEXT:    and w9, w0, #0xffff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %aext = zext i16 %a to i64
@@ -224,9 +224,9 @@ define i128 @abd_ext_i128_undef(i128 %a, i128 %b) nounwind {
 define i8 @abd_minmax_i8(i8 %a, i8 %b) nounwind {
 ; CHECK-LABEL: abd_minmax_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xff
-; CHECK-NEXT:    sub w8, w8, w1, uxtb
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xff
+; CHECK-NEXT:    and w9, w0, #0xff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %min = call i8 @llvm.umin.i8(i8 %a, i8 %b)
@@ -238,9 +238,9 @@ define i8 @abd_minmax_i8(i8 %a, i8 %b) nounwind {
 define i16 @abd_minmax_i16(i16 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_minmax_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xffff
-; CHECK-NEXT:    sub w8, w8, w1, uxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xffff
+; CHECK-NEXT:    and w9, w0, #0xffff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %min = call i16 @llvm.umin.i16(i16 %a, i16 %b)
@@ -300,9 +300,9 @@ define i128 @abd_minmax_i128(i128 %a, i128 %b) nounwind {
 define i8 @abd_cmp_i8(i8 %a, i8 %b) nounwind {
 ; CHECK-LABEL: abd_cmp_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xff
-; CHECK-NEXT:    sub w8, w8, w1, uxtb
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xff
+; CHECK-NEXT:    and w9, w0, #0xff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %cmp = icmp ugt i8 %a, %b
@@ -315,9 +315,9 @@ define i8 @abd_cmp_i8(i8 %a, i8 %b) nounwind {
 define i16 @abd_cmp_i16(i16 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_cmp_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xffff
-; CHECK-NEXT:    sub w8, w8, w1, uxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xffff
+; CHECK-NEXT:    and w9, w0, #0xffff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %cmp = icmp uge i16 %a, %b
@@ -382,11 +382,11 @@ define i64 @vector_legalized(i16 %a, i16 %b) {
 ; CHECK-LABEL: vector_legalized:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    movi v0.2d, #0000000000000000
-; CHECK-NEXT:    and w8, w0, #0xffff
-; CHECK-NEXT:    sub w8, w8, w1, uxth
-; CHECK-NEXT:    cmp w8, #0
-; CHECK-NEXT:    addp d0, v0.2d
+; CHECK-NEXT:    and w8, w1, #0xffff
+; CHECK-NEXT:    and w9, w0, #0xffff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w8, w8, mi
+; CHECK-NEXT:    addp d0, v0.2d
 ; CHECK-NEXT:    fmov x9, d0
 ; CHECK-NEXT:    add x0, x9, x8
 ; CHECK-NEXT:    ret
@@ -407,9 +407,9 @@ define i64 @vector_legalized(i16 %a, i16 %b) {
 define i8 @abd_select_i8(i8 %a, i8 %b) nounwind {
 ; CHECK-LABEL: abd_select_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xff
-; CHECK-NEXT:    sub w8, w8, w1, uxtb
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xff
+; CHECK-NEXT:    and w9, w0, #0xff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %cmp = icmp ult i8 %a, %b
@@ -422,9 +422,9 @@ define i8 @abd_select_i8(i8 %a, i8 %b) nounwind {
 define i16 @abd_select_i16(i16 %a, i16 %b) nounwind {
 ; CHECK-LABEL: abd_select_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    and w8, w0, #0xffff
-; CHECK-NEXT:    sub w8, w8, w1, uxth
-; CHECK-NEXT:    cmp w8, #0
+; CHECK-NEXT:    and w8, w1, #0xffff
+; CHECK-NEXT:    and w9, w0, #0xffff
+; CHECK-NEXT:    subs w8, w9, w8
 ; CHECK-NEXT:    cneg w0, w8, mi
 ; CHECK-NEXT:    ret
   %cmp = icmp ule i16 %a, %b

@efriedma-quic (Collaborator)

Is there some reason to avoid generating `subs w8, w8, w1, sxtb`?

@rj-jesus (Contributor, Author)

> Is there some reason to avoid generating `subs w8, w8, w1, sxtb`?

That would be another option, but I believe that on Neoverse V2 and a few other cores, extended ADDS/SUBS have higher latency (2 vs 1) and lower throughput (2 vs 3) than the non-extended versions, and by doing the extension separately we may expose more ILP.
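
To illustrate the ILP point, here are the two alternatives side by side (a hand-written sketch; the latencies in the comments are the illustrative 2-vs-1 figures mentioned above, not measured numbers):

```
// Extended form: the subs depends on the sxtb, and the extended subs
// itself is the slower flavour (latency 2 on the cores mentioned above).
sxtb w8, w0
subs w8, w8, w1, sxtb
cneg w0, w8, mi

// Split form: the two extensions are independent and can issue in
// parallel, and the plain subs is the cheaper flavour (latency 1).
sxtb w8, w0
sxtb w9, w1
subs w8, w8, w9
cneg w0, w8, mi
```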

@rj-jesus (Contributor, Author)

I did some digging and, with this patch, we should generate the combined extended forms when compiling for code size. For example, given:

#include <stdint.h>

typedef uint8_t u8;
typedef int8_t i8;

u8 src_u8(u8 a, u8 b) {
  return (a > b) ? a - b : b - a;
}

i8 src_i8(i8 a, i8 b) {
  return (a > b) ? a - b : b - a;
}

With -O2:

src_u8:
	and	w8, w1, #0xff
	and	w9, w0, #0xff
	subs	w8, w9, w8
	cneg	w0, w8, mi
	ret

src_i8:
	sxtb	w8, w1
	sxtb	w9, w0
	subs	w8, w9, w8
	cneg	w0, w8, mi
	ret

With -Os:

src_u8:
	and	w8, w0, #0xff
	subs	w8, w8, w1, uxtb
	cneg	w0, w8, mi
	ret

src_i8:
	sxtb	w8, w0
	subs	w8, w8, w1, sxtb
	cneg	w0, w8, mi
	ret

Perhaps the restrictions in isWorthFoldingALU could be loosened, possibly allowing multiple SUB/SUBS users if those could be CSE'd? As mentioned previously, despite the code-size reduction, I'm not sure this would currently be a win for performance. Either way, this patch still seems useful, as it allows the extended SUBS patterns above to be matched.

rj-jesus added 4 commits July 30, 2025 10:03
This patch avoids a comparison against zero when lowering abs(sub(a, b))
patterns, instead reusing the condition codes generated by a subs of the
operands directly.

For example, currently:
```
  sxtb w8, w0
  sub w8, w8, w1, sxtb
  cmp w8, #0
  cneg w0, w8, pl
```
becomes:
```
  sxtb w8, w0
  sxtb w9, w1
  subs w8, w9, w8
  cneg w0, w8, gt
```

Whilst this doesn't decrease the number of instructions utilised, the
new version exposes more ILP and uses "cheaper" instructions,
typically having lower latency and/or higher throughput.

This patch also includes a somewhat orthogonal change in
performNegCSelCombine, which I included to avoid a code generation
regression in CodeGen/AArch64/abd[su]-neg.ll due to the combine
negating the operands of the csel, leading to an extra sub instruction.
If preferable, I can open a separate PR for this.
Instead of reversing the operands when they are negations of each other,
peek through already existing negations. This has the same effect when
the operands are negations of each other but should be more generally
useful.
@rj-jesus force-pushed the rjj/aarch64-improve-abs-sub-lowering branch from ce097ff to cba370b on July 30, 2025 at 17:33
@rj-jesus (Contributor, Author)

I've rebased this to update tests that had changed with #151177.

I've also simplified the changes in performNegCSelCombine: peeking through existing negations when negating the operands of a CSEL achieves the same effect as reversing the CSEL's operands when they are negations of each other, but should be more generally useful (I'm happy to revert these changes if that would be preferable, though).
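
A minimal sketch of the "peek through negations" idea (a paraphrase of the change as described, not the patch's actual code; isNegatedInteger and getNegatedInteger are the existing helpers in AArch64ISelLowering.cpp):

```cpp
// Sketch: when negating a value that is already a negation (0 - X),
// return X directly instead of materialising a new SUB node.
static SDValue getNegatedInteger(SDValue Op, SelectionDAG &DAG) {
  if (isNegatedInteger(Op))
    return Op.getOperand(1); // -(0 - X) == X
  SDLoc DL(Op);
  EVT VT = Op.getValueType();
  return DAG.getNode(ISD::SUB, DL, VT, DAG.getConstant(0, DL, VT), Op);
}
```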

rj-jesus added 2 commits July 31, 2025 08:50
By combining the original SUB node into the SUBS created when folding
subs(sub(a,b), 0) -> subs(a,b), we no longer create "swapped" SUB
nodes, such as sub(b,a), after performNegCSelCombine. Therefore, the
changes previously made to avoid code generation regressions in some
cases are no longer required.