[ARM][AArch64] Vector intrinsics do not match hardware behavior for NaN, subnormals #128006

Open
ostannard opened this issue Feb 20, 2025 · 3 comments

@ostannard
Collaborator

The ARM/AArch64 vector intrinsics are defined as having the exact same behaviour as the hardware instructions.

For MVE:

> The behavior of an intrinsic is specified to be equivalent to the MVE instruction it is mapped to in [MVE]. Intrinsics are specified as a mapping between their name, arguments and return values and the MVE instruction and assembler operands which they are equivalent to.
>
> A compiler may make use of the as-if rule from C [C99] (5.1.2.3) to perform optimizations which preserve the instruction semantics.

For AdvSIMD:

> The behavior of an intrinsic is specified to be equivalent to the AArch64 instruction it is mapped to in [Neon]. Intrinsics are specified as a mapping between their name, arguments and return values and the AArch64 instruction and assembler operands which they are equivalent to.
>
> A compiler may make use of the as-if rule from C [C99] (5.1.2.3) to perform optimizations which preserve the instruction semantics.

However, clang performs constant folding that does not always match the hardware's exact behaviour for inputs such as NaNs or subnormals.

For example, the MVE instructions always return the single-precision "default NaN" pattern 0x7fc00000 when the result of an instruction is any NaN, but we constant-fold this code down to return the input NaN pattern 0xffffff42:

#include <arm_mve.h>

uint32x4_t foo() {
  float32x4_t nan = vreinterpretq_f32_u32(vdupq_n_u32(0xffffff42));
  float32x4_t nan_plus_nan = vaddq_f32(nan, nan);
  return vreinterpretq_u32_f32(nan_plus_nan);
}
$ /work/llvm/build/bin/clang --target=arm-none-eabi -march=armv8.1-m.main+mve.fp -S nan.c -o - -O1 -mfloat-abi=hard
...
foo:
        .fnstart
@ %bb.0:                                @ %entry
        vmvn.i32        q0, #0xbd
        bx      lr
...
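
As a point of reference, here is a minimal sketch (not part of the original reproducer; check_default_nan and the expected lane value are added purely for illustration) of what a hardware-faithful result would look like: on an MVE core, VADD.F32 with a NaN operand writes the default NaN into every lane, so checking lane 0 against 0x7fc00000 should succeed on hardware but fail against the constant-folded code above.

#include <arm_mve.h>
#include <stdint.h>

uint32x4_t foo(void); // the reproducer above

// Illustrative check: on real MVE hardware, each lane of foo()'s result
// should hold the default NaN pattern 0x7fc00000; the constant-folded code
// instead returns the propagated input pattern 0xffffff42 in every lane.
int check_default_nan(void) {
  return vgetq_lane_u32(foo(), 0) == 0x7fc00000u;
}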

For subnormals, MVE instructions always flush subnormal input values to zero, but we optimise this code as if that were not the case, so the result gets rounded up to 1.0f:

#include <arm_mve.h>

float32x4_t bar() {
  float32x4_t smallest_subnormal = vreinterpretq_f32_u32(vdupq_n_u32(1));
  float32x4_t round_up = vrndpq_f32(smallest_subnormal);
  return round_up;
}
$ /work/llvm/build/bin/clang --target=arm-none-eabi -march=armv8.1-m.main+mve.fp -S subnormal.c -o - -O1 -mfloat-abi=hard
...
bar:
        .fnstart
@ %bb.0:                                @ %entry
        mov.w   r0, #1065353216
        vdup.32 q0, r0
        bx      lr
...
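
Similarly, a minimal sketch (again added for illustration, not part of the reproducer) of the result a hardware-faithful lowering should give: with subnormal inputs flushed to zero, rounding towards +infinity produces +0.0, so every lane should read 0x00000000 rather than 0x3f800000 (1.0f).

#include <arm_mve.h>
#include <stdint.h>

float32x4_t bar(void); // the reproducer above

// Illustrative check: with MVE's mandatory flush-to-zero on inputs, the
// subnormal 0x00000001 is treated as +0.0, so VRINTP should return +0.0 in
// every lane; the folded code returns 1.0f (0x3f800000) instead.
int check_ftz_round(void) {
  return vgetq_lane_u32(vreinterpretq_u32_f32(bar()), 0) == 0u;
}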

For AArch64 AdvSIMD, the rounding mode and subnormal-flushing behaviour are configurable via the FPCR register, but we still constant-fold these operations at compile time regardless of those settings.
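
To illustrate the AArch64 side (this example, its constants, and the fesetround call are added only to expose the rounding-mode dependence; they are not from the original report): the vaddq_f32 below is sensitive to FPCR.RMode, but clang can fold it at compile time using round-to-nearest, so the run-time rounding mode need not affect the result.

#include <arm_neon.h>
#include <fenv.h>

// Illustrative example: fesetround() updates FPCR.RMode on AArch64. Under
// round-towards-plus-infinity the hardware FADD would round 1.0f + 0x1p-50f
// up to the next representable float, but constant folding evaluates the
// addition with round-to-nearest and the function returns exactly 1.0f.
float rounding_mode_sensitive(void) {
  fesetround(FE_UPWARD);
  float32x4_t a = vdupq_n_f32(1.0f);
  float32x4_t b = vdupq_n_f32(0x1p-50f);
  return vgetq_lane_f32(vaddq_f32(a, b), 0);
}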

I think it would be reasonable to deviate from the ACLE here, and allow these optimisations depending on the floating-point options (e.g. -ffp-model=), but none of these options seem to have any effect on vector intrinsics.

@llvmbot
Member

llvmbot commented Feb 20, 2025

@llvm/issue-subscribers-backend-aarch64

@llvmbot
Member

llvmbot commented Feb 20, 2025

@llvm/issue-subscribers-backend-arm

@efriedma-quic
Collaborator

For vector intrinsics not respecting strictfp, a few people are working on that in a target-independent context, trying to change the way "constrained" fp is represented. Don't have time to dig it up right now.

32-bit NEON/MVE in particular is weird because it doesn't respect the floating-point control word; see #16648/#106909/etc.
