[ARM][AArch64] Vector intrinsics do not match hardware behavior for NaN, subnormals #128006

Open
ostannard opened this issue Feb 20, 2025 · 3 comments

@ostannard
Collaborator

The ARM/AArch64 vector intrinsics are defined as having the exact same behaviour as the hardware instructions.

For MVE:

> The behavior of an intrinsic is specified to be equivalent to the MVE instruction it is mapped to in [MVE]. Intrinsics are specified as a mapping between their name, arguments and return values and the MVE instruction and assembler operands which they are equivalent to.
>
> A compiler may make use of the as-if rule from C [C99] (5.1.2.3) to perform optimizations which preserve the instruction semantics.

For AdvSIMD:

> The behavior of an intrinsic is specified to be equivalent to the AArch64 instruction it is mapped to in [Neon]. Intrinsics are specified as a mapping between their name, arguments and return values and the AArch64 instruction and assembler operands which they are equivalent to.
>
> A compiler may make use of the as-if rule from C [C99] (5.1.2.3) to perform optimizations which preserve the instruction semantics.

However, clang performs constant folding that does not always match the hardware's exact behaviour for inputs such as NaNs or subnormals.

For example, the MVE instructions always return the single-precision "default NaN" pattern 0x7fc00000 when the result of an instruction is any NaN, but we constant-fold this code down to return the input NaN pattern 0xffffff42:

#include <arm_mve.h>

uint32x4_t foo() {
  float32x4_t nan = vreinterpretq_f32_u32(vdupq_n_u32(0xffffff42));
  float32x4_t nan_plus_nan = vaddq_f32(nan, nan);
  return vreinterpretq_u32_f32(nan_plus_nan);
}
$ /work/llvm/build/bin/clang --target=arm-none-eabi -march=armv8.1-m.main+mve.fp -S nan.c -o - -O1 -mfloat-abi=hard
...
foo:
        .fnstart
@ %bb.0:                                @ %entry
        vmvn.i32        q0, #0xbd
        bx      lr
...
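
As a point of reference, here is a minimal sketch (not part of the original reproducer; check_default_nan and the expected lane value are added purely for illustration) of what a hardware-faithful result would look like: on an MVE core, VADD.F32 with a NaN operand writes the default NaN into every lane, so checking lane 0 against 0x7fc00000 should succeed on hardware but fail against the constant-folded code above.

#include <arm_mve.h>
#include <stdint.h>

uint32x4_t foo(void); // the reproducer above

// Illustrative check: on real MVE hardware, each lane of foo()'s result
// should hold the default NaN pattern 0x7fc00000; the constant-folded code
// instead returns the propagated input pattern 0xffffff42 in every lane.
int check_default_nan(void) {
  return vgetq_lane_u32(foo(), 0) == 0x7fc00000u;
}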

For subnormals, MVE instructions always flush subnormal input values to zero, but we optimise this code as if that were not the case, so the result gets rounded up to 1.0f:

#include <arm_mve.h>

float32x4_t bar() {
  float32x4_t smallest_subnormal = vreinterpretq_f32_u32(vdupq_n_u32(1));
  float32x4_t round_up = vrndpq_f32(smallest_subnormal);
  return round_up;
}
$ /work/llvm/build/bin/clang --target=arm-none-eabi -march=armv8.1-m.main+mve.fp -S subnormal.c -o - -O1 -mfloat-abi=hard
...
bar:
        .fnstart
@ %bb.0:                                @ %entry
        mov.w   r0, #1065353216
        vdup.32 q0, r0
        bx      lr
...
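
Similarly, a minimal sketch (again added for illustration, not part of the reproducer) of the result a hardware-faithful lowering should give: with subnormal inputs flushed to zero, rounding towards +infinity produces +0.0, so every lane should read 0x00000000 rather than 0x3f800000 (1.0f).

#include <arm_mve.h>
#include <stdint.h>

float32x4_t bar(void); // the reproducer above

// Illustrative check: with MVE's mandatory flush-to-zero on inputs, the
// subnormal 0x00000001 is treated as +0.0, so VRINTP should return +0.0 in
// every lane; the folded code returns 1.0f (0x3f800000) instead.
int check_ftz_round(void) {
  return vgetq_lane_u32(vreinterpretq_u32_f32(bar()), 0) == 0u;
}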

For AArch64 AdvSIMD, the rounding mode and subnormal-flushing behaviour are configurable via the FPCR register, but we still constant-fold these operations at compile time regardless of those settings.
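
To illustrate the AArch64 side (this example, its constants, and the fesetround call are added only to expose the rounding-mode dependence; they are not from the original report): the vaddq_f32 below is sensitive to FPCR.RMode, but clang can fold it at compile time using round-to-nearest, so the run-time rounding mode need not affect the result.

#include <arm_neon.h>
#include <fenv.h>

// Illustrative example: fesetround() updates FPCR.RMode on AArch64. Under
// round-towards-plus-infinity the hardware FADD would round 1.0f + 0x1p-50f
// up to the next representable float, but constant folding evaluates the
// addition with round-to-nearest and the function returns exactly 1.0f.
float rounding_mode_sensitive(void) {
  fesetround(FE_UPWARD);
  float32x4_t a = vdupq_n_f32(1.0f);
  float32x4_t b = vdupq_n_f32(0x1p-50f);
  return vgetq_lane_f32(vaddq_f32(a, b), 0);
}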

I think it would be reasonable to deviate from the ACLE here, and allow these optimisations depending on the floating-point options (e.g. -ffp-model=), but none of these options seem to have any effect on vector intrinsics.

@llvmbot
Member

llvmbot commented Feb 20, 2025

@llvm/issue-subscribers-backend-aarch64

@llvmbot
Member

llvmbot commented Feb 20, 2025

@llvm/issue-subscribers-backend-arm

@efriedma-quic
Collaborator

For vector intrinsics not respecting strictfp, a few people are working on that in a target-independent context, trying to change the way "constrained" fp is represented. Don't have time to dig it up right now.

32-bit NEON/MVE in particular is weird because it doesn't respect the floating-point control word; see #16648/#106909/etc.
