
Conversation

@vekterli (Contributor)

Resolves issue #2699.

This changes the fallback logic for inputs with fewer elements than a full vector's lane count to use partial vector loads with implicit zeroing (via LoadN) instead of scalar loads and conversions.

I think these should be fairly conservative changes, as the main vector loops remain untouched. An alternative approach would be to generalize the final boundary-handling code for < N remaining elements to use LoadN instead of FirstN + "adjusted" LoadU, which would remove the need for this initial special case in the first place. Since I don't have access to enough exotic testing hardware to know how (or if) this would affect performance on non-x64/arm64 systems, I skipped doing that. 🙂

It's possible that the use of ReorderWidenMulAccumulate should be replaced with explicit Promote(Upper|Lower)To and multiplication since the input sums will always be zero, but I am not sure how much it matters in practice. This just mirrors what's in the main loops.
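As a concrete illustration, here is roughly what the LoadN-based prologue looks like (a minimal sketch only, assuming Highway's static-dispatch usage; the ShortDotBF16 name and exact structure are my own, not the PR's actual code):

#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

// Dot product for inputs shorter than one full vector (num < Lanes(dbf)).
float ShortDotBF16(const hwy::bfloat16_t* HWY_RESTRICT pa,
                   const hwy::bfloat16_t* HWY_RESTRICT pb, size_t num) {
  const hn::ScalableTag<hwy::bfloat16_t> dbf;
  const hn::Repartition<float, decltype(dbf)> df32;
  // LoadN reads exactly `num` lanes and implicitly zeroes the rest, so the
  // padding lanes contribute nothing to the sum and no scalar __bf16 ->
  // float conversions (and thus no __extendbfsf2 calls) are generated.
  const auto a = hn::LoadN(dbf, pa, num);
  const auto b = hn::LoadN(dbf, pb, num);
  auto sum1 = hn::Zero(df32);
  const auto sum0 =
      hn::ReorderWidenMulAccumulate(df32, a, b, hn::Zero(df32), sum1);
  return hn::ReduceSum(df32, hn::Add(sum0, sum1));
}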

Rationale/background:

Falling back to scalar conversion of __bf16 to float elements will, on GCC 13+, silently generate a call to the __extendbfsf2 library function per conversion, rather than just zero-extending to 32 bits and shifting left by 16. This both bloats the generated code and has a substantial runtime cost; on x64 it takes more time to process 8 BF16 elements with the scalar fallback path than it takes to process 1024 BF16 elements in the vector path!
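For reference, BF16 is the upper 16 bits of an IEEE-754 binary32, so the conversion a compiler should emit is exactly that zero-extend plus shift. A standalone sketch of the expected codegen (my illustration, not library code):

#include <cstdint>
#include <cstring>

// Place the 16 BF16 bits into the upper half of a 32-bit word, then
// bit-cast to float; the lower mantissa bits are simply zero.
float F32FromBF16Bits(uint16_t bits) {
  const uint32_t u = static_cast<uint32_t>(bits) << 16;
  float f;
  std::memcpy(&f, &u, sizeof f);  // well-defined bit-cast
  return f;
}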

Partial vector loads are introduced for the following dot product function overloads:

  • bfloat16_t vs. bfloat16_t (previously used F32FromBF16)
  • float vs. bfloat16_t (previously used ConvertScalarTo<float>, which transitively calls F32FromBF16)

Some before/after graphs from a Sapphire Rapids system:

bfloat16_t vs. bfloat16_t, before and after:
[benchmark graphs: dot_product_bfloat16_vs_bfloat16]

float vs. bfloat16_t, before and after:
[benchmark graphs: dot_product_float_vs_bfloat16]

…rsion

This changes the prologue fallback logic for inputs with fewer elements
than that of a full vector's lane count to use partial vector loads with
implicit zeroing (via `LoadN`) instead of scalar loads and conversions.

Rationale:

Falling back to scalar conversion of `__bf16` to `float` elements will,
on GCC 13+, silently generate a call to the `__extendbfsf2` library
function per conversion, rather than just zero-extending to 32 bits and
shifting left by 16. This both bloats the generated code and has a
substantial runtime cost; on x64 it takes more time to process 8 BF16
elements with the scalar fallback path than it takes to process 1024
BF16 elements in the vector path!

Partial vector loads are introduced for the following dot product
function overloads:

 * `bfloat16_t` vs. `bfloat16_t`
 * `float` vs. `bfloat16_t`
@jan-wassenberg (Member) previously approved these changes Sep 12, 2025

Nice, thank you for updating this!

@jan-wassenberg (Member)

Re-importing so we can see what went wrong with internal CI.

@jan-wassenberg (Member)

Looks like a compiler crash, sigh. I raised an LLVM issue, linked above.

@vekterli (Contributor, Author)

Changing

V sum1 = Zero(df32);
V sum0 = ReorderWidenMulAccumulate(df32, a, b, Zero(df32), sum1);
return ReduceSum(df32, Add(sum0, sum1));

to

const auto ab_lo = Mul(PromoteLowerTo(df32, a), PromoteLowerTo(df32, b));
const auto ab_hi = Mul(PromoteUpperTo(df32, a), PromoteUpperTo(df32, b));
return ReduceSum(df32, Add(ab_lo, ab_hi));

seems to avoid triggering the Clang crash (Godbolt), and should be semantically identical: with zero initial sums, ReorderWidenMulAccumulate computes exactly the sum of the pairwise-widened products, which is what the two Promote+Mul halves compute as well. Performance-wise it also looks very similar to me (after re-running the short-input BF16xBF16 benchmarks from the original comment).

If this sounds good (please let me know if there are preferred alternate approaches), I can update this branch with these changes.

@jan-wassenberg (Member)

I agree that's semantically equivalent, and could be a viable workaround, but this should really be a two-instruction sequence: BFDOT+ADDV. Perhaps we can put your workaround inside an #if HWY_TARGET_IS_SVE && (!HWY_COMPILER_CLANG || HWY_COMPILER_CLANG >= 2300)?

@vekterli (Contributor, Author)

Hmm it looks like this crashes under NEON_BF16 as well (https://godbolt.org/z/YxsnhP49q - armv8.2-a+aes+fp16+bf16+dotprod).

The slight irony of the situation is that this is all due to GCC BF16 scalar codegen, so the scalar fallback code under Clang is probably perfectly fine as it is... 🙂

@jan-wassenberg (Member)

;) This is BTW the third compiler issue we are seeing within a week. It seems LLVM's testing coverage is not yet there.

I'd be fine to land your current change (with ReorderWidenMulAccumulate) #if HWY_COMPILER_GCC_ACTUAL, and otherwise the prior scalar fallback, with a TODO comment mentioning this PR.
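The resulting structure would be roughly as follows (a sketch of the shape only, not the actual diff):

#if HWY_COMPILER_GCC_ACTUAL
  // LoadN-based partial-vector prologue from this PR: avoids GCC emitting
  // per-element __extendbfsf2 calls for short inputs.
#else
  // TODO(this PR): prior scalar fallback. Clang's scalar BF16 codegen is
  // fine, and this sidesteps the Clang crash triggered by
  // ReorderWidenMulAccumulate on capped vectors.
#endif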

Clang does not need the `LoadN` fallback and can happily use
the scalar code. Coincidentally, this also avoids triggering a
Clang crash when `ReorderWidenMulAccumulate` is used alongside
capped vectors.
@vekterli (Contributor, Author)

> I'd be fine to land your current change (with ReorderWidenMulAccumulate) #if HWY_COMPILER_GCC_ACTUAL, and otherwise the prior scalar fallback, with a TODO comment mentioning this PR.

Changes have been made.

Looks like some build failures are happening in the realm of hwy/contrib/thread_pool; I presume these are unrelated.

FWIW, we filed an upstream GCC bug report about this issue a few weeks ago (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121853).

If this behavior remains unchanged, scalar conversion will continue to be a bit of a performance minefield. A sledgehammer option (one that I'm also pretty sure would remove the need for this particular PR entirely) is to change F32FromBF16 to pretend __bf16 is still effectively just a typedef of short on GCC 👀

 HWY_API HWY_BF16_CONSTEXPR float F32FromBF16(bfloat16_t bf) {
-#if HWY_HAVE_SCALAR_BF16_OPERATORS
+#if HWY_HAVE_SCALAR_BF16_OPERATORS && !HWY_COMPILER_GCC_ACTUAL
   return static_cast<float>(bf);
 #else
   return BitCastScalar<float>(static_cast<uint32_t>(

@jan-wassenberg (Member) commented Sep 26, 2025

Thanks for making the change! Re-imported for internal review, hopefully landing soon.

> Looks like some build failures are happening in the realm of hwy/contrib/thread_pool; I presume these are unrelated.

ACK, triggered by clang warning fixes, thanks for the heads up.

> If this behavior remains unchanged, scalar conversion will continue to be a bit of a performance minefield.

Even if this is fixed tomorrow, we'd still have to live with current compilers for several years. I am open to this suggestion. Want to send a separate PR to change the default value of HWY_HAVE_SCALAR_BF16_OPERATORS to HWY_HAVE_SCALAR_BF16_TYPE && (HWY_COMPILER_GCC_ACTUAL >= 1700) with a comment mentioning the issue?
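Taken literally, that suggestion would look something like this (a sketch only; the 1700 cutoff is a guess at a GCC release with bug 121853 fixed, and the final expression may need adjusting so that non-GCC compilers remain enabled):

// Only report native scalar BF16 operators where the conversion does not
// degrade into __extendbfsf2 library calls.
#define HWY_HAVE_SCALAR_BF16_OPERATORS \
  (HWY_HAVE_SCALAR_BF16_TYPE && (HWY_COMPILER_GCC_ACTUAL >= 1700))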

@vekterli (Contributor, Author)

Since #2733 is now in, I don't think this PR serves much purpose. Should we just close it alongside #2699?

@jan-wassenberg (Member)

I have mixed feelings. If it weren't for the clang bug, it would still be better to have vector code for the remainders. As is, it's indeed regrettable to have it enabled only for GCC. Maybe we can keep this open and un-merged until that's fixed, WDYT?

@vekterli (Contributor, Author)

> Maybe we can keep this open and un-merged until that's fixed, WDYT?

Sounds good to me 👍
