Commit 0478c13 (parent 7f59ca4)

Use LoadN for short BF16 dot product inputs instead of scalar conversion

This changes the prologue fallback logic for inputs with fewer elements
than a full vector's lane count to use partial vector loads with
implicit zeroing (via `LoadN`) instead of scalar loads and conversions.

Rationale:
Falling back to scalar conversion of `__bf16` to `float` elements will,
on GCC 13+, end up silently generating a call to the `__extendbfsf2`
library function per conversion, rather than just zero-extending to 32
bits and shifting left by 16. This both bloats the generated code and
has a substantial runtime cost; on x64 it takes more time to process
8 BF16 elements with the scalar fallback path than it takes to process
1024 BF16 elements in the vector path!

Partial vector loads are introduced for the following dot product
function overloads:
* `bfloat16_t` vs. `bfloat16_t`
* `float` vs. `bfloat16_t`
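
The two ideas in the message above can be illustrated with a minimal scalar sketch. It is not the code from this commit and it does not call `LoadN`. It only emulates the same behaviour in plain C++: widening BF16 bits by a zero-extend and a 16-bit left shift, and loading a short input into a zero-padded lane buffer. The names `kLanes`, `BF16BitsToFloat`, `LoadZeroPadded` and `DotShortBF16` are hypothetical, and the lane count of 8 is an assumption chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical lane count for illustration; real vector widths vary by target.
constexpr std::size_t kLanes = 8;

// BF16 holds the upper 16 bits of an IEEE-754 binary32 value, so widening is
// a zero-extension to 32 bits followed by a left shift by 16.
inline float BF16BitsToFloat(std::uint16_t bits) {
  const std::uint32_t widened = static_cast<std::uint32_t>(bits) << 16;
  float out;
  std::memcpy(&out, &widened, sizeof(out));  // bit cast without aliasing UB
  return out;
}

// Emulates a partial load with implicit zeroing: copies the first n elements
// and leaves the remaining lanes at zero (BF16 bits 0x0000 are +0.0f).
inline std::array<std::uint16_t, kLanes> LoadZeroPadded(const std::uint16_t* p,
                                                        std::size_t n) {
  assert(n <= kLanes);
  std::array<std::uint16_t, kLanes> lanes{};  // value-initialized to zero
  if (n != 0) {
    std::memcpy(lanes.data(), p, n * sizeof(std::uint16_t));
  }
  return lanes;
}

// Dot product of two short BF16 inputs (n <= kLanes). Every lane is processed;
// the zero-padded lanes contribute 0 to the sum, so no per-element scalar
// tail handling is needed.
inline float DotShortBF16(const std::uint16_t* a, const std::uint16_t* b,
                          std::size_t n) {
  const auto va = LoadZeroPadded(a, n);
  const auto vb = LoadZeroPadded(b, n);
  float sum = 0.0f;
  for (std::size_t i = 0; i < kLanes; ++i) {
    sum += BF16BitsToFloat(va[i]) * BF16BitsToFloat(vb[i]);
  }
  return sum;
}
```

Because the padding lanes are exactly zero, the result equals the dot product over the first `n` elements only. In a real vector implementation the per-lane loop would presumably be replaced by vector promote and multiply-add operations.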

1 file changed: 10 additions, 24 deletions, in two hunks (original lines 174-194 and 279-297; new lines 174-185 and 270-283).