[FlexAttention] FlexDecoding accuracy discrepancy between XPU and CUDA while compiling torch.ops.higher_order.flex_attention
#3588
Describe the bug
Issue Description:
While running the XPU FlexDecoding unit test, we observed a test failure caused by a tensor mismatch.
We captured the compiled results on both XPU and CUDA, unified their outputs, and found that running the Triton code generated by the unit test produces different results on the two backends.
triton-code.zip
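For context, a mismatch like this is typically flagged by an rtol/atol comparison (as `torch.testing.assert_close` does). The helper below is a hypothetical, dependency-free sketch of that check, not part of the attached unit test; the tolerance values are illustrative assumptions:

```python
def max_mismatch(xpu_out, cuda_out, rtol=1e-3, atol=1e-5):
    """Return the worst element-wise violation of the rtol/atol bound.

    Mirrors the assert_close criterion: values a, b "match" when
    |a - b| <= atol + rtol * |b|. Returns 0.0 if every element matches.
    """
    worst = 0.0
    for a, b in zip(xpu_out, cuda_out):
        allowed = atol + rtol * abs(b)
        excess = abs(a - b) - allowed
        if excess > worst:
            worst = excess
    return worst

# Illustrative data (not from the failing test):
ref = [0.1234, -0.5678, 0.9012]
close = [0.1234, -0.5678, 0.9013]  # off by 1e-4, within rtol * |b| here
far = [0.1234, -0.5678, 1.0000]    # off by ~0.1, a genuine mismatch

print(max_mismatch(close, ref))  # 0.0
print(max_mismatch(far, ref) > 0.0)  # True
```

In the actual report, the mismatch showed up as this check failing on XPU against the CUDA reference, which is what motivated diffing the generated Triton kernels directly.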
Environment details
You can refer to this issue to set up a reproducing environment:
#3518