
Conversation

@simonlui (Contributor) commented Dec 19, 2024

I expect there may be some opinions about 1.) and 2.), so I am open to arguments for detail changes or a different implementation if need be. The list of changes includes:

1.) Add a --oneapi-device-selector option that does for Intel oneAPI devices what --cuda-device does for CUDA devices. It doesn't necessarily need to be limited to GPUs, but for the time being that is effectively all it will select. Documentation on the selector syntax can be found at https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#oneapi_device_selector (see the sketch after this list for how such a flag is typically wired up).

2.) Per https://github.com/pytorch/pytorch/blob/v2.5.0/docs/source/notes/numerical_accuracy.rst#reduced-precision-reduction-for-fp16-and-bf16-in-scaled-dot-product-attention-sdpa, which pytorch/pytorch#135778 brought to my attention, the default SDPA behavior changed in PyTorch 2.5: fp16/bf16 reductions in the math backend are now upcast to fp32 by default to avoid numerical errors. Since the old behavior has been working fine with ComfyUI, set torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp to True by default when PyTorch 2.5 or later is detected.

3.) Documentation changes for IPEX, noting that one can install the mainline builds of PyTorch to get ComfyUI working on Intel GPUs, with the caveat that most optimizations aren't there yet; mainline XPU support is, if anything, still a beta release.
Deferring to #6069 for the documentation changes.
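For reference, here is a minimal sketch of how a flag like this can be wired up, modeled on how --cuda-device works by exporting an environment variable before the runtime initializes. The argument parsing shown is illustrative, not the exact code from this PR:

```python
import os
import argparse

parser = argparse.ArgumentParser()
# The value is passed straight through to the oneAPI runtime, so any
# selector string from the linked documentation (e.g. "level_zero:0") works.
parser.add_argument("--oneapi-device-selector", type=str, default=None,
                    metavar="SELECTOR_STRING",
                    help="Sets the ONEAPI_DEVICE_SELECTOR environment variable.")
args = parser.parse_args()

if args.oneapi_device_selector is not None:
    # This must happen before torch (or IPEX) initializes the XPU runtime,
    # otherwise the selector has no effect.
    os.environ["ONEAPI_DEVICE_SELECTOR"] = args.oneapi_device_selector
```

With that in place, something like `python main.py --oneapi-device-selector "level_zero:0"` would restrict ComfyUI to the first Level Zero GPU.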

torch.backends.cuda.enable_mem_efficient_sdp(True)

# torch_version is the version string (e.g. "2.5.1"), so the index checks
# read the major/minor digits; the toggle only exists in PyTorch 2.5+.
if int(torch_version[0]) == 2 and int(torch_version[2]) >= 5:
    torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
@comfyanonymous (Owner) commented:

There are no situations where the math backend is actually used by ComfyUI unless you force it.

@simonlui (Contributor, Author) commented:

I should've explained this better. For any non-Nvidia GPU, the math backend is what ends up being used when PyTorch Attention is selected, since the Flash Attention and mem-efficient backends are CUDA only; AMD's implementations of both only landed in the nightly packages about 1-2 weeks ago. I can try to gate this off a bit better if you want, but the PyTorch change mentioned above does slow things down for GPUs that are stuck in that situation, like Intel right now.
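For illustration, a minimal sketch of what that gating could look like, assuming the check runs where the SDPA backends get configured; `torch.xpu.is_available()` is the stock mainline probe for Intel GPUs in recent PyTorch releases, and this is not the exact code from the PR:

```python
import torch

torch_version = torch.version.__version__  # e.g. "2.5.1"

# The fused Flash/mem-efficient SDPA kernels ship for CUDA (and, as of
# very recent nightlies, ROCm); Intel XPU still falls through to the
# math backend, which is the only path the 2.5 upcast change slows down.
uses_math_backend = torch.xpu.is_available() and not torch.cuda.is_available()

if int(torch_version[0]) == 2 and int(torch_version[2]) >= 5 and uses_math_backend:
    # Restore the pre-2.5 reduced-precision behavior only where the
    # math backend is actually what runs.
    torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
```

That would keep the default upcasting intact on CUDA while avoiding the slowdown on devices that have no fused kernels to fall back on.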

@comfyanonymous comfyanonymous merged commit c6b9c11 into comfyanonymous:master Dec 23, 2024
5 checks passed
@simonlui simonlui deleted the add_xpu_device branch December 23, 2024 08:45