Fix: Ensure DTW cost tensor uses the same device as the input tensor (x.device) #2561
Problem
In multi-GPU environments, the dtw_cuda() function in whisper/timing.py raises a device access error from Triton during transcription when word timestamps are enabled.
This occurs because the cost tensor is sent to .cuda() without specifying a device. Without an explicit device, .cuda() places the tensor on the current CUDA device (cuda:0 by default), which leads to a device mismatch when the input tensor x resides on a different GPU (e.g., cuda:1).
Root Cause
In dtw_cuda(), the cost tensor is moved to the GPU with a bare .cuda() call, sketched below.
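A sketch of the offending pattern, paraphrased from dtw_cuda() in whisper/timing.py (the exact surrounding lines and tensor shapes may differ slightly):

```python
# cost matrix for the DTW recurrence, created on the CPU
cost = torch.ones(N + M + 2, M + 2) * np.inf
cost[0, 0] = 0
cost = cost.cuda()  # implicitly targets the current device (cuda:0), not x.device
```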
This assumes all data lives on cuda:0, which is not true in multi-GPU setups. As a result, Triton throws a device access error when trying to launch the kernel with mismatched tensors.
Solution
The fix is to ensure the cost tensor is sent to the same device as the input tensor:
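A minimal sketch of the change, assuming the same cost construction as above:

```python
cost = torch.ones(N + M + 2, M + 2) * np.inf
cost[0, 0] = 0
cost = cost.to(x.device)  # place the cost matrix on whichever GPU holds x
```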
This guarantees consistency and allows the Triton kernel to access all pointers correctly.
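For reference, a minimal way to exercise the affected code path on a multi-GPU host; the model name and audio file below are placeholders, not part of this PR:

```python
import whisper

# Load the model on a GPU other than cuda:0 to reproduce the original mismatch.
model = whisper.load_model("base", device="cuda:1")

# word_timestamps=True routes through dtw_cuda(); this previously failed
# because the cost tensor was created on cuda:0 while x lived on cuda:1.
result = model.transcribe("audio.mp3", word_timestamps=True)
print(result["text"])
```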