🐛 Describe the bug
Simplified from a customer model script, where the h2d copy is one of the bottlenecks.
Results:

```
Intel(R) Arc(TM) B580 Graphics
copy 1 bytes 11616 times
time (s): 1.3978004455566406

NVIDIA GeForce RTX 4080 SUPER
copy 1 bytes 11616 times
time (s): 0.0717000961303711

Intel(R) Data Center GPU Max 1550
copy 1 bytes 11616 times
time (s): 0.05427908897399902

NVIDIA A100-PCIE-40GB
copy 1 bytes 11616 times
time (s): 0.13671040534973145
```
Repro script:

```python
import torch
import time

device = torch.accelerator.current_accelerator()
if device.type == "xpu":
    print(torch.xpu.get_device_name(device))
else:
    print(torch.cuda.get_device_name(device))

d_value = torch.tensor([True], device=device)
print("copy", d_value.element_size() * d_value.numel(), "bytes 11616 times")

# warmup copies
for _ in range(10):
    value = d_value.to("cpu")
torch.accelerator.synchronize()

s = time.time()
for _ in range(11616):
    value = d_value.to("cpu")
torch.accelerator.synchronize()
e = time.time()
print("time (s):", e - s)
```

Versions
PyTorch 2.8 and current nightly
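
As an aside (not part of the report itself): a generic mitigation for many tiny device-to-host copies, regardless of backend, is to accumulate the values in one device tensor and transfer them in a single copy, amortizing the per-copy launch/sync overhead. A minimal CPU-only sketch of the idea — `N` and the `flags` tensor here are illustrative stand-ins, not from the customer script:

```python
import torch

# Hypothetical stand-in for 11616 per-step one-byte device flags.
N = 11616
flags = torch.ones(N, dtype=torch.bool)

# What the repro effectively does: N tiny independent copies.
vals = [flags[i:i + 1].clone() for i in range(N)]

# Batched alternative: one copy of the whole flag tensor.
vals_batched = flags.clone()

# Both paths recover the same values; the batched path issues one
# transfer instead of N, which is where the time goes on real devices.
assert torch.equal(torch.cat(vals), vals_batched)
```

Whether this is applicable depends on whether the model can defer reading the flags until the end of the step; if each copy gates control flow, the per-copy latency reported above is on the critical path.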