BMG d2h copy is very slow compared to PVC and 4080S #2157

@jianyizh


🐛 Describe the bug

Simplified from a customer model script, where the d2h (device-to-host) copy is one of the bottlenecks.

result:
Intel(R) Arc(TM) B580 Graphics
copy 1 bytes 11616 times
time (s): 1.3978004455566406

NVIDIA GeForce RTX 4080 SUPER
copy 1 bytes 11616 times
time (s): 0.0717000961303711

Intel(R) Data Center GPU Max 1550
copy 1 bytes 11616 times
time (s): 0.05427908897399902

NVIDIA A100-PCIE-40GB
copy 1 bytes 11616 times
time (s): 0.13671040534973145

import time

import torch

# Use whichever accelerator is available (XPU or CUDA).
device = torch.accelerator.current_accelerator()
if device.type == "xpu":
    print(torch.xpu.get_device_name(device))
else:
    print(torch.cuda.get_device_name(device))

# One-element bool tensor, so each device-to-host copy moves a single byte.
d_value = torch.tensor([True], device=device)
num_copies = 11616
print("copy", d_value.element_size() * d_value.numel(), "bytes", num_copies, "times")

# Warm up before timing.
for _ in range(10):
    value = d_value.to("cpu")
torch.accelerator.synchronize()

# Time the blocking d2h copies.
s = time.time()
for _ in range(num_copies):
    value = d_value.to("cpu")
torch.accelerator.synchronize()
e = time.time()
print("time (s):", e - s)
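
To make the gap easier to compare, the totals reported above can be divided by the 11616 iterations to get an approximate per-copy latency. The snippet below is only a rough sketch of that arithmetic: the numbers are copied from the results in this issue, the helper name `per_copy_us` is made up for illustration, and the figures include Python and dispatch overhead, so they reflect the end-to-end cost of one `.to("cpu")` call rather than the raw transfer time.

```python
# Totals (seconds) copied from the results in this issue.
# Each run performed 11616 one-byte device-to-host copies.
N_COPIES = 11616
B580 = "Intel(R) Arc(TM) B580 Graphics"
total_seconds = {
    B580: 1.3978004455566406,
    "NVIDIA GeForce RTX 4080 SUPER": 0.0717000961303711,
    "Intel(R) Data Center GPU Max 1550": 0.05427908897399902,
    "NVIDIA A100-PCIE-40GB": 0.13671040534973145,
}


def per_copy_us(total_s, n=N_COPIES):
    """Average end-to-end time of one copy, in microseconds (hypothetical helper)."""
    return total_s / n * 1e6


for name, total in total_seconds.items():
    # Ratio of the B580 total to this device's total (1.0 for the B580 itself).
    ratio = total_seconds[B580] / total
    print(f"{name}: {per_copy_us(total):.1f} us per copy (B580 / this device = {ratio:.1f}x)")
```

By this estimate the B580 spends on the order of 120 microseconds per one-byte copy, while the other three devices are in the range of roughly 5 to 12 microseconds.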

Versions

PyTorch 2.8 and the current nightly
