🐛 Describe the bug
Simplified from a customer model script, where the h2d copy is one of the bottlenecks.
Results:

```
Intel(R) Arc(TM) B580 Graphics
copy 1 bytes 11616 times
time (s): 1.3978004455566406

NVIDIA GeForce RTX 4080 SUPER
copy 1 bytes 11616 times
time (s): 0.0717000961303711

Intel(R) Data Center GPU Max 1550
copy 1 bytes 11616 times
time (s): 0.05427908897399902

NVIDIA A100-PCIE-40GB
copy 1 bytes 11616 times
time (s): 0.13671040534973145
```
Repro script:

```python
import torch
import time

device = torch.accelerator.current_accelerator()
if device.type == "xpu":
    print(torch.xpu.get_device_name(device))
else:
    print(torch.cuda.get_device_name(device))

d_value = torch.tensor([True], device=device)
print("copy", d_value.element_size() * d_value.numel(), "bytes 11616 times")

# warmup copies
for _ in range(10):
    value = d_value.to("cpu")
torch.accelerator.synchronize()

s = time.time()
for _ in range(11616):
    value = d_value.to("cpu")
torch.accelerator.synchronize()
e = time.time()
print("time (s):", e - s)
```

Versions
PyTorch 2.8 and current nightly
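
As an aside (not part of the report itself): a generic mitigation for many tiny device-to-host copies, regardless of backend, is to accumulate the values in one device tensor and transfer them in a single copy, amortizing the per-copy launch/sync overhead. A minimal CPU-only sketch of the idea — `N` and the `flags` tensor here are illustrative stand-ins, not from the customer script:

```python
import torch

# Hypothetical stand-in for 11616 per-step one-byte device flags.
N = 11616
flags = torch.ones(N, dtype=torch.bool)

# What the repro effectively does: N tiny independent copies.
vals = [flags[i:i + 1].clone() for i in range(N)]

# Batched alternative: one copy of the whole flag tensor.
vals_batched = flags.clone()

# Both paths recover the same values; the batched path issues one
# transfer instead of N, which is where the time goes on real devices.
assert torch.equal(torch.cat(vals), vals_batched)
```

Whether this is applicable depends on whether the model can defer reading the flags until the end of the step; if each copy gates control flow, the per-copy latency reported above is on the critical path.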