
ACL deconvolution crashes with large weights #1193

@fadara01

Description


PyTorch reproducer:

python -c "import torch;torch.nn.ConvTranspose1d(2016, 1026, 1024, stride=256)(torch.rand(1, 2016, 224))"

Standalone oneDNN (v3.7.1) with ACL (v25.02) reproducer (with benchdnn):

ONEDNN_VERBOSE=profile,profile_externals ./tests/benchdnn/benchdnn --deconv mb1_ic2016oc1026_ih1oh1kh1sh1dh0ph0_iw224ow58112kw1024sw256dw0pw0
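The two reproducers describe the same problem: the ow58112 in the benchdnn descriptor is the output length PyTorch derives for this layer. A minimal sketch of that arithmetic, using the output-length formula from the PyTorch ConvTranspose1d documentation (the helper name is made up for illustration; note that benchdnn's dw0 corresponds to PyTorch's dilation=1, since oneDNN counts dilation from zero):

```python
def conv_transpose1d_out_len(l_in, kernel, stride, padding=0,
                             dilation=1, output_padding=0):
    """Output length of a 1D transposed convolution (PyTorch convention)."""
    return ((l_in - 1) * stride - 2 * padding
            + dilation * (kernel - 1) + output_padding + 1)

# Values taken from the PyTorch reproducer: input length 224,
# kernel size 1024, stride 256, everything else at its default.
ow = conv_transpose1d_out_len(224, kernel=1024, stride=256)
print(ow)  # 58112, matching ow58112 in the benchdnn descriptor
```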

Root cause:

The root cause in ACL is a write to an invalid address on this line.

The address we write to is computed by this ptr_to_element function, which gets its byte offset from offset_element_in_bytes. That offset is an int32_t, and it overflows here because of the size of the weights in this problem, 2016 * 1026 * 1024 elements, so the function returns a negative offset (-2139758464).

A smaller workload like torch.nn.ConvTranspose1d(500, 1026, 1024, stride=256)(torch.rand(1, 500, 224)) doesn't crash.
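To make the difference between the two workloads concrete, here is a rough sketch of the arithmetic. It assumes float32 weights (4 bytes per element, the PyTorch default) and a dense weight tensor; the helper names are illustrative and not ACL code. The exact negative offset quoted above depends on which element is being written, so this only shows that byte offsets into the large weight tensor cannot fit in int32_t while those into the small one can.

```python
INT32_MAX = 2**31 - 1  # largest value an int32_t can hold

def weight_bytes(ic, oc, kw, element_size=4):
    """Total size in bytes of an ic x oc x kw weight tensor."""
    return ic * oc * kw * element_size

def wrap_int32(value):
    """Emulate storing an arbitrary integer in a 32-bit signed integer."""
    return (value + 2**31) % 2**32 - 2**31

large = weight_bytes(2016, 1026, 1024)  # crashing workload
small = weight_bytes(500, 1026, 1024)   # workload that does not crash

print(large, large > INT32_MAX)  # 8472231936 True
print(small, small > INT32_MAX)  # 2101248000 False
print(wrap_int32(large))         # -117702656: a wrapped, negative offset
```

Under these assumptions the 500-channel case stays just below the int32_t limit, which is consistent with it not crashing.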

Suggested fixes:

  • use 64 bits for the offset (and the return type) in offset_element_in_bytes
  • OR make ACL fail the validation stage if the problem is too big.
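The second option could look roughly like the check below. This is only a Python model of the idea, assuming a dense, contiguous tensor; validate_offsets_fit_int32 and max_byte_offset are hypothetical names, not ACL functions, and a real fix would live in ACL's C++ validation code.

```python
INT32_MAX = 2**31 - 1  # largest value an int32_t can hold

def max_byte_offset(shape, element_size):
    """Byte offset of the last element of a dense, contiguous tensor."""
    count = 1
    for dim in shape:
        count *= dim
    return (count - 1) * element_size

def validate_offsets_fit_int32(shape, element_size):
    """Hypothetical validation step: reject tensors whose largest
    byte offset cannot be represented in a 32-bit signed integer."""
    offset = max_byte_offset(shape, element_size)
    if offset > INT32_MAX:
        raise ValueError(
            f"tensor {shape} needs byte offsets up to {offset}, "
            f"which exceeds INT32_MAX ({INT32_MAX})"
        )
    return offset

# The small workload passes; the large one is rejected up front
# instead of writing to an invalid address later.
print(validate_offsets_fit_int32((500, 1026, 1024), 4))  # 2101247996
try:
    validate_offsets_fit_int32((2016, 1026, 1024), 4)
except ValueError as err:
    print("rejected:", err)
```

The first option (64-bit offsets) removes the limit rather than reporting it; Python integers are arbitrary precision, so max_byte_offset above already behaves like the 64-bit version.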

Full Stack Trace: See pytorch/pytorch#165654
