
Error Handling: refactor ComputationClient::TransferFromDevice to propagate status. #9429


Open: wants to merge 5 commits into master from ysiraichi/status-for-oom-errors

Conversation

@ysiraichi (Collaborator) commented Jul 1, 2025

This PR makes two main changes in order to standardize and improve error handling:

  • ComputationClient::TransferFromDevice returns a StatusOr<T> instance
  • Wrap xla::PjRtLoadedExecutable::Execute(Sharded) with GetValueOrThrow
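The status-propagation pattern behind both changes can be sketched in Python (hypothetical names; the actual implementation is C++ and uses XLA's `StatusOr` together with a `GetValueOrThrow` helper):

```python
# Hypothetical sketch, NOT the actual C++ API: a StatusOr-style result
# carries either a value or an error status, so callers decide how to
# react instead of the process crashing on failure.

class StatusOr:
    """Holds either a value or an error message, never both."""

    def __init__(self, value=None, error=None):
        self._value = value
        self._error = error

    def ok(self):
        return self._error is None

    def value_or_throw(self):
        # Mirrors GetValueOrThrow: turn an error status into an
        # exception, which Python bindings surface as RuntimeError.
        if self._error is not None:
            raise RuntimeError(self._error)
        return self._value


def transfer_from_device(fits_in_memory):
    # Stand-in for ComputationClient::TransferFromDevice: instead of
    # aborting on OOM, it now returns a StatusOr the caller can inspect.
    if not fits_in_memory:
        return StatusOr(error="Error preparing computation: Out of memory")
    return StatusOr(value=[1, 2, 3])
```

Callers that want the old throwing behavior simply wrap the result in `value_or_throw()`, which is what the binding layer does before the error reaches Python.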

These changes mainly affect the errors reported when an OOM occurs; the second change specifically targets eager mode. As an example, the following is the result of running the script below (without eager mode):

import torch
import torch_xla

device = torch_xla.device()
a = torch.rand(1024, 1024, 1024, 1024, 1024, device=device)
b = a.sum()
print(b)

Before this PR:

F0000 00:00:1751368998.014824    2835 pjrt_computation_client.cpp:525] Non-OK-status: status
Status: INTERNAL: Error preparing computation: Out of memory allocating 4503599761588224 bytes.
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        torch_xla::runtime::PjRtComputationClient::TransferFromDevice(absl::lts_20230802::Span<std::shared_ptr<torch_xla::runtime::ComputationClient::Data> const>)
        ...
        _start
*** End stack trace ***
Failed to await future from buffer to literal inTransferFromDevice
*** Check failure stack trace: ***
    @     0x7ddc923438f9  absl::lts_20230802::log_internal::LogMessage::PrepareToDie()
    ...
Aborted (core dumped)

After this PR:

Traceback (most recent call last):
  File "examples/mem.py", line 11, in <module>
    print(b)
  File "torch/_tensor.py", line 590, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "torch/_tensor_str.py", line 726, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "torch/_tensor_str.py", line 462, in _str_intern
    self = self.to("cpu")
RuntimeError: Error preparing computation: Out of memory allocating 4503599627370496 bytes.

And with XLA_SHOW_CPP_ERROR_CONTEXT=1 set:

RuntimeError: Error preparing computation: Out of memory allocating 4503599627370496 bytes. (at torch_xla/csrc/runtime/pjrt_computation_client.cpp:524)
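Because the failure now surfaces as an ordinary RuntimeError instead of a process abort, user code can catch it and recover. A minimal sketch (`materialize` is a hypothetical stand-in for the `tensor.to("cpu")` call above, mimicking the new error message):

```python
# Sketch of recovering from a device OOM now that it raises instead of
# core-dumping. `materialize` is a stand-in, not a torch_xla API.

def materialize(tensor_too_large):
    # Stand-in for `tensor.to("cpu")`, which with this PR raises a
    # RuntimeError on OOM rather than aborting the process.
    if tensor_too_large:
        raise RuntimeError(
            "Error preparing computation: Out of memory allocating "
            "4503599627370496 bytes.")
    return "cpu tensor"


def materialize_or_none(tensor_too_large):
    try:
        return materialize(tensor_too_large)
    except RuntimeError as e:
        if "Out of memory" in str(e):
            return None  # caller can retry with a smaller tensor
        raise  # unrelated errors still propagate
```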


@ysiraichi ysiraichi force-pushed the ysiraichi/status-for-oom-errors branch 3 times, most recently from cef8c1e to 247fdf5 Compare July 1, 2025 16:29
@ysiraichi ysiraichi force-pushed the ysiraichi/status-for-oom-errors branch 2 times, most recently from b390a61 to 821c384 Compare July 1, 2025 18:15
@ysiraichi ysiraichi changed the base branch from ysiraichi/status-qol-functions to master July 1, 2025 18:16
@ysiraichi ysiraichi marked this pull request as ready for review July 1, 2025 18:20
@zhanyong-wan (Collaborator) left a comment:
Very nice!

Can you add a python test to ensure that OOM does result in a python exception with the expected error message as opposed to crashing?


@ysiraichi ysiraichi force-pushed the ysiraichi/status-for-oom-errors branch from 048459c to a370c5d Compare July 17, 2025 23:13
@ysiraichi ysiraichi marked this pull request as ready for review July 19, 2025 18:06
@ysiraichi ysiraichi requested a review from zhanyong-wan July 19, 2025 18:06
@zhanyong-wan (Collaborator) left a comment:
Great!

@@ -2458,6 +2458,15 @@ def test_add_broadcast_error(self):
       torch.add(a, b)
       torch_xla.sync()

+  def test_construct_large_tensor_raises_error(self):
+    with self.assertRaisesRegex(RuntimeError,
+                                r"Out of memory allocating \d* bytes"):
Review comment on this line (Collaborator):

`*` => `+` (i.e., use `\d+` so the regex requires at least one digit in the byte count)

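Applying the reviewer's suggestion, the pattern becomes `\d+`, which rejects a message with an empty byte count. The full unittest needs a real XLA device, so the sketch below only exercises the regex itself, against the message shown in the PR description:

```python
# Check the suggested OOM regex (`\d+` instead of `\d*`) against the
# error message format shown in this PR. Pure-regex sketch; the actual
# test runs under torch_xla on a real device.
import re

OOM_PATTERN = r"Out of memory allocating \d+ bytes"

sample = ("Error preparing computation: Out of memory allocating "
          "4503599627370496 bytes.")

# `\d+` matches the real message...
print(bool(re.search(OOM_PATTERN, sample)))  # True

# ...but, unlike `\d*`, refuses a message with no digits at all:
no_digits = "Out of memory allocating  bytes"
print(bool(re.search(OOM_PATTERN, no_digits)))                              # False
print(bool(re.search(r"Out of memory allocating \d* bytes", no_digits)))    # True
```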