Error Handling: refactor `ComputationClient::TransferFromDevice` to propagate status. #9429
base: master
Very nice!
Can you add a Python test to ensure that an OOM results in a Python exception with the expected error message, as opposed to crashing?
Great!
test/test_operations.py (outdated)
@@ -2458,6 +2458,15 @@ def test_add_broadcast_error(self):
      torch.add(a, b)
      torch_xla.sync()

  def test_construct_large_tensor_raises_error(self):
    with self.assertRaisesRegex(RuntimeError,
                                r"Out of memory allocating \d* bytes"):
`*` => `+` (that is, `\d+` instead of `\d*`, so the pattern requires at least one digit)
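To illustrate why the quantifier matters, here is a small sketch comparing the two patterns. The two sample messages are made up for illustration and are not actual output from PyTorch/XLA. `assertRaisesRegex` matches with `re.search`, so the same call is used here.

```python
import re

# Hypothetical error messages, invented for this comparison only.
with_size = "Out of memory allocating 1024 bytes"
without_size = "Out of memory allocating  bytes"  # byte count missing

star = re.compile(r"Out of memory allocating \d* bytes")  # zero or more digits
plus = re.compile(r"Out of memory allocating \d+ bytes")  # one or more digits

messages = (with_size, without_size)
star_matches = [bool(star.search(m)) for m in messages]
plus_matches = [bool(plus.search(m)) for m in messages]

print(star_matches)  # [True, True]: \d* also accepts a missing byte count
print(plus_matches)  # [True, False]: \d+ rejects the message without digits
```

With `\d+`, the test would fail if the message ever lost its byte count, which makes the assertion stricter.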
This PR makes 2 main changes in order to standardize and improve error handling:
- `ComputationClient::TransferFromDevice` returns a `StatusOr<T>` instance
- `xla::PjRtLoadedExecutable::Execute(Sharded)` calls are wrapped with `GetValueOrThrow`
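The status-propagation pattern behind these two changes can be sketched in Python. This is only a rough analogue under stated assumptions, not the actual C++ code: the `StatusOr` class, `get_value_or_throw`, `transfer_from_device`, and the `capacity` parameter below are all hypothetical stand-ins written for illustration.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class StatusOr(Generic[T]):
    """Hypothetical stand-in for a C++ StatusOr<T>: holds a value or an error."""
    value: Optional[T] = None
    error: Optional[str] = None

    def ok(self) -> bool:
        return self.error is None


def get_value_or_throw(result: StatusOr[T]) -> T:
    """Hypothetical analogue of GetValueOrThrow: unwrap the value or raise."""
    if not result.ok():
        raise RuntimeError(result.error)
    return result.value


def transfer_from_device(num_bytes: int, capacity: int = 1 << 20) -> StatusOr[bytes]:
    """Toy transfer: returns an error status instead of crashing on OOM."""
    if num_bytes > capacity:
        return StatusOr(error=f"Out of memory allocating {num_bytes} bytes")
    return StatusOr(value=bytes(num_bytes))


# The callee reports failure as a status; the caller decides where to raise.
data = get_value_or_throw(transfer_from_device(16))

try:
    get_value_or_throw(transfer_from_device(1 << 30))
except RuntimeError as e:
    caught = str(e)
    print(caught)
```

The point of the design is that the failure travels back to the caller as a value, and it only becomes an exception at a boundary where one can be raised safely.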
These changes mainly affect the errors raised when an OOM occurs. The second one targets eager mode. As an example, the following is the result of running the file below (without eager mode):
Before this PR:
After this PR:
(with `XLA_SHOW_CPP_ERROR_CONTEXT=1`)