Error Handling: propagate status for `ReleaseGilAndTransferData` and `XlaDataToTensors`. #9431

ysiraichi · 2025-07-01T16:42:24Z

This PR refactors our error handling by replacing GetValueOrThrow with proper status propagation using absl::StatusOr<T> and XLA_ASSIGN_OR_RETURN macros.

Key Changes:

ReleaseGilAndTransferData Function:
- Updated the function signature to return absl::StatusOr<std::vector<xla::Literal>>.
- Replaced GetComputationClientOrDie() with GetComputationClient().
- Utilized XLA_ASSIGN_OR_RETURN for client acquisition and TransferFromDevice calls.
- Updated callers in tensor_util.cpp and xla_graph_executor.cpp to handle the new StatusOr<T> return type.

XlaDataToTensors Function:
- Modified the function signature to return absl::StatusOr<std::vector<at::Tensor>>.
- Replaced GetValueOrThrow with XLA_ASSIGN_OR_RETURN for the ReleaseGilAndTransferData call.
- Updated all callers (including XLATensor::ToTensor, test_xla_sharding.cpp, init_python_bindings.cpp, and xla_backend_impl.cpp) to correctly handle the StatusOr<T> return type.
- Added necessary status.h includes to xla_backend_impl.cpp and test_xla_sharding.cpp.

These modifications align with existing status propagation patterns in the codebase, as seen in pjrt_registry.cpp, and maintain API-level backward compatibility while improving internal error handling within the tensor conversion pipeline.
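To illustrate the pattern described above, here is a minimal, self-contained sketch. It does not use the real Abseil or torch_xla code: `Error`, `StatusOr`, `SKETCH_ASSIGN_OR_RETURN`, `TransferFromDevice`, `ToStrings`, and `ValueOrThrow` are all hypothetical stand-ins, written only to show how an error can travel up as a value and be turned into an exception at the outermost caller.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical error type; the real code uses absl::Status.
struct Error {
  std::string message;
};

// Hypothetical, simplified stand-in for absl::StatusOr<T>:
// holds either a value of type T or an Error.
template <typename T>
class StatusOr {
 public:
  StatusOr(T value) : data_(std::move(value)) {}
  StatusOr(Error error) : data_(std::move(error)) {}

  bool ok() const { return std::holds_alternative<T>(data_); }
  T& value() {
    assert(ok());
    return std::get<T>(data_);
  }
  const Error& error() const {
    assert(!ok());
    return std::get<Error>(data_);
  }

 private:
  std::variant<T, Error> data_;
};

// Sketch of an assign-or-return macro (not the real XLA_ASSIGN_OR_RETURN):
// evaluates expr, returns the error early, otherwise binds the value to lhs.
#define SKETCH_ASSIGN_OR_RETURN(lhs, expr)     \
  auto lhs##_or = (expr);                      \
  if (!lhs##_or.ok()) return lhs##_or.error(); \
  auto lhs = std::move(lhs##_or.value())

// Hypothetical transfer function: fails for a negative handle.
StatusOr<std::vector<int>> TransferFromDevice(int handle) {
  if (handle < 0) return Error{"transfer failed: invalid handle"};
  return std::vector<int>{handle, handle + 1};
}

// Propagating style: the caller receives the error as a value.
StatusOr<std::vector<std::string>> ToStrings(int handle) {
  SKETCH_ASSIGN_OR_RETURN(literals, TransferFromDevice(handle));
  std::vector<std::string> out;
  for (int v : literals) out.push_back(std::to_string(v));
  return out;
}

// Boundary helper in the spirit of GetValueOrThrow: the outermost caller
// converts the propagated error into an exception.
template <typename T>
T ValueOrThrow(StatusOr<T> result) {
  if (!result.ok()) throw std::runtime_error(result.error().message);
  return std::move(result.value());
}
```

In this sketch only the outermost caller decides whether to throw; every intermediate function just forwards the error.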

ysiraichi · 2025-07-01T16:42:50Z

Blocked until #9429 is merged.

- Refactor status tests
- Remove test_status.cpp as its tests are now covered by specialized context tests
- Both specialized tests now cover all status utility functions and macros

- Add `XLA_RETURN_IF_ERROR_WITH_LOCATION` macro for external library status propagation
- Add `XLA_ASSIGN_OR_RETURN_WITH_LOCATION` macro for external library status handling
- Enhance test coverage with new test cases for location-specific macro variants
- Improve macro documentation to clarify internal vs external usage patterns
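The actual definitions of the `*_WITH_LOCATION` macros are not shown in this thread, so the following is only a rough sketch of the general idea: a status coming from an external library carries no source location, so the macro attaches the call site before returning. `Status`, `WithLocation`, `SKETCH_RETURN_IF_ERROR_WITH_LOCATION`, `ExternalCall`, and `Caller` are hypothetical names, and the behavior shown is an assumption, not the real implementation.

```cpp
#include <string>
#include <utility>

// Hypothetical status type; the real code uses absl::Status.
struct Status {
  bool ok = true;
  std::string message;
  std::string location;  // empty when no source location is known
};

// Attach a "file:line" location to a failed status that has none yet.
inline Status WithLocation(Status status, const char* file, int line) {
  if (!status.ok && status.location.empty()) {
    status.location = std::string(file) + ":" + std::to_string(line);
  }
  return status;
}

// Sketch of a return-if-error macro that records the call site
// (assumed behavior, not the real XLA_RETURN_IF_ERROR_WITH_LOCATION).
#define SKETCH_RETURN_IF_ERROR_WITH_LOCATION(expr)                        \
  do {                                                                    \
    Status sketch_status = (expr);                                        \
    if (!sketch_status.ok) {                                              \
      return WithLocation(std::move(sketch_status), __FILE__, __LINE__);  \
    }                                                                     \
  } while (false)

// Hypothetical external library call that knows nothing about locations.
Status ExternalCall(bool fail) {
  if (fail) return Status{false, "external failure", ""};
  return Status{};
}

// The caller propagates the external error and adds its own location.
Status Caller(bool fail) {
  SKETCH_RETURN_IF_ERROR_WITH_LOCATION(ExternalCall(fail));
  return Status{};
}
```

A status produced inside the codebase could already carry a location, which is why this sketch only fills the field when it is empty.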

…ansferData`

Modify `ReleaseGilAndTransferData` function to use proper status propagation instead of `GetValueOrThrow` with `GetComputationClientOrDie`. This improves error handling by allowing status types to be propagated up the call stack rather than immediately throwing exceptions.

Changes:

- Update function signature to return `absl::StatusOr<std::vector<xla::Literal>>`
- Replace `GetComputationClientOrDie()` with `GetComputationClient()`
- Use `XLA_ASSIGN_OR_RETURN` macros for both client acquisition and `TransferFromDevice`
- Update callers in tensor_util.cpp and xla_graph_executor.cpp to handle `StatusOr<T>`

This follows the status propagation patterns used elsewhere in the codebase and aligns with the examples in pjrt_registry.cpp.

Modify `XlaDataToTensors` function to use proper status propagation instead of `GetValueOrThrow`, and update all callers to handle the new `StatusOr<T>` return type. This continues the status propagation improvements started with `ReleaseGilAndTransferData`.

Changes:

- Update `XlaDataToTensors` signature to return `absl::StatusOr<std::vector<at::Tensor>>`
- Replace `GetValueOrThrow` with `XLA_ASSIGN_OR_RETURN` for `ReleaseGilAndTransferData` call
- Update all callers to use `GetValueOrThrow` wrapper:
  - `XLATensor::ToTensor` in tensor.cpp:515
  - test_xla_sharding.cpp:31
  - init_python_bindings.cpp:2716
  - xla_backend_impl.cpp:95
- Add necessary status.h includes to xla_backend_impl.cpp and test_xla_sharding.cpp

This maintains backward compatibility at the API level while enabling proper status propagation internally within the tensor conversion pipeline.

ysiraichi force-pushed the ysiraichi/propagate-status-for-oom branch from 9d505e7 to 5d4742b Compare July 1, 2025 16:44

ysiraichi force-pushed the ysiraichi/status-for-oom-errors branch from 247fdf5 to b390a61 Compare July 1, 2025 18:11

ysiraichi force-pushed the ysiraichi/propagate-status-for-oom branch from 5d4742b to 40a75d7 Compare July 1, 2025 18:11

ysiraichi force-pushed the ysiraichi/status-for-oom-errors branch from b390a61 to 821c384 Compare July 1, 2025 18:15

ysiraichi force-pushed the ysiraichi/propagate-status-for-oom branch from 40a75d7 to b0e25da Compare July 1, 2025 18:15

ysiraichi added 5 commits July 2, 2025 15:27

Fix status source code location logic.

0c5b199

- Refactor status tests
- Remove test_status.cpp as its tests are now covered by specialized context tests
- Both specialized tests now cover all status utility functions and macros

Modify nested error test.

19409cd

Add default capture.

d6ba41f

Propagate status on OOM crashes and exception.

08c5ecd

ysiraichi force-pushed the ysiraichi/propagate-status-for-oom branch from b0e25da to 97ef4c1 Compare July 3, 2025 14:41

ysiraichi force-pushed the ysiraichi/status-for-oom-errors branch from 821c384 to 08c5ecd Compare July 3, 2025 14:41

ysiraichi added 3 commits July 3, 2025 12:42

Test + Use *WITH_LOCATION macro for propagating the error.

048459c

ysiraichi force-pushed the ysiraichi/propagate-status-for-oom branch from 97ef4c1 to de09876 Compare July 3, 2025 15:42

ysiraichi force-pushed the ysiraichi/status-for-oom-errors branch 2 times, most recently from 46401cf to 0cc5400 Compare July 21, 2025 18:28
