-
Notifications
You must be signed in to change notification settings - Fork 24
[ADR] Graceful Device Initialization Failure Handling #230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
nateinaction
wants to merge
5
commits into
main
Choose a base branch
from
adr-safe-device-failure
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
a7da2f7
Add initial Safe Device Failure ADR
nateinaction 23e02dc
Change name to Graceful Device Failure
nateinaction bb36d82
Name change
nateinaction c390ae0
Merge branch 'main' into adr-safe-device-failure
ineskhou 0bcc479
Merge branch 'main' into adr-safe-device-failure
ineskhou File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,198 @@ | ||
| # Graceful Device Initialization Failure Handling | ||
|
|
||
| ## Status | ||
|
|
||
| Proposed | ||
|
|
||
| ## Context | ||
|
|
||
| This ADR proposes patterns allowing non-critical devices to fail while ensuring downstream services are not impacted. For example, if the IMU fails to initialize, we want to ensure that the satellite continues operation, but services such as de-tumbling can understand that the IMU is not available and adjust their behavior accordingly. In this case, the de-tumbling service would not engage magnetorquers, reaction wheels, or thrusters. | ||
|
|
||
| This ADR does not attempt to solve partial failures of a device where only some of its functionality is available or when it returns faulty data. This problem can be addressed in a future ADR. | ||
|
|
||
| ### Prior Implementation | ||
| In a previous version of PySquared, each device was initialized using a `try/except` block. [A dictionary of bools](https://github.com/proveskit/pysquared/blob/c03e6ed069168f7e6c5df0c228ec86429db51e7f/lib/pysquared/pysquared.py#L226-L245) representing device initialization state was passed to each initialization function. Each device would set a key to `True` if initialized. There were various checks throughout the codebase to determine if a device was initialized before use. | ||
|
|
||
| ### Current State | ||
| Currently, all device initialization failures are fatal, meaning that if a device fails to initialize, the entire root `try/except` in `main.py` will be triggered, and the satellite will restart. For example, the [IMU initialization](https://github.com/proveskit/CircuitPython_RP2040_v4/blob/4693e43a373030e74dad109c0858c9e36998f1df/main.py#L94) in the v4 board repo is not handling exceptions. | ||
|
|
||
| Possibly by accident, all current hardware device managers are safe to call after a failed initialization. As long as their initialization is wrapped in a `try/except` block, the device manager will return `None` for all subsequent calls for sensor readings. | ||
|
|
||
| ## Proposal | ||
|
|
||
| ### Pattern 1: Generic Device Protocol with `get_state()` | ||
|
|
||
| We propose to add a `get_state()` method to the generic device protocol. This method will return 1 of 3 possible states: `unknown`, `initialized`, or `failed` allowing downstream services to check the state of a device before attempting to use it. | ||
|
|
||
| #### Example Implementation | ||
|
|
||
| `protos/device.py` | ||
| ```python | ||
| class DeviceState: | ||
| unknown = 0 | ||
| initialized = 1 | ||
| failed = 2 | ||
|
|
||
| class DeviceProto: | ||
| def get_state(self) -> int: | ||
| """Get the current state of the device. | ||
|
|
||
| :return: The current state of the device as an integer. Possible values are: | ||
| - 0: Unknown | ||
| - 1: Initialized | ||
| - 2: Error | ||
| :rtype: int | ||
| """ | ||
| ... | ||
| ``` | ||
|
|
||
| `imu/manager/lsm6dsox.py` | ||
| ```diff | ||
| -class LSM6DSOXManager(IMUProto, TemperatureSensorProto): | ||
| +class LSM6DSOXManager(DeviceProto, IMUProto, TemperatureSensorProto): | ||
| """Manager class for creating LIS2MDL IMU instances. | ||
| The purpose of the manager class is to hide the complexity of IMU initialization from the caller. | ||
| Specifically we should try to keep adafruit_lsm6ds to only this manager class. | ||
| @@ -40,9 +41,19 @@ class LSM6DSOXManager(IMUProto, TemperatureSensorProto): | ||
| try: | ||
| self._log.debug("Initializing IMU") | ||
| self._imu: LSM6DSOX = LSM6DSOX(i2c, address) | ||
| + self._state = DeviceState.initialized | ||
| except Exception as e: | ||
| + self._state = DeviceState.failed | ||
| raise HardwareInitializationError("Failed to initialize IMU") from e | ||
|
|
||
| + def get_state(self) -> int: | ||
| + """Get the state of the IMU. | ||
| + | ||
| + :return: The state of the IMU. | ||
| + :rtype: int | ||
| + """ | ||
| + return self._state | ||
| ``` | ||
|
|
||
| `main.py` | ||
| ```py | ||
| from imu.manager.lsm6dsox import LSM6DSOXManager | ||
|
|
||
| # root try/except block | ||
| try: | ||
| try: | ||
| # Initialize the IMU | ||
| imu = LSM6DSOXManager(i2c, address) | ||
| except HardwareInitializationError as e: | ||
| logger.error("Failed to initialize IMU", e) | ||
|
|
||
| future_downstream_service(imu) | ||
|
|
||
| # Continue with the rest of the program | ||
| ``` | ||
|
|
||
| `future_downstream_service.py` | ||
| ```py | ||
| from protos.imu import IMUProto | ||
| from protos.device import DeviceState | ||
|
|
||
| def future_downstream_service(imu: IMUProto): | ||
| if imu.get_state() != DeviceState.initialized: | ||
| # Handle the case where the IMU is not initialized | ||
| pass | ||
|
|
||
| # Do something with the IMU | ||
| ``` | ||
|
|
||
| #### Pros | ||
| - Allows downstream services to check the state of a device before attempting to use it. | ||
| - State is encapsulated in the manager, and the manager is responsible for returning its state. | ||
| - State values are explicitly defined in the protocol and can be reused across all devices. | ||
|
|
||
| #### Neutral | ||
| - Attempting to emulate enum behavior with `DeviceState` in CircuitPython, where they are not officially supported, is not ideal. | ||
|
|
||
| #### Cons | ||
| - It's not obvious that the device manager will always be safe to call when the device is in a failed state. For example, in the failure case, `self._imu` will not exist, and any attempt to call it will raise an error `AttributeError: 'LSM6DSOXManager' object has no attribute '_imu'`. While all methods on current device managers are safe to call on failed initialization, there is no enforcement mechanism beyond peer review. We may also want device manager methods to fail in the future, and this proposal will prevent that. | ||
| - This pattern requires all device managers to implement the `get_state()` method, which burdens the developer and may lead to inconsistent implementations. | ||
|
|
||
| ### Pattern 2: Default Manager | ||
|
|
||
| We propose to add a failed manager for each type of device. Each failed manager will implement both the protocol for its device and a new `FailedDevice` protocol. Downstream services will be able to check if the device manager is an instance of the `FailedDevice` protocol and handle it accordingly. | ||
|
|
||
| #### Example Implementation | ||
|
|
||
| `protos/failed_device.py` | ||
| ```python | ||
| class FailedDevice: | ||
| """Protocol for a failed device manager.""" | ||
| ... | ||
| ``` | ||
|
|
||
| `imu/manager/failed.py` | ||
| ```py | ||
| from protos.imu import IMUProto | ||
| from protos.failed_device import FailedDevice | ||
| from logging import Logger | ||
|
|
||
| class FailedIMUManager(IMUProto, FailedDevice): | ||
| """Manager class for an IMU instance that failed to initialize.""" | ||
| def __init__(self, logger: Logger): | ||
| self._log = logger | ||
|
|
||
| def get_gyro_data(self) -> tuple[float, float, float] | None: | ||
| """Get the gyro data from the IMU. | ||
|
|
||
| :return: IMU is not initialized, will always return None. | ||
| :rtype: None | ||
| """ | ||
| self._log.warning("IMU is not initialized, gyro data not accessible") | ||
| return None | ||
| ``` | ||
|
|
||
|
|
||
| `main.py` | ||
| ```py | ||
| from imu.manager.lsm6dsox import LSM6DSOXManager | ||
| from imu.manager.failed import FailedIMUManager | ||
|
|
||
| # root try/except block | ||
| try: | ||
| try: | ||
| # Initialize the IMU | ||
| imu = LSM6DSOXManager(i2c, address) | ||
| except HardwareInitializationError as e: | ||
| imu = FailedIMUManager(logger) | ||
| logger.error("Failed to initialize IMU", e) | ||
|
|
||
| future_downstream_service(imu) | ||
|
|
||
| # Continue with the rest of the program | ||
| ``` | ||
|
|
||
| `future_downstream_service.py` | ||
| ```py | ||
| from protos.imu import IMUProto | ||
| from imu.manager.lsm6dsox import LSM6DSOXManager | ||
| from imu.manager.failed import FailedIMUManager | ||
|
|
||
| def future_downstream_service(imu: IMUProto): | ||
| if isinstance(imu, FailedIMUManager): | ||
| # Handle the case where the IMU is not initialized | ||
| pass | ||
|
|
||
| # Do something with the IMU | ||
| ``` | ||
|
|
||
| #### Pros | ||
| - Allows downstream services to check the state of a device before attempting to use it. | ||
| - It guarantees that device manager methods will always be safe to call when the device is in a failed state. The `Failed...Manager` is a specific protocol implementation that only needs to be written once and can be reused for all devices of the same type. | ||
| - This pattern does not require all device managers to implement the `get_state()` method and may lead to inconsistent implementations. | ||
|
|
||
| #### Neutral | ||
| - This pattern requires PySquared users to return a `FailedDevice` when dealing with a `HardwareInitializationError`. However, this still allows users to handle the error differently if they want to. | ||
|
|
||
| #### Cons | ||
| - Requires boilerplate `try/except` logic in the device initialization code to instantiate the `Failed...Manager`. | ||
| - Relies on `isinstance` checks downstream. | ||
|
|
||
| ## Decision | ||
|
|
||
| TBD | ||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.