Implement retry policy and enhance errored state handling #326
Merged
Changes from 4 commits
21 commits
8f7bcc1 Implement retry policy and enhance errored state handling (NiveditJain)
a00b97c Merge remote-tracking branch 'origin/main' into retries (NiveditJain)
f6a3bb2 Refactor retry policy and errored state handling (NiveditJain)
66c794a Add retry policy documentation and integrate into graph configuration (NiveditJain)
728b225 Enhance retry policy error handling and validation (NiveditJain)
ba13785 Update retry policy documentation and examples (NiveditJain)
6e02d50 Enhance retry policy implementation and documentation (NiveditJain)
64592d8 Enhance errored state handling with retry state management (NiveditJain)
33e75d3 Update state-manager/app/models/db/state.py (NiveditJain)
0c16bd7 Update docs/docs/exosphere/retry-policy.md (NiveditJain)
b0cabb0 Update docs/docs/exosphere/retry-policy.md (NiveditJain)
74da1b8 Update state-manager/app/controller/errored_state.py (NiveditJain)
06ee0a3 Update state-manager/app/controller/errored_state.py (NiveditJain)
fded75b Update state-manager/app/models/retry_policy_model.py (NiveditJain)
e56d5f5 Update max_delay description in RetryPolicyModel to clarify behavior … (NiveditJain)
42efd68 Refine documentation for retry policy and errored state handling (NiveditJain)
4ec30ab Enhance tests for errored state and upsert graph template (NiveditJain)
b1b85d1 Refactor test for RetryPolicyModel by removing unnecessary import (NiveditJain)
3806d26 Add comprehensive tests for errored state handling in graph templates (NiveditJain)
4048228 Refactor assertions in errored state tests for clarity (NiveditJain)
462cfde Remove Kubernetes deployment steps from the publish workflow (NiveditJain)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,276 @@ | ||
| # Retry Policy | ||
|
|
||
| !!! beta "Beta Feature" | ||
| Retry Policy is currently available in beta. The API and functionality may change in future releases. | ||
|
|
||
| The Retry Policy feature in Exosphere automatically retries workflow nodes that fail with transient errors. When a node execution fails, the retry policy determines whether and when to retry it, based on a configurable strategy. | ||
|
|
||
| ## Overview | ||
|
|
||
| Retry policies are configured at the graph level and apply to all nodes within that graph. When a node fails with an error, the state manager automatically creates a retry state with a calculated delay before the next execution attempt. | ||
|
|
||
| ## Configuration | ||
|
|
||
| Retry policies are defined in your graph template configuration: | ||
|
|
||
| ```json | ||
| { | ||
| "secrets": { | ||
| "api_key": "your-api-key" | ||
| }, | ||
| "nodes": [ | ||
| { | ||
| "node_name": "MyNode", | ||
| "namespace": "MyProject", | ||
| "identifier": "my_node", | ||
| "inputs": { | ||
| "data": "initial" | ||
| }, | ||
| "next_nodes": [] | ||
| } | ||
| ], | ||
| "retry_policy": { | ||
| "max_retries": 3, | ||
| "strategy": "EXPONENTIAL", | ||
| "backoff_factor": 2000, | ||
| "exponent": 2 | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| ## Parameters | ||
|
|
||
| ### max_retries | ||
| - **Type**: `int` | ||
| - **Default**: `3` | ||
| - **Description**: The maximum number of retry attempts before giving up | ||
| - **Constraints**: Must be >= 0 | ||
|
|
||
| ### strategy | ||
| - **Type**: `string` | ||
| - **Default**: `"EXPONENTIAL"` | ||
| - **Description**: The retry strategy to use for calculating delays | ||
| - **Options**: See [Retry Strategies](#retry-strategies) below | ||
|
|
||
| ### backoff_factor | ||
| - **Type**: `int` | ||
| - **Default**: `2000` (2 seconds) | ||
| - **Description**: The base delay factor in milliseconds | ||
| - **Constraints**: Must be > 0 | ||
|
|
||
| ### exponent | ||
| - **Type**: `int` | ||
| - **Default**: `2` | ||
| - **Description**: The exponent used for exponential strategies | ||
| - **Constraints**: Must be > 0 | ||
|
|
||
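The defaults and constraints listed above can be captured in a small validated container. This is a minimal sketch, not the project's actual model: the class name `RetryPolicySketch` and the `STRATEGIES` set are hypothetical, and the strategy names are copied from the Retry Strategies section of this document.

```python
from dataclasses import dataclass

# Strategy names as listed in this document; the set itself is illustrative.
STRATEGIES = {
    "EXPONENTIAL", "EXPONENTIAL_FULL_JITTER", "EXPONENTIAL_EQUAL_JITTER",
    "LINEAR", "LINEAR_FULL_JITTER", "LINEAR_EQUAL_JITTER",
    "FIXED", "FIXED_FULL_JITTER", "FIXED_EQUAL_JITTER",
}


@dataclass(frozen=True)
class RetryPolicySketch:
    """Hypothetical container mirroring the documented defaults."""

    max_retries: int = 3
    strategy: str = "EXPONENTIAL"
    backoff_factor: int = 2000  # milliseconds
    exponent: int = 2

    def __post_init__(self) -> None:
        # Enforce the documented constraints at construction time.
        if self.max_retries < 0:
            raise ValueError("max_retries must be >= 0")
        if self.strategy not in STRATEGIES:
            raise ValueError(f"unknown strategy: {self.strategy}")
        if self.backoff_factor <= 0:
            raise ValueError("backoff_factor must be > 0")
        if self.exponent <= 0:
            raise ValueError("exponent must be > 0")


policy = RetryPolicySketch()  # all documented defaults
print(policy)
```

Omitting every field yields the documented defaults, and a value outside the constraints raises `ValueError` before the policy can be used.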
| ## Retry Strategies | ||
|
|
||
| Exosphere supports three main categories of retry strategies, each with jitter variants to prevent thundering herd problems. | ||
|
|
||
| ### Exponential Strategies | ||
|
|
||
| Exponential strategies increase the delay exponentially with each retry attempt. | ||
|
|
||
| #### EXPONENTIAL | ||
| Standard exponential backoff without jitter. | ||
|
|
||
| **Formula**: `backoff_factor * (exponent ^ (retry_count - 1))`, where `retry_count` is 1 for the first retry | ||
|
|
||
| **Example**: | ||
| - Retry 1: 2000ms (2 seconds) | ||
| - Retry 2: 4000ms (4 seconds) | ||
| - Retry 3: 8000ms (8 seconds) | ||
|
|
||
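The example values can be reproduced with a few lines of arithmetic. This is a sketch under one assumption inferred from those values: `retry_count` is 1 for the first retry, so the power applied is `retry_count - 1`. The function name `exponential_delay_ms` is hypothetical.

```python
def exponential_delay_ms(backoff_factor: int, exponent: int, retry_count: int) -> int:
    """Delay in milliseconds before retry number `retry_count` (1-based)."""
    if retry_count < 1:
        raise ValueError("retry_count must be >= 1")
    # The first retry waits exactly backoff_factor; each later retry
    # multiplies the previous delay by `exponent`.
    return backoff_factor * exponent ** (retry_count - 1)


for attempt in (1, 2, 3):
    print(f"Retry {attempt}: {exponential_delay_ms(2000, 2, attempt)}ms")
```

With `backoff_factor=2000` and `exponent=2` this yields 2000, 4000, and 8000 milliseconds for the first three retries.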
| #### EXPONENTIAL_FULL_JITTER | ||
| Exponential backoff with full jitter (random delay between 0 and calculated delay). | ||
|
|
||
| **Formula**: `random(0, backoff_factor * (exponent ^ (retry_count - 1)))` | ||
|
|
||
| **Example**: | ||
| - Retry 1: 0-2000ms (random) | ||
| - Retry 2: 0-4000ms (random) | ||
| - Retry 3: 0-8000ms (random) | ||
|
|
||
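To see why full jitter helps with many concurrent failures, the sketch below draws delays for 1000 simulated states that all failed at the same moment. It assumes a uniform random draw and 1-based `retry_count` (inferred from the example ranges); the function name is hypothetical and the random source is passed in so the run is repeatable.

```python
import random


def exponential_full_jitter_ms(
    backoff_factor: int, exponent: int, retry_count: int, rng: random.Random
) -> float:
    """Uniform random delay between 0 and the un-jittered exponential delay."""
    ceiling = backoff_factor * exponent ** (retry_count - 1)
    return rng.uniform(0, ceiling)


rng = random.Random(42)  # seeded only to make the demonstration repeatable
delays = [exponential_full_jitter_ms(2000, 2, 2, rng) for _ in range(1000)]

# All second retries land somewhere in 0-4000ms instead of at the same instant.
print(f"earliest: {min(delays):.0f}ms, latest: {max(delays):.0f}ms")
```

Without jitter, all 1000 retries would be enqueued for the same millisecond; with full jitter they are spread across the whole interval.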
| #### EXPONENTIAL_EQUAL_JITTER | ||
| Exponential backoff with equal jitter (random delay between half the calculated delay and the full calculated delay). | ||
|
|
||
| **Formula**: `(backoff_factor * (exponent ^ (retry_count - 1))) / 2 + random(0, (backoff_factor * (exponent ^ (retry_count - 1))) / 2)` | ||
|
|
||
| **Example**: | ||
| - Retry 1: 1000-2000ms (random) | ||
| - Retry 2: 2000-4000ms (random) | ||
| - Retry 3: 4000-8000ms (random) | ||
|
|
||
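Equal jitter keeps a guaranteed minimum wait while still randomizing the rest. A minimal sketch, assuming a uniform draw and 1-based `retry_count` as implied by the example ranges (the function name is hypothetical):

```python
import random


def exponential_equal_jitter_ms(
    backoff_factor: int, exponent: int, retry_count: int, rng: random.Random
) -> float:
    """Half of the exponential delay is fixed, the other half is random."""
    half = backoff_factor * exponent ** (retry_count - 1) / 2
    return half + rng.uniform(0, half)


rng = random.Random(7)  # seeded for a repeatable demonstration
for attempt in (1, 2, 3):
    delay = exponential_equal_jitter_ms(2000, 2, attempt, rng)
    print(f"Retry {attempt}: {delay:.0f}ms")
```

The fixed half acts as a floor: a retry is never scheduled sooner than half the un-jittered delay, which full jitter cannot guarantee.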
| ### Linear Strategies | ||
|
|
||
| Linear strategies increase the delay linearly with each retry attempt. | ||
|
|
||
| #### LINEAR | ||
| Standard linear backoff without jitter. | ||
|
|
||
| **Formula**: `backoff_factor * retry_count` | ||
|
|
||
| **Example**: | ||
| - Retry 1: 2000ms (2 seconds) | ||
| - Retry 2: 4000ms (4 seconds) | ||
| - Retry 3: 6000ms (6 seconds) | ||
|
|
||
| #### LINEAR_FULL_JITTER | ||
| Linear backoff with full jitter. | ||
|
|
||
| **Formula**: `random(0, backoff_factor * retry_count)` | ||
|
|
||
| **Example**: | ||
| - Retry 1: 0-2000ms (random) | ||
| - Retry 2: 0-4000ms (random) | ||
| - Retry 3: 0-6000ms (random) | ||
|
|
||
| #### LINEAR_EQUAL_JITTER | ||
| Linear backoff with equal jitter. | ||
|
|
||
| **Formula**: `(backoff_factor * retry_count) / 2 + random(0, (backoff_factor * retry_count) / 2)` | ||
|
|
||
| **Example**: | ||
| - Retry 1: 1000-2000ms (random) | ||
| - Retry 2: 2000-4000ms (random) | ||
| - Retry 3: 3000-6000ms (random) | ||
|
|
||
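The three linear variants differ only in how the base delay `backoff_factor * retry_count` is randomized. The sketch below puts them side by side; the function name `linear_delay_ms` is hypothetical, and a uniform random draw is assumed for the jitter.

```python
import random


def linear_delay_ms(
    strategy: str, backoff_factor: int, retry_count: int, rng: random.Random
) -> float:
    """Delay in milliseconds for the three documented linear strategies."""
    base = backoff_factor * retry_count  # retry_count is 1 for the first retry
    if strategy == "LINEAR":
        return float(base)
    if strategy == "LINEAR_FULL_JITTER":
        return rng.uniform(0, base)
    if strategy == "LINEAR_EQUAL_JITTER":
        return base / 2 + rng.uniform(0, base / 2)
    raise ValueError(f"not a linear strategy: {strategy}")


rng = random.Random(0)  # seeded for a repeatable demonstration
for name in ("LINEAR", "LINEAR_FULL_JITTER", "LINEAR_EQUAL_JITTER"):
    print(name, [round(linear_delay_ms(name, 2000, n, rng)) for n in (1, 2, 3)])
```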
| ### Fixed Strategies | ||
|
|
||
| Fixed strategies use a constant delay for all retry attempts. | ||
|
|
||
| #### FIXED | ||
| Standard fixed delay without jitter. | ||
|
|
||
| **Formula**: `backoff_factor` | ||
|
|
||
| **Example**: | ||
| - Retry 1: 2000ms (2 seconds) | ||
| - Retry 2: 2000ms (2 seconds) | ||
| - Retry 3: 2000ms (2 seconds) | ||
|
|
||
| #### FIXED_FULL_JITTER | ||
| Fixed delay with full jitter. | ||
|
|
||
| **Formula**: `random(0, backoff_factor)` | ||
|
|
||
| **Example**: | ||
| - Retry 1: 0-2000ms (random) | ||
| - Retry 2: 0-2000ms (random) | ||
| - Retry 3: 0-2000ms (random) | ||
|
|
||
| #### FIXED_EQUAL_JITTER | ||
| Fixed delay with equal jitter. | ||
|
|
||
| **Formula**: `backoff_factor / 2 + random(0, backoff_factor / 2)` | ||
|
|
||
| **Example**: | ||
| - Retry 1: 1000-2000ms (random) | ||
| - Retry 2: 1000-2000ms (random) | ||
| - Retry 3: 1000-2000ms (random) | ||
|
|
||
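For the fixed strategies the retry number plays no part in the formula, which the following sketch makes explicit by not taking it as a parameter. The function name `fixed_delay_ms` is hypothetical and the jitter is assumed to be a uniform draw.

```python
import random


def fixed_delay_ms(strategy: str, backoff_factor: int, rng: random.Random) -> float:
    """Delay in milliseconds for the three documented fixed strategies."""
    if strategy == "FIXED":
        return float(backoff_factor)
    if strategy == "FIXED_FULL_JITTER":
        return rng.uniform(0, backoff_factor)
    if strategy == "FIXED_EQUAL_JITTER":
        half = backoff_factor / 2
        return half + rng.uniform(0, half)
    raise ValueError(f"not a fixed strategy: {strategy}")


rng = random.Random(1)  # seeded for a repeatable demonstration
print([fixed_delay_ms("FIXED", 2000, rng) for _ in range(3)])
print([round(fixed_delay_ms("FIXED_EQUAL_JITTER", 2000, rng)) for _ in range(3)])
```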
| ## Usage Examples | ||
|
|
||
| ### Basic Exponential Retry | ||
| ```json | ||
| { | ||
| "retry_policy": { | ||
| "max_retries": 3, | ||
| "strategy": "EXPONENTIAL", | ||
| "backoff_factor": 1000, | ||
| "exponent": 2 | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| ### Aggressive Retry with Jitter | ||
| ```json | ||
| { | ||
| "retry_policy": { | ||
| "max_retries": 5, | ||
| "strategy": "EXPONENTIAL_FULL_JITTER", | ||
| "backoff_factor": 500, | ||
| "exponent": 3 | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| ### Conservative Linear Retry | ||
| ```json | ||
| { | ||
| "retry_policy": { | ||
| "max_retries": 2, | ||
| "strategy": "LINEAR", | ||
| "backoff_factor": 5000 | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| ### Fixed Retry for Rate Limiting | ||
| ```json | ||
| { | ||
| "retry_policy": { | ||
| "max_retries": 10, | ||
| "strategy": "FIXED_EQUAL_JITTER", | ||
| "backoff_factor": 1000 | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
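When comparing configurations like the four above, it can help to estimate the longest total time a state might spend waiting across all of its retries. The sketch below sums the upper bound of each delay; the helper names are hypothetical, `retry_count` is assumed to be 1 for the first retry, and a missing `exponent` falls back to the documented default of 2.

```python
def max_delay_ms(policy: dict, retry_count: int) -> int:
    """Upper bound of the delay before retry `retry_count` (1-based)."""
    strategy = policy["strategy"]
    backoff = policy["backoff_factor"]
    if strategy.startswith("EXPONENTIAL"):
        return backoff * policy.get("exponent", 2) ** (retry_count - 1)
    if strategy.startswith("LINEAR"):
        return backoff * retry_count
    if strategy.startswith("FIXED"):
        return backoff
    raise ValueError(f"unknown strategy: {strategy}")


def worst_case_total_ms(policy: dict) -> int:
    """Sum of the delay upper bounds over every allowed retry."""
    return sum(max_delay_ms(policy, n) for n in range(1, policy["max_retries"] + 1))


examples = {
    "basic": {"max_retries": 3, "strategy": "EXPONENTIAL", "backoff_factor": 1000, "exponent": 2},
    "aggressive": {"max_retries": 5, "strategy": "EXPONENTIAL_FULL_JITTER", "backoff_factor": 500, "exponent": 3},
    "conservative": {"max_retries": 2, "strategy": "LINEAR", "backoff_factor": 5000},
    "rate_limit": {"max_retries": 10, "strategy": "FIXED_EQUAL_JITTER", "backoff_factor": 1000},
}
for name, policy in examples.items():
    print(f"{name}: at most {worst_case_total_ms(policy)}ms of waiting")
```

The totals cover only the waiting between attempts, not the time each node execution itself takes.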
| ## When Retries Are Triggered | ||
|
|
||
| Retries are automatically triggered when all of the following conditions hold: | ||
|
|
||
| 1. A node execution fails with an error | ||
| 2. The current retry count is less than `max_retries` | ||
| 3. The state status is `QUEUED` or `EXECUTED` | ||
|
|
||
| The retry mechanism: | ||
| - Creates a new state with `retry_count` incremented by 1 | ||
| - Sets `enqueue_after` to the current time plus the calculated delay | ||
| - Sets the original state status to `ERRORED` with the error message | ||
|
|
||
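The conditions and the resulting state changes described above can be sketched as a single planning function. Everything here is illustrative rather than the state manager's actual code: `StateSketch` and `plan_retry` are hypothetical names, the `"CREATED"` status given to the new state is a placeholder (this document does not say which status a retry state starts in), and what happens when the conditions are not met is left out of scope.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

RETRYABLE_STATUSES = ("QUEUED", "EXECUTED")


@dataclass(frozen=True)
class StateSketch:
    """Hypothetical, stripped-down stand-in for a node state."""

    status: str
    retry_count: int = 0
    enqueue_after: Optional[datetime] = None
    error: Optional[str] = None


def plan_retry(
    state: StateSketch, error: str, max_retries: int, delay_ms: int, now: datetime
) -> Optional[Tuple[StateSketch, StateSketch]]:
    """Return (errored original, new retry state), or None if no retry applies."""
    if state.status not in RETRYABLE_STATUSES or state.retry_count >= max_retries:
        return None
    # The original state records the failure.
    errored = replace(state, status="ERRORED", error=error)
    # The new state carries the incremented count and the earliest start time.
    retry = StateSketch(
        status="CREATED",  # placeholder status, an assumption of this sketch
        retry_count=state.retry_count + 1,
        enqueue_after=now + timedelta(milliseconds=delay_ms),
    )
    return errored, retry


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(plan_retry(StateSketch(status="QUEUED"), "timeout", 3, 2000, now))
```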
| ## Best Practices | ||
|
|
||
| ### Choose the Right Strategy | ||
| - **EXPONENTIAL**: Best for most transient failures (network issues, temporary service unavailability) | ||
| - **LINEAR**: Good for predictable, consistent delays | ||
| - **FIXED**: Useful for rate limiting scenarios | ||
|
|
||
| ### Use Jitter for High Concurrency | ||
| - **FULL_JITTER**: Best for high concurrency to prevent thundering herd | ||
| - **EQUAL_JITTER**: Good balance between predictability and randomization | ||
| - **No Jitter**: Use only when you need deterministic behavior | ||
|
|
||
| ### Set Appropriate Limits | ||
| - **max_retries**: Consider the nature of your failures and downstream dependencies | ||
| - **backoff_factor**: Balance between responsiveness and resource usage | ||
| - **exponent**: Higher values make the delay grow faster with each retry | ||
|
|
||
| ### Monitor Retry Patterns | ||
| - Track retry counts in your monitoring system | ||
| - Set up alerts for graphs with high retry rates | ||
| - Analyze retry patterns to identify systemic issues | ||
|
|
||
| ## Limitations | ||
|
|
||
| - Retry policies apply to all nodes in a graph uniformly | ||
| - Individual node-level retry policies are not supported | ||
| - Retry delays are calculated in milliseconds | ||
| - Maximum delay is not capped (consider using reasonable `backoff_factor` and `exponent` values) | ||
|
|
||
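Because the delay is not capped, it is worth checking how large the final delay can get before choosing `max_retries` and `exponent`. The following sketch does that arithmetic for an exponential strategy, assuming `retry_count` is 1 for the first retry; the function name is hypothetical.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def largest_exponential_delay_ms(backoff_factor: int, exponent: int, max_retries: int) -> int:
    """Un-jittered delay before the last allowed retry, in milliseconds."""
    if max_retries < 1:
        return 0  # no retries means no delay
    return backoff_factor * exponent ** (max_retries - 1)


# The default backoff_factor and exponent with a generous retry budget.
last_delay = largest_exponential_delay_ms(2000, 2, 20)
print(f"{last_delay}ms is about {last_delay / MS_PER_DAY:.1f} days")
```

With the default `backoff_factor` of 2000 and `exponent` of 2, the 20th retry alone would wait 1,048,576,000 milliseconds, which is roughly 12 days.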
| ## Error Handling | ||
|
|
||
| If a retry policy configuration is invalid: | ||
| - The graph template validation will fail | ||
| - An error will be returned during graph creation | ||
| - The graph will not be saved until the configuration is corrected | ||
|
|
||
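The validate-then-save behavior described above can be illustrated with an in-memory stand-in. This is a sketch only: `retry_policy_errors`, `save_graph_template`, and the dictionary used as a store are hypothetical, and the real validation and error format may differ.

```python
VALID_STRATEGIES = {
    prefix + suffix
    for prefix in ("EXPONENTIAL", "LINEAR", "FIXED")
    for suffix in ("", "_FULL_JITTER", "_EQUAL_JITTER")
}


def _is_int(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def retry_policy_errors(policy: dict) -> list:
    """Collect every constraint violation instead of stopping at the first."""
    errors = []
    max_retries = policy.get("max_retries", 3)
    backoff = policy.get("backoff_factor", 2000)
    exponent = policy.get("exponent", 2)
    if not _is_int(max_retries) or max_retries < 0:
        errors.append("max_retries must be an integer >= 0")
    if policy.get("strategy", "EXPONENTIAL") not in VALID_STRATEGIES:
        errors.append("strategy is not a supported retry strategy")
    if not _is_int(backoff) or backoff <= 0:
        errors.append("backoff_factor must be an integer > 0")
    if not _is_int(exponent) or exponent <= 0:
        errors.append("exponent must be an integer > 0")
    return errors


def save_graph_template(store: dict, name: str, template: dict) -> list:
    """Store the template only when its retry policy is valid."""
    errors = retry_policy_errors(template.get("retry_policy", {}))
    if not errors:
        store[name] = template
    return errors


store = {}
bad = {"retry_policy": {"max_retries": -1, "strategy": "RANDOM", "backoff_factor": 0}}
print(save_graph_template(store, "my_graph", bad))
print("saved" if "my_graph" in store else "not saved")
```

Returning all violations at once lets the caller correct the whole configuration in one pass before resubmitting.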
| ## Integration with Signals | ||
|
|
||
| Retry policies work alongside Exosphere's signal system: | ||
|
|
||
| - Nodes can still raise `PruneSignal` to stop retries immediately | ||
| - Nodes can raise `ReQueueAfterSignal` to re-queue themselves after a delay; this does not mark the node as failed | ||
| - The retry count is preserved when using signals | ||