Merged
Changes from 4 commits
Commits
21 commits
8f7bcc1
Implement retry policy and enhance errored state handling
NiveditJain Aug 31, 2025
a00b97c
Merge remote-tracking branch 'origin/main' into retries
NiveditJain Aug 31, 2025
f6a3bb2
Refactor retry policy and errored state handling
NiveditJain Aug 31, 2025
66c794a
Add retry policy documentation and integrate into graph configuration
NiveditJain Aug 31, 2025
728b225
Enhance retry policy error handling and validation
NiveditJain Aug 31, 2025
ba13785
Update retry policy documentation and examples
NiveditJain Aug 31, 2025
6e02d50
Enhance retry policy implementation and documentation
NiveditJain Aug 31, 2025
64592d8
Enhance errored state handling with retry state management
NiveditJain Aug 31, 2025
33e75d3
Update state-manager/app/models/db/state.py
NiveditJain Aug 31, 2025
0c16bd7
Update docs/docs/exosphere/retry-policy.md
NiveditJain Aug 31, 2025
b0cabb0
Update docs/docs/exosphere/retry-policy.md
NiveditJain Aug 31, 2025
74da1b8
Update state-manager/app/controller/errored_state.py
NiveditJain Aug 31, 2025
06ee0a3
Update state-manager/app/controller/errored_state.py
NiveditJain Aug 31, 2025
fded75b
Update state-manager/app/models/retry_policy_model.py
NiveditJain Aug 31, 2025
e56d5f5
Update max_delay description in RetryPolicyModel to clarify behavior …
NiveditJain Aug 31, 2025
42efd68
Refine documentation for retry policy and errored state handling
NiveditJain Aug 31, 2025
4ec30ab
Enhance tests for errored state and upsert graph template
NiveditJain Aug 31, 2025
b1b85d1
Refactor test for RetryPolicyModel by removing unnecessary import
NiveditJain Aug 31, 2025
3806d26
Add comprehensive tests for errored state handling in graph templates
NiveditJain Aug 31, 2025
4048228
Refactor assertions in errored state tests for clarity
NiveditJain Aug 31, 2025
462cfde
Remove Kubernetes deployment steps from the publish workflow
NiveditJain Aug 31, 2025
25 changes: 24 additions & 1 deletion docs/docs/exosphere/create-graph.md
@@ -51,7 +51,13 @@ One can define a graph on Exosphere through a simple json config, which specifie
},
"next_nodes": []
}
-]
+],
"retry_policy": {
"max_retries": 3,
"strategy": "EXPONENTIAL",
"backoff_factor": 2000,
"exponent": 2
}
}
```

@@ -126,6 +132,23 @@ Use the `${{ ... }}` syntax to map outputs from previous nodes:
- **`${{ node_identifier.outputs.field_name }}`**: Maps output from a specific node
- **`initial`**: Static value provided when the graph is triggered
- **Direct values**: String values. In v1, numbers/booleans must be string-encoded (e.g., "42", "true").

### Retry Policy

Graphs can include a retry policy to handle transient failures automatically. The retry policy is configured at the graph level and applies to all nodes within the graph.

```json
{
  "retry_policy": {
    "max_retries": 3,
    "strategy": "EXPONENTIAL",
    "backoff_factor": 2000,
    "exponent": 2
  }
}
```

For detailed information about retry policies, including all available strategies and configuration options, see the [Retry Policy](retry-policy.md) documentation.
## Creating Graph Templates

The recommended way to create graph templates is using the Exosphere Python SDK, which provides a clean interface to the State Manager API.
276 changes: 276 additions & 0 deletions docs/docs/exosphere/retry-policy.md
@@ -0,0 +1,276 @@
# Retry Policy

!!! beta "Beta Feature"
    Retry Policy is currently available in beta. The API and functionality may change in future releases.

The Retry Policy feature in Exosphere retries failed workflow nodes automatically so that transient failures do not end a run. When a node execution fails, the retry policy determines whether to retry and how long to wait before the next attempt, based on a configurable strategy.

## Overview

Retry policies are configured at the graph level and apply to all nodes within that graph. When a node fails with an error, the state manager automatically creates a retry state with a calculated delay before the next execution attempt.

## Configuration

Retry policies are defined in your graph template configuration:

```json
{
  "secrets": {
    "api_key": "your-api-key"
  },
  "nodes": [
    {
      "node_name": "MyNode",
      "namespace": "MyProject",
      "identifier": "my_node",
      "inputs": {
        "data": "initial"
      },
      "next_nodes": []
    }
  ],
  "retry_policy": {
    "max_retries": 3,
    "strategy": "EXPONENTIAL",
    "backoff_factor": 2000,
    "exponent": 2
  }
}
```

## Parameters

### max_retries
- **Type**: `int`
- **Default**: `3`
- **Description**: The maximum number of retry attempts before giving up
- **Constraints**: Must be >= 0

### strategy
- **Type**: `string`
- **Default**: `"EXPONENTIAL"`
- **Description**: The retry strategy to use for calculating delays
- **Options**: See [Retry Strategies](#retry-strategies) below

### backoff_factor
- **Type**: `int`
- **Default**: `2000` (2 seconds)
- **Description**: The base delay factor in milliseconds
- **Constraints**: Must be > 0

### exponent
- **Type**: `int`
- **Default**: `2`
- **Description**: The base of the exponential growth: the delay is multiplied by this value on each successive retry (used only by exponential strategies)
- **Constraints**: Must be > 0
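
The constraints listed above can be checked on the client side before a graph template is submitted. The following is a minimal sketch using only the standard library; the `RetryPolicy` dataclass is a hypothetical illustration of the documented defaults and constraints, not Exosphere's actual `RetryPolicyModel`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical client-side mirror of the documented parameters."""

    max_retries: int = 3
    strategy: str = "EXPONENTIAL"
    backoff_factor: int = 2000  # milliseconds
    exponent: int = 2

    def __post_init__(self) -> None:
        # Enforce the constraints documented for each parameter.
        if self.max_retries < 0:
            raise ValueError("max_retries must be >= 0")
        if self.backoff_factor <= 0:
            raise ValueError("backoff_factor must be > 0")
        if self.exponent <= 0:
            raise ValueError("exponent must be > 0")


# With no arguments, the policy carries the documented defaults.
default_policy = RetryPolicy()
```

Constructing `RetryPolicy(backoff_factor=0)` raises `ValueError` in this sketch, which mirrors the idea that an invalid policy should be rejected before the graph is saved.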

## Retry Strategies

Exosphere supports three main categories of retry strategies, each with jitter variants to prevent thundering herd problems.

### Exponential Strategies

Exponential strategies increase the delay exponentially with each retry attempt.

#### EXPONENTIAL
Standard exponential backoff without jitter.

**Formula**: `backoff_factor * (exponent ^ (retry_count - 1))`, where `retry_count` is 1 for the first retry

**Example**:
- Retry 1: 2000ms (2 seconds)
- Retry 2: 4000ms (4 seconds)
- Retry 3: 8000ms (8 seconds)
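
To make the arithmetic concrete, here is a small sketch that reproduces the example delays. The function name `exponential_delay` is illustrative, and the sketch assumes `retry_count` is 1 for the first retry so that the first retry waits exactly `backoff_factor`, as the example values imply.

```python
def exponential_delay(retry_count: int, backoff_factor: int = 2000, exponent: int = 2) -> int:
    """Delay in milliseconds before the given retry (1-based), without jitter."""
    if retry_count < 1:
        raise ValueError("retry_count must be >= 1")
    return backoff_factor * exponent ** (retry_count - 1)


delays = [exponential_delay(n) for n in (1, 2, 3)]
print(delays)  # [2000, 4000, 8000]
```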

#### EXPONENTIAL_FULL_JITTER
Exponential backoff with full jitter (random delay between 0 and calculated delay).

**Formula**: `random(0, backoff_factor * (exponent ^ (retry_count - 1)))`

**Example**:
- Retry 1: 0-2000ms (random)
- Retry 2: 0-4000ms (random)
- Retry 3: 0-8000ms (random)
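
Full jitter can be sketched by drawing uniformly between zero and the exponential ceiling. This is an illustration only: the helper name `exponential_full_jitter` and the injectable `rng` parameter are assumptions made for testability, and `retry_count` is taken to be 1 for the first retry.

```python
import random
from typing import Optional


def exponential_full_jitter(
    retry_count: int,
    backoff_factor: int = 2000,
    exponent: int = 2,
    rng: Optional[random.Random] = None,
) -> int:
    """Random delay in milliseconds between 0 and the exponential ceiling."""
    rng = rng or random.Random()
    ceiling = backoff_factor * exponent ** (retry_count - 1)
    return int(rng.uniform(0, ceiling))


# Many failing nodes retrying at once get spread across the whole window.
seeded = random.Random(42)
samples = [exponential_full_jitter(3, rng=seeded) for _ in range(1000)]
print(min(samples), max(samples))
```

Because every delay is drawn independently, simultaneous failures do not all retry at the same instant, which is the point of the jitter variants.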

#### EXPONENTIAL_EQUAL_JITTER
Exponential backoff with equal jitter (random delay between half of the calculated delay and the full calculated delay).

**Formula**: `(backoff_factor * (exponent ^ (retry_count - 1))) / 2 + random(0, (backoff_factor * (exponent ^ (retry_count - 1))) / 2)`

**Example**:
- Retry 1: 1000-2000ms (random)
- Retry 2: 2000-4000ms (random)
- Retry 3: 4000-8000ms (random)
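
Equal jitter keeps half of the calculated delay fixed and randomizes only the other half, so a retry never fires earlier than half the nominal delay. A minimal sketch follows, with the hypothetical helper name `exponential_equal_jitter` and the assumption that `retry_count` is 1 for the first retry.

```python
import random
from typing import Optional


def exponential_equal_jitter(
    retry_count: int,
    backoff_factor: int = 2000,
    exponent: int = 2,
    rng: Optional[random.Random] = None,
) -> int:
    """Random delay in milliseconds between half the base delay and the base delay."""
    rng = rng or random.Random()
    base = backoff_factor * exponent ** (retry_count - 1)
    half = base / 2
    # Guaranteed floor of `half`, plus a random share of the remaining half.
    return int(half + rng.uniform(0, half))


seeded = random.Random(7)
second_retry = [exponential_equal_jitter(2, rng=seeded) for _ in range(1000)]
print(min(second_retry), max(second_retry))
```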

### Linear Strategies

Linear strategies increase the delay linearly with each retry attempt.

#### LINEAR
Standard linear backoff without jitter.

**Formula**: `backoff_factor * retry_count`

**Example**:
- Retry 1: 2000ms (2 seconds)
- Retry 2: 4000ms (4 seconds)
- Retry 3: 6000ms (6 seconds)

#### LINEAR_FULL_JITTER
Linear backoff with full jitter.

**Formula**: `random(0, backoff_factor * retry_count)`

**Example**:
- Retry 1: 0-2000ms (random)
- Retry 2: 0-4000ms (random)
- Retry 3: 0-6000ms (random)

#### LINEAR_EQUAL_JITTER
Linear backoff with equal jitter.

**Formula**: `(backoff_factor * retry_count) / 2 + random(0, (backoff_factor * retry_count) / 2)`

**Example**:
- Retry 1: 1000-2000ms (random)
- Retry 2: 2000-4000ms (random)
- Retry 3: 3000-6000ms (random)
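
The three linear variants differ only in the range the delay is drawn from. The sketch below computes that range rather than a random sample, which makes the example tables easy to check; `linear_bounds` and its `jitter` mode names (`"NONE"`, `"FULL"`, `"EQUAL"`) are hypothetical and not part of Exosphere.

```python
from typing import Tuple


def linear_bounds(retry_count: int, backoff_factor: int = 2000, jitter: str = "NONE") -> Tuple[int, int]:
    """Return the (lowest, highest) possible delay in milliseconds for a retry."""
    base = backoff_factor * retry_count
    if jitter == "NONE":
        return (base, base)
    if jitter == "FULL":
        return (0, base)
    if jitter == "EQUAL":
        # Integer division; an odd base rounds the lower bound down.
        return (base // 2, base)
    raise ValueError(f"unknown jitter mode: {jitter}")


for mode in ("NONE", "FULL", "EQUAL"):
    print(mode, [linear_bounds(n, jitter=mode) for n in (1, 2, 3)])
```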

### Fixed Strategies

Fixed strategies use a constant delay for all retry attempts.

#### FIXED
Standard fixed delay without jitter.

**Formula**: `backoff_factor`

**Example**:
- Retry 1: 2000ms (2 seconds)
- Retry 2: 2000ms (2 seconds)
- Retry 3: 2000ms (2 seconds)

#### FIXED_FULL_JITTER
Fixed delay with full jitter.

**Formula**: `random(0, backoff_factor)`

**Example**:
- Retry 1: 0-2000ms (random)
- Retry 2: 0-2000ms (random)
- Retry 3: 0-2000ms (random)

#### FIXED_EQUAL_JITTER
Fixed delay with equal jitter.

**Formula**: `backoff_factor / 2 + random(0, backoff_factor / 2)`

**Example**:
- Retry 1: 1000-2000ms (random)
- Retry 2: 1000-2000ms (random)
- Retry 3: 1000-2000ms (random)
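
The nine strategy names follow a regular pattern: a family (`EXPONENTIAL`, `LINEAR`, or `FIXED`) plus an optional `_FULL_JITTER` or `_EQUAL_JITTER` suffix. The sketch below uses that pattern to compute a delay for any strategy name. It is an illustration under the assumption that `retry_count` is 1 for the first retry; `compute_delay_ms` is a hypothetical helper, not the `compute_delay` method of Exosphere's retry policy model.

```python
import random
from typing import Optional


def compute_delay_ms(
    strategy: str,
    retry_count: int,
    backoff_factor: int = 2000,
    exponent: int = 2,
    rng: Optional[random.Random] = None,
) -> int:
    """Delay in milliseconds for a strategy name such as 'LINEAR_FULL_JITTER'."""
    rng = rng or random.Random()
    # "FIXED_EQUAL_JITTER" -> ("FIXED", "_", "EQUAL_JITTER"); "FIXED" -> ("FIXED", "", "")
    family, _, jitter = strategy.partition("_")

    if family == "EXPONENTIAL":
        base = backoff_factor * exponent ** (retry_count - 1)
    elif family == "LINEAR":
        base = backoff_factor * retry_count
    elif family == "FIXED":
        base = backoff_factor
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    if jitter == "":
        return base
    if jitter == "FULL_JITTER":
        return int(rng.uniform(0, base))
    if jitter == "EQUAL_JITTER":
        return int(base / 2 + rng.uniform(0, base / 2))
    raise ValueError(f"unknown strategy: {strategy}")


print(compute_delay_ms("FIXED", 3))        # 2000
print(compute_delay_ms("LINEAR", 3))       # 6000
print(compute_delay_ms("EXPONENTIAL", 3))  # 8000
```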

## Usage Examples

### Basic Exponential Retry
```json
{
  "retry_policy": {
    "max_retries": 3,
    "strategy": "EXPONENTIAL",
    "backoff_factor": 1000,
    "exponent": 2
  }
}
```

### Aggressive Retry with Jitter
```json
{
  "retry_policy": {
    "max_retries": 5,
    "strategy": "EXPONENTIAL_FULL_JITTER",
    "backoff_factor": 500,
    "exponent": 3
  }
}
```

### Conservative Linear Retry
```json
{
  "retry_policy": {
    "max_retries": 2,
    "strategy": "LINEAR",
    "backoff_factor": 5000
  }
}
```

### Fixed Retry for Rate Limiting
```json
{
  "retry_policy": {
    "max_retries": 10,
    "strategy": "FIXED_EQUAL_JITTER",
    "backoff_factor": 1000
  }
}
```

## When Retries Are Triggered

Retries are automatically triggered when:

1. A node execution fails with an error
2. The current retry count is less than `max_retries`
3. The state status is `QUEUED` (an error report for a state that is already `EXECUTED` is rejected)

The retry mechanism:
- Creates a new state with `retry_count` incremented by 1
- Sets `enqueue_after` to the current time plus the calculated delay
- Sets the original state status to `ERRORED` with the error message
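
The retry mechanism described above can be sketched in plain Python without a database. Everything here is illustrative: `StateSketch` and `handle_error` are hypothetical stand-ins for the real state model and controller, and the retry state is given the status `CREATED` on the assumption that it follows the errored-state handler introduced in this change.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StateSketch:
    """Hypothetical, in-memory stand-in for a persisted state."""

    status: str
    retry_count: int = 0
    enqueue_after: int = 0  # epoch milliseconds
    error: Optional[str] = None


def handle_error(
    state: StateSketch,
    error: str,
    max_retries: int,
    compute_delay: Callable[[int], int],
    now_ms: Optional[int] = None,
) -> Optional[StateSketch]:
    """Mark `state` as errored and return a retry state if retries remain."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)

    retry_state = None
    if state.retry_count < max_retries:
        next_count = state.retry_count + 1
        retry_state = StateSketch(
            status="CREATED",
            retry_count=next_count,
            enqueue_after=now_ms + compute_delay(next_count),
        )

    # The original state is always marked as errored, retry or not.
    state.status = "ERRORED"
    state.error = error
    return retry_state


failed = StateSketch(status="QUEUED")
retry = handle_error(failed, "timeout", max_retries=3, compute_delay=lambda n: 2000 * n, now_ms=1_000)
print(failed.status, retry)
```

Once `retry_count` reaches `max_retries`, `handle_error` returns `None` and only the `ERRORED` state remains.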

## Best Practices

### Choose the Right Strategy
- **EXPONENTIAL**: Best for most transient failures (network issues, temporary service unavailability)
- **LINEAR**: Good for predictable, consistent delays
- **FIXED**: Useful for rate limiting scenarios

### Use Jitter for High Concurrency
- **FULL_JITTER**: Best for high concurrency to prevent thundering herd
- **EQUAL_JITTER**: Good balance between predictability and randomization
- **No Jitter**: Use only when you need deterministic behavior

### Set Appropriate Limits
- **max_retries**: Consider the nature of your failures and downstream dependencies
- **backoff_factor**: Balance between responsiveness and resource usage
- **exponent**: Higher values create more aggressive backoff

### Monitor Retry Patterns
- Track retry counts in your monitoring system
- Set up alerts for graphs with high retry rates
- Analyze retry patterns to identify systemic issues

## Limitations

- Retry policies apply to all nodes in a graph uniformly
- Individual node-level retry policies are not supported
- Retry delays are calculated in milliseconds
- Maximum delay is not capped (consider using reasonable `backoff_factor` and `exponent` values)
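
Because the delay is uncapped, it is worth computing the full schedule before choosing `backoff_factor` and `exponent`. The sketch below does this for the aggressive example configuration (`backoff_factor` 500, `exponent` 3, `max_retries` 5); for jitter strategies these values are upper bounds. The helper name `exponential_schedule` is hypothetical, and the sketch assumes `retry_count` is 1 for the first retry.

```python
from typing import List


def exponential_schedule(max_retries: int, backoff_factor: int, exponent: int) -> List[int]:
    """Nominal delay in milliseconds before each retry, from the first to the last."""
    return [backoff_factor * exponent ** (n - 1) for n in range(1, max_retries + 1)]


delays = exponential_schedule(max_retries=5, backoff_factor=500, exponent=3)
print(delays)       # [500, 1500, 4500, 13500, 40500]
print(sum(delays))  # 60500
```

With these settings the last retry alone waits up to about 40 seconds, and the whole sequence can take about a minute on top of execution time.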

## Error Handling

If a retry policy configuration is invalid:
- The graph template validation will fail
- An error will be returned during graph creation
- The graph will not be saved until the configuration is corrected

## Integration with Signals

Retry policies work alongside Exosphere's signal system:

- Nodes can still raise `PruneSignal` to stop retries immediately
- Nodes can raise `ReQueueAfterSignal` to re-queue after a delay; this does not mark the node as failed.
- The retry count is preserved when using signals
2 changes: 2 additions & 0 deletions docs/mkdocs.yml
@@ -101,6 +101,7 @@ plugins:
- exosphere/register-node.md
- exosphere/create-runtime.md
- exosphere/create-graph.md
- exosphere/retry-policy.md
- exosphere/trigger-graph.md
- exosphere/dashboard.md
- exosphere/signals.md
@@ -129,6 +130,7 @@ nav:
- Register Node: exosphere/register-node.md
- Create Runtime: exosphere/create-runtime.md
- Create Graph: exosphere/create-graph.md
- Retry Policy: exosphere/retry-policy.md
- Trigger Graph: exosphere/trigger-graph.md
- Dashboard: exosphere/dashboard.md
- Signals: exosphere/signals.md
29 changes: 28 additions & 1 deletion state-manager/app/controller/errored_state.py
@@ -1,10 +1,13 @@
import time

from app.models.errored_models import ErroredRequestModel, ErroredResponseModel
from fastapi import HTTPException, status
from beanie import PydanticObjectId

from app.models.db.state import State
from app.models.state_status_enum import StateStatusEnum
from app.singletons.logs_manager import LogsManager
from app.models.db.graph_template_model import GraphTemplate

logger = LogsManager().get_logger()

@@ -23,11 +26,35 @@ async def errored_state(namespace_name: str, state_id: PydanticObjectId, body: E
        if state.status == StateStatusEnum.EXECUTED:
            raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="State is already executed")

        graph_template = await GraphTemplate.get(namespace_name, state.graph_name)

        retry_created = False

        if state.retry_count < graph_template.retry_policy.max_retries:
            retry_state = State(
                node_name=state.node_name,
                namespace_name=state.namespace_name,
                identifier=state.identifier,
                graph_name=state.graph_name,
                run_id=state.run_id,
                status=StateStatusEnum.CREATED,
                inputs=state.inputs,
                outputs={},
                error=None,
                parents=state.parents,
                does_unites=state.does_unites,
                enqueue_after=int(time.time() * 1000) + graph_template.retry_policy.compute_delay(state.retry_count + 1),
                retry_count=state.retry_count + 1
            )
            retry_state = await retry_state.insert()
            logger.info(f"Retry state {retry_state.id} created for state {state_id}", x_exosphere_request_id=x_exosphere_request_id)
            retry_created = True

        state.status = StateStatusEnum.ERRORED
        state.error = body.error
        await state.save()

-        return ErroredResponseModel(status=StateStatusEnum.ERRORED)
+        return ErroredResponseModel(status=StateStatusEnum.ERRORED, retry_created=retry_created)

    except Exception as e:
        logger.error(f"Error errored state {state_id} for namespace {namespace_name}", x_exosphere_request_id=x_exosphere_request_id, error=e)