FailproofAI · NiveditJain · Aug 31, 2025
diff --git a/docs/docs/exosphere/create-graph.md b/docs/docs/exosphere/create-graph.md
@@ -51,7 +51,13 @@ One can define a graph on Exosphere through a simple json config, which specifie
       },
       "next_nodes": []
     }
-  ]
+  ],
+  "retry_policy": {
+    "max_retries": 3,
+    "strategy": "EXPONENTIAL",
+    "backoff_factor": 2000,
+    "exponent": 2
+  }
 }
 ```
 
@@ -126,6 +132,23 @@ Use the `${{ ... }}` syntax to map outputs from previous nodes:
 - **`${{ node_identifier.outputs.field_name }}`**: Maps output from a specific node
 - **`initial`**: Static value provided when the graph is triggered
 - **Direct values**: String values. In v1, numbers/booleans must be string-encoded (e.g., "42", "true").
+
+### Retry Policy
+
+Graphs can include a retry policy to handle transient failures automatically. The retry policy is configured at the graph level and applies to all nodes within the graph.
+
+```json
+{
+  "retry_policy": {
+    "max_retries": 3,
+    "strategy": "EXPONENTIAL",
+    "backoff_factor": 2000,
+    "exponent": 2
+  }
+}
+```
+
+For detailed information about retry policies, including all available strategies and configuration options, see the [Retry Policy](retry-policy.md) documentation.
 ## Creating Graph Templates
 
 The recommended way to create graph templates is using the Exosphere Python SDK, which provides a clean interface to the State Manager API.

diff --git a/docs/docs/exosphere/retry-policy.md b/docs/docs/exosphere/retry-policy.md
@@ -0,0 +1,276 @@
+# Retry Policy
+
+!!! beta "Beta Feature"
+    Retry Policy is currently available in beta. The API and functionality may change in future releases.
+
+The Retry Policy feature in Exosphere provides configurable retry mechanisms for handling transient failures in your workflow nodes. When a node execution fails, the retry policy determines whether to retry and how long to wait, based on the configured strategy.
+
+## Overview
+
+Retry policies are configured at the graph level and apply to all nodes within that graph. When a node fails with an error, the state manager automatically creates a retry state with a calculated delay before the next execution attempt.
+
+## Configuration
+
+Retry policies are defined in your graph template configuration:
+
+```json
+{
+  "secrets": {
+    "api_key": "your-api-key"
+  },
+  "nodes": [
+    {
+      "node_name": "MyNode",
+      "namespace": "MyProject",
+      "identifier": "my_node",
+      "inputs": {
+        "data": "initial"
+      },
+      "next_nodes": []
+    }
+  ],
+  "retry_policy": {
+    "max_retries": 3,
+    "strategy": "EXPONENTIAL",
+    "backoff_factor": 2000,
+    "exponent": 2
+  }
+}
+```
+
+## Parameters
+
+### max_retries
+- **Type**: `int`
+- **Default**: `3`
+- **Description**: The maximum number of retry attempts before giving up
+- **Constraints**: Must be >= 0
+
+### strategy
+- **Type**: `string`
+- **Default**: `"EXPONENTIAL"`
+- **Description**: The retry strategy to use for calculating delays
+- **Options**: See [Retry Strategies](#retry-strategies) below
+
+### backoff_factor
+- **Type**: `int`
+- **Default**: `2000` (2 seconds)
+- **Description**: The base delay factor in milliseconds
+- **Constraints**: Must be > 0
+
+### exponent
+- **Type**: `int`
+- **Default**: `2`
+- **Description**: The base of the exponential growth used by exponential strategies; each retry multiplies the delay by this value
+- **Constraints**: Must be > 0
+
+## Retry Strategies
+
+Exosphere supports three main categories of retry strategies, each with jitter variants to prevent thundering herd problems.
+
+### Exponential Strategies
+
+Exponential strategies increase the delay exponentially with each retry attempt.
+
+#### EXPONENTIAL
+Standard exponential backoff without jitter.
+
+**Formula**: `backoff_factor * (exponent ^ (retry_count - 1))`
+
+**Example**:
+- Retry 1: 2000ms (2 seconds)
+- Retry 2: 4000ms (4 seconds)
+- Retry 3: 8000ms (8 seconds)
+
+#### EXPONENTIAL_FULL_JITTER
+Exponential backoff with full jitter (random delay between 0 and calculated delay).
+
+**Formula**: `random(0, backoff_factor * (exponent ^ (retry_count - 1)))`
+
+**Example**:
+- Retry 1: 0-2000ms (random)
+- Retry 2: 0-4000ms (random)
+- Retry 3: 0-8000ms (random)
+
+#### EXPONENTIAL_EQUAL_JITTER
+Exponential backoff with equal jitter (random delay between half of the calculated delay and the full calculated delay).
+
+**Formula**: `(backoff_factor * (exponent ^ (retry_count - 1))) / 2 + random(0, (backoff_factor * (exponent ^ (retry_count - 1))) / 2)`
+
+**Example**:
+- Retry 1: 1000-2000ms (random)
+- Retry 2: 2000-4000ms (random)
+- Retry 3: 4000-8000ms (random)
+
+### Linear Strategies
+
+Linear strategies increase the delay linearly with each retry attempt.
+
+#### LINEAR
+Standard linear backoff without jitter.
+
+**Formula**: `backoff_factor * retry_count`
+
+**Example**:
+- Retry 1: 2000ms (2 seconds)
+- Retry 2: 4000ms (4 seconds)
+- Retry 3: 6000ms (6 seconds)
+
+#### LINEAR_FULL_JITTER
+Linear backoff with full jitter.
+
+**Formula**: `random(0, backoff_factor * retry_count)`
+
+**Example**:
+- Retry 1: 0-2000ms (random)
+- Retry 2: 0-4000ms (random)
+- Retry 3: 0-6000ms (random)
+
+#### LINEAR_EQUAL_JITTER
+Linear backoff with equal jitter.
+
+**Formula**: `(backoff_factor * retry_count) / 2 + random(0, (backoff_factor * retry_count) / 2)`
+
+**Example**:
+- Retry 1: 1000-2000ms (random)
+- Retry 2: 2000-4000ms (random)
+- Retry 3: 3000-6000ms (random)
+
+### Fixed Strategies
+
+Fixed strategies use a constant delay for all retry attempts.
+
+#### FIXED
+Standard fixed delay without jitter.
+
+**Formula**: `backoff_factor`
+
+**Example**:
+- Retry 1: 2000ms (2 seconds)
+- Retry 2: 2000ms (2 seconds)
+- Retry 3: 2000ms (2 seconds)
+
+#### FIXED_FULL_JITTER
+Fixed delay with full jitter.
+
+**Formula**: `random(0, backoff_factor)`
+
+**Example**:
+- Retry 1: 0-2000ms (random)
+- Retry 2: 0-2000ms (random)
+- Retry 3: 0-2000ms (random)
+
+#### FIXED_EQUAL_JITTER
+Fixed delay with equal jitter.
+
+**Formula**: `backoff_factor / 2 + random(0, backoff_factor / 2)`
+
+**Example**:
+- Retry 1: 1000-2000ms (random)
+- Retry 2: 1000-2000ms (random)
+- Retry 3: 1000-2000ms (random)
+
+## Usage Examples
+
+### Basic Exponential Retry
+```json
+{
+  "retry_policy": {
+    "max_retries": 3,
+    "strategy": "EXPONENTIAL",
+    "backoff_factor": 1000,
+    "exponent": 2
+  }
+}
+```
+
+### Aggressive Retry with Jitter
+```json
+{
+  "retry_policy": {
+    "max_retries": 5,
+    "strategy": "EXPONENTIAL_FULL_JITTER",
+    "backoff_factor": 500,
+    "exponent": 3
+  }
+}
+```
+
+### Conservative Linear Retry
+```json
+{
+  "retry_policy": {
+    "max_retries": 2,
+    "strategy": "LINEAR",
+    "backoff_factor": 5000
+  }
+}
+```
+
+### Fixed Retry for Rate Limiting
+```json
+{
+  "retry_policy": {
+    "max_retries": 10,
+    "strategy": "FIXED_EQUAL_JITTER",
+    "backoff_factor": 1000
+  }
+}
+```
+
+## When Retries Are Triggered
+
+Retries are automatically triggered when:
+
+1. A node execution fails with an error
+2. The current retry count is less than `max_retries`
+3. The state status is `QUEUED` (the handler rejects a state that is already `EXECUTED` with a 400 error)
+
+The retry mechanism:
+- Creates a new state with `retry_count` incremented by 1
+- Sets `enqueue_after` to the current time plus the calculated delay
+- Sets the original state status to `ERRORED` with the error message
+
+## Best Practices
+
+### Choose the Right Strategy
+- **EXPONENTIAL**: Best for most transient failures (network issues, temporary service unavailability)
+- **LINEAR**: Good for predictable, consistent delays
+- **FIXED**: Useful for rate limiting scenarios
+
+### Use Jitter for High Concurrency
+- **FULL_JITTER**: Best for high concurrency to prevent thundering herd
+- **EQUAL_JITTER**: Good balance between predictability and randomization
+- **No Jitter**: Use only when you need deterministic behavior
+
+### Set Appropriate Limits
+- **max_retries**: Consider the nature of your failures and downstream dependencies
+- **backoff_factor**: Balance between responsiveness and resource usage
+- **exponent**: Higher values create more aggressive backoff
+
+### Monitor Retry Patterns
+- Track retry counts in your monitoring system
+- Set up alerts for graphs with high retry rates
+- Analyze retry patterns to identify systemic issues
+
+## Limitations
+
+- Retry policies apply to all nodes in a graph uniformly
+- Individual node-level retry policies are not supported
+- Retry delays are calculated in milliseconds
+- Maximum delay is not capped (consider using reasonable `backoff_factor` and `exponent` values)
+
+## Error Handling
+
+If a retry policy configuration is invalid:
+- The graph template validation will fail
+- An error will be returned during graph creation
+- The graph will not be saved until the configuration is corrected
+
+## Integration with Signals
+
+Retry policies work alongside Exosphere's signal system:
+
+- Nodes can still raise `PruneSignal` to stop retries immediately
+- Nodes can raise `ReQueueAfterSignal` to re-queue after some time; this does not mark the node as failed
+- The retry count is preserved when using signals
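
Note on the delay formulas: the nine strategies documented above reduce to three base curves (exponential, linear, fixed) combined with three jitter modes (none, full, equal). The sketch below is a minimal illustration of that structure, not the state manager's actual `compute_delay`, which this diff does not show. The helper name `sketch_delay_ms` and its signature are hypothetical. It assumes `retry_count` is 1 for the first retry and that the exponential curve uses `retry_count - 1` as the power, which is what the documented examples (2000, 4000, 8000 ms) imply.

```python
import random


def sketch_delay_ms(strategy, retry_count, backoff_factor=2000, exponent=2, rng=random):
    """Hypothetical helper: delay in ms before retry number `retry_count` (1-based)."""
    # "EXPONENTIAL_FULL_JITTER" -> family "EXPONENTIAL", jitter "FULL_JITTER"
    family, _, jitter = strategy.partition("_")

    if family == "EXPONENTIAL":
        # Assumption: the first retry waits exactly backoff_factor.
        base = backoff_factor * exponent ** (retry_count - 1)
    elif family == "LINEAR":
        base = backoff_factor * retry_count
    elif family == "FIXED":
        base = backoff_factor
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    if jitter == "":
        return base  # deterministic delay
    if jitter == "FULL_JITTER":
        return rng.randint(0, base)  # anywhere in [0, base]
    if jitter == "EQUAL_JITTER":
        half = base // 2
        return half + rng.randint(0, half)  # anywhere in [base/2, base]
    raise ValueError(f"unknown strategy: {strategy}")


# Deterministic strategies reproduce the documented example tables.
print([sketch_delay_ms("EXPONENTIAL", n) for n in (1, 2, 3)])
print([sketch_delay_ms("LINEAR", n) for n in (1, 2, 3)])
```

Because no maximum delay is documented, a caller of a helper like this would have to clamp the result itself if an upper bound matters.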
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
@@ -101,6 +101,7 @@ plugins:
           - exosphere/register-node.md
           - exosphere/create-runtime.md
           - exosphere/create-graph.md
+          - exosphere/retry-policy.md
           - exosphere/trigger-graph.md
           - exosphere/dashboard.md
           - exosphere/signals.md
@@ -129,6 +130,7 @@ nav:
   - Register Node: exosphere/register-node.md 
   - Create Runtime: exosphere/create-runtime.md
   - Create Graph: exosphere/create-graph.md
+  - Retry Policy: exosphere/retry-policy.md
   - Trigger Graph: exosphere/trigger-graph.md  
   - Dashboard: exosphere/dashboard.md
   - Signals: exosphere/signals.md

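Note on the controller change that follows: the retry decision in `errored_state` comes down to one comparison and one timestamp calculation. The sketch below isolates that logic as a pure function so it can be reasoned about without a database. `plan_retry` and its parameters are hypothetical names for illustration; only the comparison `retry_count < max_retries` and the `int(time.time() * 1000) + compute_delay(retry_count + 1)` calculation are taken from the diff.

```python
import time
from typing import Callable, Optional


def plan_retry(
    retry_count: int,
    max_retries: int,
    compute_delay: Callable[[int], int],
    now_ms: Optional[int] = None,
) -> Optional[dict]:
    """Hypothetical helper: fields for the retry state, or None if retries are exhausted."""
    if retry_count >= max_retries:
        return None  # no retry state is created; the original state is only marked ERRORED

    if now_ms is None:
        now_ms = int(time.time() * 1000)  # epoch milliseconds, as in the handler

    next_count = retry_count + 1
    return {
        "retry_count": next_count,
        # The delay is computed for the upcoming attempt, not the failed one.
        "enqueue_after": now_ms + compute_delay(next_count),
    }


# A linear 2000 ms policy stands in for the real retry policy here.
print(plan_retry(0, 3, lambda n: 2000 * n, now_ms=1_000))
print(plan_retry(3, 3, lambda n: 2000 * n, now_ms=1_000))
```

With `max_retries` set to 0 the function always returns `None`, which matches the documented constraint that 0 is a valid value meaning no retries.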
diff --git a/state-manager/app/controller/errored_state.py b/state-manager/app/controller/errored_state.py
@@ -1,10 +1,13 @@
+import time
+
 from app.models.errored_models import ErroredRequestModel, ErroredResponseModel
 from fastapi import HTTPException, status
 from beanie import PydanticObjectId
 
 from app.models.db.state import State
 from app.models.state_status_enum import StateStatusEnum
 from app.singletons.logs_manager import LogsManager
+from app.models.db.graph_template_model import GraphTemplate
 
 logger = LogsManager().get_logger()
 
@@ -23,11 +26,35 @@ async def errored_state(namespace_name: str, state_id: PydanticObjectId, body: E
         if state.status == StateStatusEnum.EXECUTED:
             raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="State is already executed")
 
+        graph_template = await GraphTemplate.get(namespace_name, state.graph_name)
+
+        retry_created = False
+
+        if state.retry_count < graph_template.retry_policy.max_retries:
+            retry_state = State(
+                node_name=state.node_name,
+                namespace_name=state.namespace_name,
+                identifier=state.identifier,
+                graph_name=state.graph_name,
+                run_id=state.run_id,
+                status=StateStatusEnum.CREATED,
+                inputs=state.inputs,
+                outputs={},
+                error=None,
+                parents=state.parents,
+                does_unites=state.does_unites,
+                enqueue_after=int(time.time() * 1000) + graph_template.retry_policy.compute_delay(state.retry_count + 1),
+                retry_count=state.retry_count + 1
+            )
+            retry_state = await retry_state.insert()
+            logger.info(f"Retry state {retry_state.id} created for state {state_id}", x_exosphere_request_id=x_exosphere_request_id)
+            retry_created = True
+
         state.status = StateStatusEnum.ERRORED
         state.error = body.error
         await state.save()
 
-        return ErroredResponseModel(status=StateStatusEnum.ERRORED)
+        return ErroredResponseModel(status=StateStatusEnum.ERRORED, retry_created=retry_created)
 
     except Exception as e:
         logger.error(f"Error errored state {state_id} for namespace {namespace_name}", x_exosphere_request_id=x_exosphere_request_id, error=e)