# Hedge: "Don't cancel on first error" exploration

## Current behavior

- **CancelIf** in the hedge policy cancels other hedges when:
  - This execution returns **any** non-exhaustion error, or
  - This execution returns an **accepted** result (non-empty, or empty but in `emptyResultAccept` and not consensus).
- We explicitly **do not** cancel on `ErrUpstreamsExhausted` / `ErrCodeNoUpstreamsLeftToSelect` so other hedges can still finish.

So: "first to return (result or error) wins". We cancel as soon as one execution returns, unless that return is exhaustion.
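The rules above can be sketched as a standalone predicate. This is a minimal illustration, not the project's actual `CancelIf`: the `result` struct, its fields, and the sentinel errors are hypothetical stand-ins for the real response and error types.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels. errExhausted stands in for ErrUpstreamsExhausted /
// ErrCodeNoUpstreamsLeftToSelect; the real project defines its own error types.
var (
	errExhausted   = errors.New("upstreams exhausted")
	errUnsupported = errors.New("method not supported")
)

// result is a hypothetical, simplified view of one hedge execution's response.
type result struct {
	empty         bool // response carried no data
	emptyAccepted bool // stands in for the emptyResultAccept check
	consensus     bool // request runs under a consensus policy
}

// cancelOthersCurrent mirrors the rules listed above: cancel on any
// non-exhaustion error, or on an accepted result.
func cancelOthersCurrent(res *result, err error) bool {
	if err != nil {
		return !errors.Is(err, errExhausted)
	}
	if res == nil {
		return false
	}
	if !res.empty {
		return true // non-empty results are always accepted
	}
	return res.emptyAccepted && !res.consensus
}

func main() {
	fmt.Println("upstream error:   ", cancelOthersCurrent(nil, errUnsupported))
	fmt.Println("exhaustion:       ", cancelOthersCurrent(nil, errExhausted))
	fmt.Println("non-empty result: ", cancelOthersCurrent(&result{}, nil))
	fmt.Println("unaccepted empty: ", cancelOthersCurrent(&result{empty: true}, nil))
}
```

Note that the error branch does not look at the kind of error at all, apart from the exhaustion check.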

## Why "don't cancel on first error" was not the default

1. **Latency when all upstreams fail**
   If we only cancelled on success, then when every hedge fails we would wait for the **slowest** execution instead of returning as soon as the first one fails. Example: Alchemy errors in 100ms, QuickNode in 3s → today we return in ~100ms; if we stopped cancelling on error we’d wait ~3s for the same outcome.

2. **Original mental model**
   With a single execution (no hedge), "first response" is the only response. Adding hedge kept the same idea: "first response (success or failure) wins" to avoid extra wait when the first response is already a failure.

3. **Loop-over-upstreams**
   Each execution runs a **loop** over upstreams (`maxLoopIterations` = 1 in consensus, `UpstreamsCount()` otherwise). With hedge, executions **share** `ConsumedUpstreams`, so:
   - Execution 1 often gets upstream A, execution 2 gets B.
   - Execution 1 can return after **one** upstream (e.g. error from A, then next `NextUpstream` hits duplicate or exhausted and the loop breaks).
   - So "first return" is often "first error from one upstream", not "tried everything". Letting that first error cancel the other hedge (e.g. QuickNode) is what causes the trace_filter case: Alchemy errors fast, we cancel QuickNode which would have succeeded in 2–3s.

So the loop doesn’t remove the need for "don’t cancel on first error"; it’s what makes the current "cancel on any return" strict: one execution can exit quickly with an error and kill the other.
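To make the loop interaction concrete, here is a small sketch of two executions sharing one consumed-upstreams set. The `consumed` type, `next`, and `runExecution` are hypothetical simplifications of `ConsumedUpstreams` and `NextUpstream`, and the hedge is simulated by claiming the second upstream while the first call is still in flight.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errUnsupported = errors.New("method not supported")

// consumed is a hypothetical stand-in for the shared ConsumedUpstreams set.
type consumed struct {
	mu   sync.Mutex
	seen map[string]bool
}

// next claims and returns the first upstream no execution has taken yet.
func (c *consumed) next(upstreams []string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, u := range upstreams {
		if !c.seen[u] {
			c.seen[u] = true
			return u, true
		}
	}
	return "", false
}

// runExecution loops over upstreams like one hedge execution and returns
// the upstreams it actually tried plus its final error.
func runExecution(c *consumed, upstreams []string, call func(string) error) ([]string, error) {
	var tried []string
	var lastErr error
	for i := 0; i < len(upstreams); i++ {
		u, ok := c.next(upstreams)
		if !ok {
			break // nothing left to select; keep the last upstream error
		}
		tried = append(tried, u)
		if lastErr = call(u); lastErr == nil {
			return tried, nil
		}
	}
	return tried, lastErr
}

// hedgedCall simulates the hedge firing while alchemy is in flight: the
// second execution claims quicknode before alchemy's error comes back.
func hedgedCall(c *consumed, upstreams []string) func(string) error {
	return func(u string) error {
		if u == "alchemy" {
			c.next(upstreams)
			return errUnsupported
		}
		return nil
	}
}

func main() {
	ups := []string{"alchemy", "quicknode"}
	c := &consumed{seen: map[string]bool{}}
	tried, err := runExecution(c, ups, hedgedCall(c, ups))
	fmt.Println("execution 1 tried:", tried, "error:", err)
}
```

In this sketch, execution 1 exits after a single upstream with that upstream's error, so its return says nothing about whether the upstream claimed by the other execution would have succeeded.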

## Side effects of "don’t cancel on first error"

| Scenario | Current (cancel on any return) | Don’t cancel on error |
| --- | --- | --- |
| First returns **success** | Other hedges cancelled ✓ | Same ✓ |
| First returns **error**, another would succeed | Other hedges cancelled ✗ (e.g. QuickNode discarded) | Other can complete ✓ |
| All return **errors** | Return after first error (low latency) ✓ | Wait for slowest (higher latency) ✗ |
| First returns **client/execution** error (same everywhere) | Other hedges cancelled, we return quickly ✓ | We’d still wait for others for no benefit ✗ |

So a good refinement is: **cancel only on terminal errors**, not on every error.

- **Terminal errors** (cancel others): same on every upstream, no need to wait for more.
  - `IsClientError(err)` (bad request, range exceeded, etc.)
  - `HasErrorCode(err, ErrCodeEndpointExecutionException)` (execution exception, e.g. a revert)
- **Non-terminal errors** (don’t cancel): method not supported, 5xx, timeout, missing data, etc.; another hedge might succeed.
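The split can be expressed as a single classification helper. The sketch below uses hypothetical sentinel errors and `errors.Is` purely for illustration; the real code would rely on `common.IsClientError` and `common.HasErrorCode` rather than these stand-ins.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for the project's error taxonomy.
var (
	errClient             = errors.New("client error")
	errExecutionException = errors.New("execution exception")
	errMethodNotSupported = errors.New("method not supported")
	errUpstreamTimeout    = errors.New("upstream timeout")
)

// isTerminal reports whether every upstream would return the same failure,
// in which case waiting for other hedges cannot change the outcome.
func isTerminal(err error) bool {
	return errors.Is(err, errClient) || errors.Is(err, errExecutionException)
}

func main() {
	// Wrapping keeps the classification intact because errors.Is unwraps.
	wrapped := fmt.Errorf("upstream alchemy: %w", errExecutionException)
	for _, err := range []error{errClient, wrapped, errMethodNotSupported, errUpstreamTimeout} {
		fmt.Printf("%-40v terminal=%v\n", err, isTerminal(err))
	}
}
```

Keeping the classification in one helper means the cancel predicate and any retry logic can share the same definition of "terminal".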

## Recommended refinement

In `CancelIf`, instead of:

```go
if err != nil {
	return true // cancel on any error
}
```

use:

```go
if err != nil {
	// Cancel only on terminal errors; let other hedges complete on transient/upstream-specific errors.
	if common.IsClientError(err) || common.HasErrorCode(err, common.ErrCodeEndpointExecutionException) {
		return true
	}
	return false
}
```

Effects:

- **trace_filter**: Alchemy returns "method not supported" → we don’t cancel → QuickNode can return success (or its own error) → we use the first success or aggregate failures.
- **All fail** (non-terminal errors): We wait for all hedges to finish, then failsafe/retry sees the errors. Latency is max of hedge durations instead of min; acceptable if we prefer success when any single hedge can succeed.
- **Client/revert**: We still cancel on first terminal error and return quickly.
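These three effects can be checked with a toy replay of hedge completions. This is a simplified model under stated assumptions: all hedges start at t=0, exhaustion is ignored, and the `hedge` type, the scenario variables, and the sentinel errors are hypothetical. The durations reuse the 100ms / 3s example from earlier.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"time"
)

var (
	errMethodNotSupported = errors.New("method not supported") // non-terminal
	errTimeout            = errors.New("upstream timeout")     // non-terminal
	errClient             = errors.New("client error")         // terminal
)

// hedge records when one execution returns and with what (nil err = success).
type hedge struct {
	name  string
	after time.Duration
	err   error
}

// Scenarios matching the three bullets above.
var (
	traceFilter = []hedge{{"alchemy", 100 * time.Millisecond, errMethodNotSupported}, {"quicknode", 3 * time.Second, nil}}
	allFail     = []hedge{{"alchemy", 100 * time.Millisecond, errMethodNotSupported}, {"quicknode", 3 * time.Second, errTimeout}}
	clientError = []hedge{{"alchemy", 100 * time.Millisecond, errClient}, {"quicknode", 3 * time.Second, errClient}}
)

func cancelOnAnyError(error) bool     { return true }
func cancelOnTerminal(err error) bool { return errors.Is(err, errClient) }

// settle replays hedges in completion order and reports when the caller
// gets an answer and which error (if any) it sees.
func settle(hedges []hedge, cancelOnErr func(error) bool) (time.Duration, error) {
	ordered := append([]hedge(nil), hedges...)
	sort.Slice(ordered, func(i, j int) bool { return ordered[i].after < ordered[j].after })
	var errs []error
	var last time.Duration
	for _, h := range ordered {
		last = h.after
		if h.err == nil {
			return h.after, nil // a success always ends the race
		}
		errs = append(errs, h.err)
		if cancelOnErr(h.err) {
			return h.after, h.err
		}
	}
	return last, errors.Join(errs...) // every hedge failed: aggregate
}

func main() {
	for name, s := range map[string][]hedge{"trace_filter": traceFilter, "all fail": allFail, "client": clientError} {
		d1, e1 := settle(s, cancelOnAnyError)
		d2, e2 := settle(s, cancelOnTerminal)
		fmt.Printf("%s: any-error=%v (%v) terminal-only=%v (%v)\n", name, d1, e1, d2, e2)
	}
}
```

Under these assumptions, the terminal-only policy costs extra latency only in the all-fail case, which is the trade-off described above.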

## Summary

- We didn’t avoid "don’t cancel on first error" because of the loop; the loop is why one execution often returns quickly with one upstream’s error and then we cancel the rest.
- Full "don’t cancel on error" would fix the trace_filter case but worsen latency when every hedge fails.
- **Cancel only on terminal error** (client + execution exception) keeps latency good for deterministic failures and lets other hedges complete when the first failure is upstream-specific (e.g. method not supported).