-
Notifications
You must be signed in to change notification settings - Fork 245
DRIVERS-3239: Add exponential backoff to operation retry loop for server overloaded errors #1862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 8 commits
588f1f2
e467f5b
d55fdb9
8e74b41
072b453
52e2a35
391c951
92501c0
0fdef39
82acab8
b3a7b6c
60a87b8
399a56b
0545e15
ff5475a
6211624
c1001bc
def5fbd
08da5c4
034b85e
779e171
e5d4de6
1cd95fc
6912e45
88e6067
2019678
d3ce32b
de7e862
ab85a5b
eb10ddb
2b90697
0abf373
d07d49e
747c18c
be27bc0
92e479a
b674f13
03065ad
1b7f6df
eac90fc
0533acc
629aec9
9e4ad70
d4d0b38
530e727
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| # Client Backpressure Tests | ||
|
|
||
| ______________________________________________________________________ | ||
|
|
||
| ## Introduction | ||
|
|
||
| The YAML and JSON files in this directory are platform-independent tests meant to exercise a driver's implementation of | ||
| retryable reads. These tests utilize the [Unified Test Format](../../unified-test-format/unified-test-format.md). | ||
|
|
||
| Several prose tests, which are not easily expressed in YAML, are also presented in this file. Those tests will need to | ||
| be manually implemented by each driver. | ||
|
|
||
| ### Prose Tests | ||
|
|
||
| #### Test 1: Operation Retry Uses Exponential Backoff | ||
|
|
||
| Drivers should test that retries do not occur immediately when a SystemOverloadedError is encountered. | ||
|
|
||
| 1. let `client` be a `MongoClient` | ||
| 2. let `collection` be a collection | ||
| 3. Now, run transactions without backoff: | ||
| 1. Configure the random number generator used for jitter to always return `0` -- this effectively disables backoff. | ||
|
|
||
| 2. Configure the following failPoint: | ||
|
|
||
| ```javascript | ||
| { | ||
| configureFailPoint: 'failCommand', | ||
| mode: 'alwaysOn', | ||
| data: { | ||
| failCommands: ['insert'], | ||
jmikola marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| errorCode: 2, | ||
| errorLabels: ['SystemOverloadedError', 'RetryableError'] | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| 3. Execute the following command. Expect that the command errors. Measure the duration of the command execution. | ||
|
||
|
|
||
| ```javascript | ||
| const start = performance.now(); | ||
| expect( | ||
| await coll.insertOne({ a: 1 }).catch(e => e) | ||
| ).to.be.an.instanceof(MongoServerError); | ||
| const end = performance.now(); | ||
| ``` | ||
|
|
||
| 4. Configure the random number generator used for jitter to always return `1`. | ||
|
|
||
| 5. Execute step 3 again. | ||
|
|
||
| 6. Compare the two time between the two runs. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I assume the following lines should all appear under element six. If so, indentation is needed (check preview). |
||
| ```python | ||
| assertTrue(absolute_value(with_backoff_time - (no_backoff_time + 3.1 seconds)) < 1) | ||
| ``` | ||
Jibola marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| The sum of 5 backoffs is 3.1 seconds. There is a 1-second window to account for potential variance between the two | ||
| runs. | ||
|
|
||
| ## Changelog | ||
|
|
||
| - 2025-XX-XX: Initial version. | ||
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This specification introduces the overload retry policy, but similarly to the design, omits a very important piece: how should the current retry policy and the overload retry policy coexist? At the very least, the specification should cover the following (generally speaking, it's better if it does that by clearly expressing a principle from which the answers may be easily derived, rather than answering each question explicitly, as there may be more questions that have to be answered that I did not think about at the moment):
1.1. I suspect, it currently is not, because the overload retry policy for now requires both
RetryableErrorandSystemOverloadedErrorto be present. However, he specification should make the answer clear.2.1. The same question is for two attempts
a(n),a(n+1)where the latter immediately1 follows the former, with the former,a(n), not being the first attempt.2.1.1. Note that such a situation may be encountered more than once for a single operation.
3.1. The same question is for two attempts
a(n),a(n+1)where the latter immediately1 follows the former, with the former,a(n), not being the first attempt.3.1.1. Note that such a situation may be encountered more than once for a single operation.
1 In terms of ordering relations, not in the temporal sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The updated pseudocode should answer these questions - let me know if there's anything else you'd like clarified.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the requirements should be specified in the prose part of the specification. A reader should not have to look through the pseudocode or the tests to find any requirements missing from the propose part.
Once the prose part is complete, it will make sense to review the pseudocode again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe all of these questions are now reflected in both the prose and the pseudocode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think 1 and 4 have been answered (whether the answer to question 1 is important depends on what the answer to question 4 is).
I also realized that at least one more thing is missing:
retryable-writes.mdspecification instruct a driver to modify the command when retrying under the write retry policy,transactions.mdinstructs the opposite - not to modifytxnNumberwhen retrying under the write retry policy. I did not find a place specifying that the same command modifications, or the lack of thereof, are to be applied when retrying under the overload retry policy. I believe, the right place to specify this requirement, given the current structure of the specifications, is the "Overload retry policy" section inclient-backpressure.md. When specifying it, we should not duplicate what is said in theretryable-writes.md,transactions.mdspecifications, but instead refer to them, saying that the command has to be modified, or not, in accordance to them.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, r.e. your new bullet point and point 4 from above. Both are now addressed in the prose.
R.e. 1: In the prose or pseudocode, I don't think it matters either way: a command would be retried in either scenario, and backoff is only applied if it is an overload error.
I considered adding a note to clarify this but decided against it. We have https://jira.mongodb.org/browse/DRIVERS-3352 planned, which will make
RetryableErrora criteria for retryable reads/writes (and removing it as a requirement from the algorithm here), making this question moot. Steve and I decided against including it in the initial spec because:Let me know if that addresses your concerns.