DRIVERS-3239: Add exponential backoff to operation retry loop for server overloaded errors #1862

baileympearson · 2025-12-02T17:15:50Z

Overview

This PR adds support for a new class of errors (SystemOverloadedError) to drivers' operation retry logic, as outlined in the design document.

Additionally, it includes a new argument to the MongoDB handshake (also defined in the design document).

Python will be the second implementer.
Node implementation: mongodb/node-mongodb-native#4806

Testing

The testing strategy is two-fold:

Building off of Ezra's work to generate unified tests for retryable handshake errors, this PR generates unified tests to confirm that:
- operations are retried using the new SystemOverloadedError label
- operations are retried no more than 5 (current MAX_ATTEMPTS, as defined in the spec) times
Following Iris's work in DRIVERS-1934: withTransaction API retries too frequently #1851, this PR adds a prose test that ensures drivers apply exponential backoff in the retryability loop.
Update changelog.
Test changes in at least one language driver.
Test these changes against all server versions and topologies (including standalone, replica set, and sharded
clusters).

- add prose test
- add assertions on the number of retries for maxAttempts tests
- don't run clientBulkWrite tests on <8.0 servers

source/logging/logging.md

source/client-backpressure/client-backpressure.md

source/retryable-reads/retryable-reads.md

source/retryable-writes/retryable-writes.md

source/client-backpressure/client-backpressure.md

blink1073 · 2025-12-03T16:31:19Z

It looks like you also need to bump the schema version:

source/client-backpressure/tests/backpressure-retry-loop.yml invalid
[
  {
    instancePath: '/tests/0/operations/3/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3
source/client-backpressure/tests/backpressure-retry-max-attempts.yml invalid
[
  {
    instancePath: '/tests/0/operations/1/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3

blink1073 · 2025-12-03T21:05:18Z

WIP Python implementation: mongodb/mongo-python-driver#2635

blink1073 · 2025-12-03T23:11:38Z

All unified and prose tests are passing in the Python implementation.

Edit: we're still failing one unified test, "client.clientBulkWrite retries using operation loop", investigating...

Edit 2: we're all good now

jyemin

I only reviewed the specification changes, not the pseudocode or tests. Those are best reviewed by implementers.

source/client-backpressure/client-backpressure.md

jyemin · 2025-12-10T19:37:07Z

source/client-backpressure/client-backpressure.md

+
+## Changelog
+
+- 2025-XX-XX: Initial version.


Not sure how we handle the date... Is there an automation for this?

Not that I know of. Usually the spec author fills it out before merging

I'll just leave this thread open to remind myself to add changelog dates before merging once all changes are completed.

source/client-backpressure/tests/README.md

source/client-backpressure/client-backpressure.md

source/retryable-writes/retryable-writes.md

source/client-backpressure/client-backpressure.md

stIncMale · 2025-12-26T19:34:14Z

source/client-backpressure/client-backpressure.md

+        retry budget tracking, this counts as a success.
+4. A retry attempt will only be permitted if:
+    1. The error has both the `SystemOverloadedError` and the `RetryableError` label.
+    2. We have not reached `MAX_ATTEMPTS`.


Judging both from the name of MAX_ATTEMPTS and from this sentence, a reader develops the understanding that MAX_ATTEMPTS is the maximum number of attempts, which includes the first attempt that is not a retry attempt. However, below, the specification says "Any retryable error is retried at most MAX_ATTEMPTS (default=5) times", which means that MAX_ATTEMPTS is the maximum number of retry attempts, implying that the maximum number of attempts is actually 6.

If MAX_ATTEMPTS is meant to specify the maximum number of retry attempts, let's name it accordingly: MAX_RETRY_ATTEMPTS.

There is still a place left where client-backpressure.md uses MAX_ATTEMPTS.
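
To make the off-by-one concrete, here is a minimal sketch of the two readings of the limit. The function name `total_attempts` and the flag `limit_counts_first_attempt` are hypothetical and exist only for this illustration; they are not taken from the specification or from any driver.

```python
def total_attempts(limit: int, limit_counts_first_attempt: bool) -> int:
    """Return how many times a command that always fails gets executed.

    limit_counts_first_attempt=True  models "maximum number of attempts".
    limit_counts_first_attempt=False models "maximum number of retry attempts".
    """
    attempts = 0
    retries = 0
    while True:
        attempts += 1  # execute the command; in this sketch it always fails
        used = attempts if limit_counts_first_attempt else retries
        if used >= limit:
            return attempts  # the limit is exhausted, stop retrying
        retries += 1  # one more retry will be made


attempts_reading = total_attempts(5, limit_counts_first_attempt=True)
retries_reading = total_attempts(5, limit_counts_first_attempt=False)
```

With a limit of 5, the first reading executes the command 5 times in total, and the second executes it 6 times (1 initial attempt plus 5 retries), which is the discrepancy described above.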

source/client-backpressure/client-backpressure.md

stIncMale · 2025-12-27T19:48:15Z

source/client-backpressure/client-backpressure.md

+error and that it is retryable, including those not currently considered retryable such as updateMany, create
+collection, getMore, and generic runCommand. The new command execution method obeys the following rules:
+
+1. If the command succeeds on the first attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE` tokens.


The specification currently uses various terms to refer to the same thing¹ when it talks about retry tokens (hopefully, I identified them all, but the author should do the exercise on his own to verify):

return, deposit, consume

deposit, return

We should pick a single term for each meaning and use it consistently. Having clear and consistent terminology is a necessity for clear communication; it also makes the specification searchable: if one wants to find all the places that talk about consuming retry tokens, one can search the page for a specific term. Currently, that is impossible.

¹ Our specifications in general, as well as the official documentation the company produces, suffer from this significant issue.

Good point - I've aligned the language in the spec on deposit and consume.

client-backpressure.md still uses "acquire" instead of "consume" in "A token can be acquired from the token bucket".
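
For readers following the terminology discussion, a token bucket with exactly two verbs might look like the sketch below. The class `RetryTokenBucket`, its capacity, its initial fill, and the value given to `RETRY_TOKEN_RETURN_RATE` are all assumptions made for illustration; the thread does not state the real values or the real data structure.

```python
class RetryTokenBucket:
    """Hypothetical token bucket exposing only `deposit` and `consume`."""

    def __init__(self, capacity: float, initial: float) -> None:
        self.capacity = capacity
        self.tokens = min(initial, capacity)

    def deposit(self, amount: float) -> None:
        """Add tokens to the bucket, never exceeding its capacity."""
        self.tokens = min(self.capacity, self.tokens + amount)

    def consume(self, amount: float = 1.0) -> bool:
        """Remove tokens if enough are available; report whether it worked."""
        if self.tokens < amount:
            return False
        self.tokens -= amount
        return True


RETRY_TOKEN_RETURN_RATE = 0.5  # assumed value, for illustration only

bucket = RetryTokenBucket(capacity=10.0, initial=1.0)
first_retry_allowed = bucket.consume()   # one token was available
second_retry_allowed = bucket.consume()  # the bucket is now empty
bucket.deposit(RETRY_TOKEN_RETURN_RATE)  # e.g. after a first-attempt success
```

Using one verb per direction means a search for "consume" or "deposit" finds every rule that touches the bucket, which is the searchability argument made in this thread.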

source/client-backpressure/client-backpressure.md

stIncMale · 2025-12-27T20:20:56Z

source/client-backpressure/client-backpressure.md

+- Only retryable errors with the `SystemOverloadedError` label apply backoff and jitter.
+- All retryable errors apply backoff if they also contain a `SystemOverloadedError` label. This includes:


Don't these two sentences say the same? If yes, let's leave only one of them.

No, I don't think so. The first says that only errors with that label back off (no other errors back off); the second mandates that all errors with that label apply the backoff.

Sorry, the way I formulated the question was wrong. I am reformulating my suggestion below.

The current text is

- Only errors with the `SystemOverloadedError` label apply backoff.
- All retryable errors apply backoff if they also contain a `SystemOverloadedError` label.

If I understand the meaning of the two items above correctly, the following single item has the same meaning

- When a failed attempt is retried, backoff must be applied if and only if the attempt error has the `SystemOverloadedError` label.

Let's simplify and use one item instead of two. Note also that in the proposed text I use "has" instead of "contains" when talking about a label, because the rest of client-backpressure.md uses "has" for this purpose.

Sure - done.
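
The "if and only if" in the merged item reduces to a single predicate. The sketch below is illustrative only; `must_apply_backoff` is a hypothetical helper name, and error labels are modeled as plain strings.

```python
from typing import Iterable

SYSTEM_OVERLOADED = "SystemOverloadedError"


def must_apply_backoff(error_labels: Iterable[str]) -> bool:
    """Backoff applies to a retried attempt exactly when its error has the label."""
    return SYSTEM_OVERLOADED in set(error_labels)


overloaded = must_apply_backoff(["RetryableError", "SystemOverloadedError"])
not_overloaded = must_apply_backoff(["RetryableError"])
```

Both directions of the rule are covered: an error with the label always backs off, and an error without it never does, whatever other labels it has.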

stIncMale · 2025-12-27T21:35:59Z

source/client-backpressure/client-backpressure.md

+- All retryable errors apply backoff if they also contain a `SystemOverloadedError` label. This includes:
+    - Errors defined as retryable in the retryable reads specification.
+    - Errors defined as retryable in the retryable writes specification.
+    - Errors with the `RetryableError` label.


Given that a command may be retried either because of the read/write retry policy, or because of the overload retry policy, what does the i mean in the formula for delayMS? Is it the 1-based retry attempt number among all the retry attempts of a command, or only among the subset caused by the overload retry policy?

The answer to this question should be in the "Interaction with Existing Retry Behavior" section, not in the "Overload retry policy" section.

Okay, I've clarified this. But I chose to put it in the "overload retry policy" section because it makes sense to me to put this information with the definition of i and the exponential backoff formula.

it makes sense to me to put this information with the definition of i and the exponential backoff formula

If this approach were applied consistently to client-backpressure.md, all items from the "Interaction with Existing Retry Behavior" section would have been put in the "Overload retry policy" section, because all those items necessarily talk about something from the "Overload retry policy". However, that's not the case, and putting "Note that i includes retries for non-overloaded errors." in the "Overload retry policy" section not only goes against the current structure of the specification, but also confuses the reader, because the "Overload retry policy" section allows retries only for "overloaded" errors.

I appreciate this input, but I disagree. The note here is relevant to the definition of a variable defined here, the other behaviors mentioned in the "interaction with existing retry policies" section are not definitional.

I am not going to make this suggestion unless you consider this comment to be blocking for this PR to merge, in which case we can discuss. I consider this thread resolved.
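
The question about `i` can be illustrated with a small sketch. The exact delayMS formula is not quoted in this thread, so the capped exponential below, the constants `BASE_BACKOFF_MS` and `MAX_BACKOFF_MS`, and the helper names are all assumptions; only the counting rule ("i includes retries for non-overloaded errors") comes from the discussion.

```python
from typing import List, Optional

BASE_BACKOFF_MS = 100    # assumed value, not from the spec
MAX_BACKOFF_MS = 10_000  # assumed value, not from the spec


def backoff_ms(i: int, jitter: float) -> float:
    """Assumed formula: jittered, capped exponential in the 1-based retry number i."""
    return jitter * min(MAX_BACKOFF_MS, BASE_BACKOFF_MS * 2 ** (i - 1))


def delays_for(failures: List[bool], jitter: float = 1.0) -> List[Optional[float]]:
    """Compute the delay before each retry of one command execution.

    failures[k] is True when the failed attempt had the SystemOverloadedError
    label. None means the retry is made without backoff.
    """
    delays: List[Optional[float]] = []
    for i, overloaded in enumerate(failures, start=1):
        # i counts every retry, including those caused by non-overload errors.
        delays.append(backoff_ms(i, jitter) if overloaded else None)
    return delays


example = delays_for([False, True, True])
```

In `example`, the first retry follows a non-overload error and has no delay, yet it still advances `i`, so the first overload backoff is computed with `i = 2` rather than `i = 1`. A real implementation would draw the jitter randomly; it is a parameter here to keep the sketch deterministic.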

stIncMale · 2025-12-27T21:50:57Z

source/client-backpressure/client-backpressure.md

+- All retryable errors apply backoff if they also contain a `SystemOverloadedError` label. This includes:
+    - Errors defined as retryable in the retryable reads specification.
+    - Errors defined as retryable in the retryable writes specification.
+    - Errors with the `RetryableError` label.
+- Any retryable error is retried at most MAX_ATTEMPTS (default=5) times, if any attempts has failed with a
+    `SystemOverloadedError`.


The "RetryableError label" section above says "An error is considered retryable if it includes the "RetryableError" label". After reading that, it is only reasonable to understand that a "retryable error" is an error with the RetryableError label. However, in the selected items the specification seemingly uses the same term "retryable error" with a different meaning. This needs to be fixed.

I don't think that the sections "RetryableError label" and "SystemOverloadedError label" defining the terms "retryable error" and "overloaded error" help a reader, and neither do the terms themselves. Instead, I propose that the specification always talk about an error having the RetryableError, SystemOverloadedError labels, which it already does in many places.

#1862 (comment) is related, but different.

I don't fully agree about the definitions not having value. I'll keep them, unless you feel strongly, because it familiarizes readers with new terminology introduced by this specification.

I have clarified the definition of RetryableError label and the phrasing here to avoid ambiguity though. I believe this addresses the concern.

I don't fully agree about the definitions not having value. I'll keep them, unless you feel strongly, because it familiarizes readers with new terminology introduced by this specification.

The only way I see to interpret this argument is that definitions familiarize readers with the new terms they define. That is trivially true, but has nothing to do with those terms having value.

If it's the descriptions of the RetryableError, SystemOverloadedError error labels that you care about, then I agree, we need them. But those descriptions introducing the new terms "retryable error", "overloaded error" is a separate concern, and I am arguing that these new terms mostly introduce confusion, and reduce searchability of the specification. To search for the same thing, one must search for the RetryableError and also for the "retryable error", the same is true for SystemOverloadedError/"overloaded error", because client-backpressure.md uses both ways instead of either always referring only to error labels, or always using the "retryable error", "overloaded error" terms: client-backpressure.md uses "overloaded error" 3 times, and refers to the SystemOverloadedError label more than 20 times (not always using the same style, which I commented about in #1862 (comment)).

I have clarified the definition of RetryableError label and the phrasing here to avoid ambiguity though. I believe this addresses the concern.

Those changes (lines 38-41, and lines 123-128) do not address my concerns expressed in this thread.

I appreciate the input here but I disagree.

The term "retryable error" is no longer defined or used in the terms section. The "overload error" section provides information to driver engineers about the general class of errors that drivers encounter when the server is overloaded.

Regarding the usage of "overloaded error" versus using precise label terminology: I prefer having both because the precise definition (using the label) is not always required. For example, in the summary:

This specification adds the ability for drivers to automatically retry requests that fail due to server overload errors
while applying backpressure to avoid further overloading the server.

This reads more naturally and better imo. I believe I've ensured that everywhere where precision is required (i.e., in algorithmic prose) we use SystemOverloadedError specifically.

I do not intend to remove this section further. If there are outstanding concerns here that block the PR from merging, please let me know and we can discuss further.

stIncMale · 2025-12-27T22:11:35Z

source/client-backpressure/client-backpressure.md

+- Any retryable error is retried at most MAX_ATTEMPTS (default=5) times, if any attempts has failed with a
+    `SystemOverloadedError`.


I fail to understand what this requirement means. For example, "Any retryable error is retried" is very confusing, since we are trying to execute a command, not trying to execute/produce an error. The current wording also implies that the retry attempts are counted per each specific error, which can't be true, as they are counted per, for example, an execution of a command.

Does the following express the intended meaning?

Once we encounter such a command execution error with the SystemOverloadedError label that it permits a retry attempt according to the overload retry policy, the number of retry attempts for the particular command execution becomes limited by MAX_ATTEMPTS regardless of which retry policy the previous or the future retry attempts are caused by.

If "yes", then let's change the wording.

Note that the above means a command may still be retried more than MAX_ATTEMPTS times in the presence of errors having the SystemOverloadedError label. For example, if a command was retried 6 times according to the read/write retry policy with CSOT, and the 6th retry attempt (it's a 1-based index) failed with an error having the SystemOverloadedError label.

That's correct - done.

Note that the above means a command may still be retried more than MAX_ATTEMPTS times in the presence of errors having the SystemOverloadedError label. For example, if a command was retried 6 times according to the read/write retry policy with CSOT, and the 6th retry attempt (it's a 1-based index) failed with an error having the SystemOverloadedError label.

Yes, that's correct.

Thank you, the new wording is much clearer, but still suffers from one of the previous issues. The new wording is

Any command is retried at most MAX_ATTEMPTS (default=5) times, if any attempt has failed with a SystemOverloadedError, regardless of which retry policy the current or future retry attempts are caused by.

As before, if a command was retried 6 times according to the read/write retry policy with CSOT, and the 6th retry attempt (it's a 1-based index) failed with an error having the SystemOverloadedError label, then the requirement is violated, and there is nothing an implementation could do to strictly adhere to the requirement.

Sorry; I don't follow. If a command is retried 6 times for CSOT and the 6th retry fails with a retryable overload error, we don't retry because we have reached MAX_RETRIES.

Can you clarify the concern?
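
The scenario under discussion can be written down as a small decision function. This is a sketch of one reading of the quoted wording, not the specification's pseudocode; `retry_permitted` and its parameters are hypothetical names.

```python
MAX_ATTEMPTS = 5  # default quoted in this thread


def retry_permitted(retries_so_far: int, overload_seen: bool,
                    some_policy_allows: bool) -> bool:
    """Decide whether one more retry may be made for a command execution.

    retries_so_far: retries already performed, under any retry policy.
    overload_seen: True once any attempt failed with SystemOverloadedError.
    some_policy_allows: whether the read, write, or overload retry policy
        would otherwise permit retrying the current error.
    """
    if not some_policy_allows:
        return False
    if overload_seen:
        # The cap applies regardless of which policy caused earlier retries.
        return retries_so_far < MAX_ATTEMPTS
    return True


# Six retries were already made under CSOT, then the 6th fails with overload.
after_sixth_overload = retry_permitted(6, overload_seen=True, some_policy_allows=True)
```

Here `after_sixth_overload` is `False`, so no further retry happens, which matches the author's reply. The command has nevertheless already been retried 6 times, more than `MAX_ATTEMPTS`, which is the literal violation of "retried at most MAX_ATTEMPTS times" that the reviewer points out.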

stIncMale · 2025-12-27T22:40:37Z

source/client-backpressure/client-backpressure.md

+#### RetryableError label
+
+An error is considered retryable if it includes the "RetryableError" label. This error label indicates that an operation
+is safely retryable regardless of the type of operation, its metadata, or any of its arguments.


crud/bulk-write.md also talks about a "retryable error", but means something very different from what "retryable error" means here.

Let's not overload this term by re-defining it with a different meaning here, as overloaded terms introduce confusion and have no benefits.
1.1. If a term has to be overloaded nonetheless, which I doubt, let's at the very least make the changes to crud/bulk-write.md necessary to reduce confusion.

The term "retryable error" is also defined in transactions/transactions.md. Fortunately, the meaning there does not need to be clarified.

DRIVERS-3239: Add exponential backoff to operation retry loop for server overloaded errors #1862 (comment) is related, but different.

Overloading this term makes sense imo because we now have a very concrete definition of the phrase "retryable error" (i.e., the error has the RetryableError label attached). I was not aware of the phrasing in the bulk write spec - I will update this spec as well.

1

Overloading this term makes sense imo because we now have a very concrete definition of the phrase "retryable error" (i.e., the error has the RetryableError label attached).

This "very concrete definition" is one more meaning (different from the previous) for the existing term; quite a confusing one, given that the mere fact of an error having the RetryableError label does not, at least for now, make it eligible for retry under any retry policy. Overloaded terms, especially when used within the same context with different meanings, can only cause confusion. In the current PR, only within client-backpressure.md, the "retryable error" is used with at least four different meanings:

An error with the RetryableError label

https://github.com/baileympearson/specifications/blob/9e4ad7093283a6e8b3978955b72ade7eb205229e/source/client-backpressure/client-backpressure.md?plain=1#L38-L41

https://github.com/baileympearson/specifications/blob/9e4ad7093283a6e8b3978955b72ade7eb205229e/source/client-backpressure/client-backpressure.md?plain=1#L84-L85

A union of the following meanings: an error eligible for retry under the read retry policy, an error eligible for retry under the write retry policy, an error with the RetryableError label.

https://github.com/baileympearson/specifications/blob/9e4ad7093283a6e8b3978955b72ade7eb205229e/source/client-backpressure/client-backpressure.md?plain=1#L123-L126

A union of the following meanings: an error eligible for retry under the read retry policy, an error eligible for retry under the write retry policy, an error eligible for retry under the overload retry policy.

https://github.com/baileympearson/specifications/blob/9e4ad7093283a6e8b3978955b72ade7eb205229e/source/client-backpressure/client-backpressure.md?plain=1#L163

https://github.com/baileympearson/specifications/blob/9e4ad7093283a6e8b3978955b72ade7eb205229e/source/client-backpressure/client-backpressure.md?plain=1#L170-L171

An error eligible for retry under the overload retry policy.

https://github.com/baileympearson/specifications/blob/9e4ad7093283a6e8b3978955b72ade7eb205229e/source/client-backpressure/client-backpressure.md?plain=1#L260-L261

Someone might say the new "retryable error" meaning, harmful as it is, may also be useful because it is shorter to write than "an error with the RetryableError label". I would not agree that such shortening is worth the harm introduced, and even the specification itself (client-backpressure.md) refers to the RetryableError label more often than it uses the "retryable error" term in its "an error with the RetryableError label" meaning (8 times vs 5 times), which demonstrates that the new meaning of "retryable error" is not particularly useful.

The mix of meanings within a single specification described above is confusing, and we should eliminate it. Given that we now have multiple retry policies, the term "retryable error" should probably not be used at all, and be replaced with "an error eligible for retry under..." (with "..." specifying one or multiple retry policies).

2

The term "retryable error" is still used in crud/bulk-write.md without a clarification about its meaning. The future reader, who will try to understand what the meaning is, will have to choose from all the meanings mentioned above. I am asking again to make the changes to crud/bulk-write.md necessary to reduce this confusion.

I do agree that defining a new "retryable error" term with a specific meaning is confusing. I have reverted the definition of "retryable error" in this specification.

All places that refer specifically to errors with the RetryableError label now do so explicitly, and the general "retryable error" term is used when discussing retryability generally. For that reason, I also have opted not to update the phrasing in the bulk write specification, because the way it is used aligns with the usages of "retryable error" in this specification.

Can we consider this thread resolved?

source/client-backpressure/client-backpressure.md

stIncMale · 2025-12-28T16:47:17Z

source/transactions/transactions.md


+#### Backpressure Error
+
+An error considered retryable by the [Client Backpressure Specification](../client-backpressure/client-backpressure.md).


The client backpressure specification says that "an error is considered retryable if it includes the "RetryableError" label". So there, an error with the "RetryableError" label is called a "retryable error", but here, the same error is called a "backpressure error". Not only is it proposed that the "retryable error" term be overloaded (see #1862 (comment)), but it is also proposed that multiple terms ("retryable error", "backpressure error") refer to the same exact thing. We should fix this:

we should use a single term to refer to the same thing;

we should avoid overloading terms.

P.S. I won't be surprised if the intent here was to express that a backpressure error is an error that has both the RetryableError and the SystemOverloadedError labels. But that is definitely not what the specification currently expresses.

No; backpressure errors are distinct from retryable backpressure errors. The error labels might always exist together now but the server has intentionally chosen to keep these error labels separate and might choose to only apply one label or the other in the future.

I've clarified this spec to use the phrase "retryable backpressure error".

Changing the term from "backpressure error" to "retryable backpressure error" does not address the concern. Moreover, given that now the words "error considered retryable" are no longer present in client-backpressure.md, there are at least four possible answers to which errors are "considered retryable by the Client Backpressure Specification"¹.

This ambiguity could be fixed by defining "retryable backpressure error" as an error eligible for a retry under the overload retry policy (with "overload retry policy" being a hyperlink to the "Overload retry policy" section in client-backpressure.md), assuming that it is indeed the intended meaning of the term.

¹ See #1862 (comment) for the details on those four possible meanings and links to concrete places in client-backpressure.md where those meanings are used when mentioning a retryable error.

That is already the definition of retryable backpressure error:

Retryable Backpressure Error

An error considered retryable by the Client Backpressure Specification.

Does that resolve this concern?

source/transactions/transactions.md

stIncMale · 2025-12-28T18:11:45Z

source/transactions/transactions.md

+Drivers SHOULD apply a majority write concern when retrying commitTransaction to guard against a transaction being
 applied twice.

+Drivers SHOULD NOT modify the write concern on commit transaction commands when retrying a backpressure error, since we
+are sure that the transaction has not been applied.


I don't think we know whether the transaction has or has not been applied. If, for example, in the scenario below step 4 fails with a backpressure error, it is obviously false that the transaction has not been applied.

I think, this change should be reverted.

I believe this change makes sense but I've adjusted the phrasing. I'm not sure it makes sense as it was written. Hopefully it's clearer now.

The new wording is (see lines 1066-1068)

Drivers SHOULD NOT modify the write concern on commit transaction commands when retrying a retryable backpressure error. A retryable backpressure error indicates no work was performed by the server, and the rationale outlined in this section for using majority write concern on retries is therefore irrelevant.

The same problem I pointed out originally still applies. While it is true that retryable backpressure error indicates no work was performed by the server¹ as a result of the attempt to execute a command that failed with this error, it does not mean that no work was performed by the server as a result of all completed attempts to execute the command. The latter is required for the claim "the rationale outlined in this section for using majority write concern on retries is therefore irrelevant" to be correct. The example I originally briefly described also demonstrates that the claim is incorrect, as far as my understanding goes.

¹ This is assuming the meaning of "retryable backpressure error" is as follows: an error eligible for a retry under the overload retry policy. The meaning is currently unclear, as I explained in #1862 (comment).

There is a misunderstanding here - this change does not change the behavior when a driver encounters a retryable error that is not an overload error. Drivers still use w: majority for retries if a non-backpressure retryable error is encountered. Does that make sense?
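
One way to reconcile the two positions is to make the upgrade to majority "sticky" across attempts. The sketch below is only one possible reading, not what the specification states; `commit_retry_write_concerns` and the `"backpressure"` / `"other"` error kinds are hypothetical names used for illustration.

```python
from typing import List, Union

WriteConcernW = Union[int, str]


def commit_retry_write_concerns(error_kinds: List[str],
                                original_w: WriteConcernW) -> List[WriteConcernW]:
    """Return the `w` used for the retry that follows each failed commit attempt.

    error_kinds[k] is "backpressure" for a retryable backpressure error and
    "other" for any other retryable error.
    """
    upgraded = False
    result: List[WriteConcernW] = []
    for kind in error_kinds:
        if kind != "backpressure":
            # The commit may have been applied, so stay at majority from now on.
            upgraded = True
        result.append("majority" if upgraded else original_w)
    return result


only_backpressure = commit_retry_write_concerns(["backpressure", "backpressure"], 1)
mixed = commit_retry_write_concerns(["other", "backpressure"], 1)
```

In `only_backpressure` the write concern is never modified, as the author describes. In `mixed`, a later backpressure error does not revert the retry to the original write concern, because an earlier attempt may already have performed work on the server, which is the reviewer's scenario.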

stIncMale

Notes for myself:

The last reviewed commit is 1b7f6df.
I reviewed only the changes in .md files, and do not plan to review the tests.
I have not re-reviewed the pseudocode illustrating the overload retry policy. I will do that after us settling on the requirements expressed in the prose part of the specification.

baileympearson

First round of comments addressed.

baileympearson · 2026-01-05T20:24:15Z

source/client-backpressure/client-backpressure.md

+        retry budget tracking, this counts as a success.
+4. A retry attempt will only be permitted if:
+    1. The error has both the `SystemOverloadedError` and the `RetryableError` label.
+    2. We have not reached `MAX_ATTEMPTS`.


source/client-backpressure/client-backpressure.md

baileympearson · 2026-01-05T20:40:05Z

source/client-backpressure/client-backpressure.md

+- Any retryable error is retried at most MAX_ATTEMPTS (default=5) times, if any attempts has failed with a
+    `SystemOverloadedError`.


That's correct - done.

Note that the above means a command may still be retried more than MAX_ATTEMPTS times in the presence of errors having the SystemOverloadedError label. For example, if a command was retried 6 times according to the read/write retry policy with CSOT, and the 6th retry attempt (it's a 1-based index) failed with an error having the SystemOverloadedError label.

Yes, that's correct.

baileympearson · 2026-01-05T20:42:54Z

source/client-backpressure/client-backpressure.md

+error and that it is retryable, including those not currently considered retryable such as updateMany, create
+collection, getMore, and generic runCommand. The new command execution method obeys the following rules:
+
+1. If the command succeeds on the first attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE` tokens.


Good point - I've aligned the language in the spec on deposit and consume.

source/client-backpressure/client-backpressure.md

baileympearson · 2026-01-05T20:47:17Z

source/client-backpressure/client-backpressure.md

+#### RetryableError label
+
+An error is considered retryable if it includes the "RetryableError" label. This error label indicates that an operation
+is safely retryable regardless of the type of operation, its metadata, or any of its arguments.


Overloading this term makes sense imo because we now have a very concrete definition of the phrase "retryable error" (i.e., the error has the RetryableError label attached). I was not aware of the phrasing in the bulk write spec - I will update this spec as well.

source/client-backpressure/client-backpressure.md

jmikola

Leaving some more notes about formatting, which you can take or leave as you wish.

The hand-written client-backpressure tests also appear to use inconsistent formatting for whitespace, but it's all visual and I doubt anyone else cares about that so I'll spare you diff comments.

jmikola · 2026-01-06T15:39:06Z

source/client-backpressure/tests/backpressure-retry-loop.yml

+    description: 'client.listDatabases retries using operation loop'
+    operations:      
+
+


I realize this file is generated from backpressure-retry-loop.yml.template, but the whitespace here is a bit odd.

jmikola · 2026-01-06T15:40:35Z

source/client-backpressure/tests/backpressure-retry-loop.yml.template

+          name: "x_11"
+        {%- endif %}      
+
+


The whitespace here is probably what's introducing extra newlines into generated files.

jmikola · 2026-01-06T15:41:16Z

source/client-backpressure/tests/backpressure-retry-loop.yml.template

+
+      -
+        object: *{{operation.object}}
+        name: {{operation.operation_name}}


Consider swapping name and object lines here and putting name on the same line as the array element dash character for consistency.

jmikola · 2026-01-06T15:43:13Z

source/client-backpressure/tests/README.md

+
+    5. Execute step 3 again.
+
+    6. Compare the two time between the two runs.


I assume the following lines should all appear under element six. If so, indentation is needed (check preview).

jmikola · 2026-01-06T15:44:15Z

source/client-backpressure/tests/getMore-retried.yml

+          filter: {}
+          # ensure stable ordering of result documents
+          sort: { a: 1 }
+        object: *collection


name and object are typically included side by side for readability.

Co-authored-by: Ferdinando Papale <[email protected]>

stIncMale

Notes for myself:

The last reviewed commit is d4d0b38.
- Since the last review, the changes consist of this diff as well as d4d0b38, with the latter not changing anything I am reviewing.
I reviewed only the changes in .md files, and do not plan to review the tests.
I have not re-reviewed the pseudocode illustrating the overload retry policy. I will do that after us settling on the requirements expressed in the prose part of the specification.

stIncMale · 2026-01-07T22:45:41Z

source/client-backpressure/client-backpressure.md

+
+#### RetryableError label
+
+This error label indicates that an command is safely retryable regardless of the command type (read or write), its


"an command" -> "a command"

Suggested change

This error label indicates that an command is safely retryable regardless of the command type (read or write), its

This error label indicates that a command is safely retryable regardless of the command type (read or write), its

stIncMale · 2026-01-07T23:20:08Z

source/client-backpressure/client-backpressure.md

+#### Interaction with Existing Retry Behavior
+
+The retry policy in this specification is separate from the existing retryability policies defined in the


This is a follow-up to #1862 (comment), where we agreed to use "policy" instead of "API"/"behavior" when referring to a logically isolated set of retry rules, because that is the term this specification uses in the "Overload retry policy" section.

"Interaction with Existing Retry Behavior"
1.1. We should say "policy" instead of "behavior".
1.2. "Existing" is out of place here. For someone who reads this specification, all the retry policies (read, write, overload) will exist. Instead, we should say something like "Interaction with Read and Write Retry Policies", or "Interaction with Other Retry Policies".

[optional] Let's simplify "The retry policy in this specification" to "The overload retry policy".

The above consideration about "existing" also applies to "existing retryability policies".

The wordings "retry policy" and "retryability policy" coexist here, but mean the same thing. Let's say "retry policy" instead of "retryability policy", given that above the specification says "Overload retry policy".

P.S. Suggestions marked with [optional] mean that I will accept a refusal, whatever the reasons, as long as the suggestion was considered.

Sure; done.

stIncMale · 2026-01-07T23:28:55Z

source/client-backpressure/client-backpressure.md

+The following pseudocode demonstrates the unified retry behavior, combining the overload retry policy defined in this
+specification with the existing retry behaviors from [Retryable Reads](../retryable-reads/retryable-reads.md) and


This is a follow-up to #1862 (comment), where we agreed to use "policy" instead of "API"/"behavior" when referring to a logically isolated set of retry rules, because that is the term this specification uses in the "Overload retry policy" section.

Let's replace "retry behaviors" with "retry policies".

The consideration about "existing" from #1862 (comment) applies here. Let's replace "existing retry behaviors from..." with "retry policies from...".

Sure; done.

stIncMale · 2026-01-07T23:44:31Z

source/client-backpressure/client-backpressure.md

+The maximum retry attempt logic in this specification balances legacy retryability behavior with load-shedding behavior:
+
+- Relying on either 1 or infinite timeouts (depending on CSOT) preserves existing retry behavior.


This is a follow-up to #1862 (comment), where we agreed to use "policy" instead of "API"/"behavior" when referring to a logically isolated set of retry rules, because that is the term this specification uses in the "Overload retry policy" section.

Let's use "policy" instead of "behavior" here.

I assume "legacy retryability behavior" here has the same meaning as "existing retryability policies" above. If this is correct, the consideration about "existing" from #1862 (comment) applies here. Let's replace "existing" with "read and write".

Both wordings "retryability behavior" and "retry behavior" are used here. Let's replace "retryability behavior" with "retry policy".

This usage of "behavior" is not the same as the other usages. This text does not refer to a specific policy, but rather a general "behavior". There are a number of usages of "behavior" in this manner in this specification. "preserves existing retry policy" does not make sense.

I generally agree with your comments about the usage of "existing". However, in the Q&A section I think the current phrasing makes sense because it draws a comparison between an old and a new state and serves as a design rationale. There are other places where we use this language in Q&A sections in the specifications repo.

I have rephrased this sentence slightly - I do not plan to remove existing or behavior here and I consider this thread resolved. If this comment blocks the PR from merging, let me know and we can discuss.

stIncMale · 2026-01-07T23:48:55Z

source/client-backpressure/client-backpressure.md

+
+The maximum retry attempt logic in this specification balances legacy retryability behavior with load-shedding behavior:
+
+- Relying on either 1 or infinite timeouts (depending on CSOT) preserves existing retry behavior.


I fail to understand what "either 1 or infinite timeouts (depending on CSOT)" means here. This may be a typo or ill-formed text.

Most likely because of the above, I fail to understand the whole sentence "Relying on either 1 or infinite timeouts (depending on CSOT) preserves existing retry behavior." Let's express the intended thought clearly.

Fixed. This is supposed to say "infinite retries".
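
For readers following this thread, the corrected sentence can be read as a small decision rule. The sketch below is only an illustration of the prose quoted in this PR, not the specification's pseudocode: the function name `retry_budget` and its two boolean parameters are hypothetical, and it assumes that "1 or infinite retries (depending on CSOT)" means one retry when `timeoutMS` is not configured and an unbounded number (within the timeout) when it is.

```python
import math

# From the requirement quoted in this PR: MAX_ATTEMPTS (default=5).
MAX_ATTEMPTS = 5


def retry_budget(timeout_ms_configured: bool, saw_overload_error: bool) -> float:
    """Return the maximum number of retries for one operation.

    Hypothetical helper: both parameters are illustrative names, not
    identifiers taken from the specification.
    """
    if saw_overload_error:
        # Any attempt failing with SystemOverloadedError caps retries,
        # even when CSOT would otherwise allow unlimited retries.
        return MAX_ATTEMPTS
    # Read/write retry policies on their own: unlimited within the
    # timeout under CSOT, otherwise a single retry.
    return math.inf if timeout_ms_configured else 1
```

Under these assumptions the overload cap takes precedence over both the single-retry and the CSOT cases, which is the "balance" the Q&A sentence describes.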

stIncMale · 2026-01-09T11:52:01Z

source/client-backpressure/client-backpressure.md

I believe all of these questions are now reflected in...the prose...

I don't think 1 and 4 have been answered (whether the answer to question 1 is important depends on what the answer to question 4 is).

I also realized that at least one more thing is missing:

The retryable-writes.md specification instructs a driver to modify the command when retrying under the write retry policy, while transactions.md instructs the opposite: not to modify txnNumber when retrying under the write retry policy. I did not find a place specifying that the same command modifications, or the lack thereof, are to be applied when retrying under the overload retry policy. I believe the right place to specify this requirement, given the current structure of the specifications, is the "Overload retry policy" section in client-backpressure.md. When specifying it, we should not duplicate what is said in the retryable-writes.md and transactions.md specifications, but instead refer to them, saying that the command has to be modified, or not, in accordance with them.

stIncMale · 2026-01-09T12:10:17Z

source/client-backpressure/client-backpressure.md

+- Any command is retried at most MAX_ATTEMPTS (default=5) times, if any attempt has failed with a
+    `SystemOverloadedError`, regardless of which retry policy the current or future retry attempts are caused by.


Could we please add a note here, clarifying/emphasizing that this requirement means a command may be retried more than once under the read or write retry policy even when timeoutMS is not specified? It took me a long time to realize that this was the intent.

Instead of a note, I've adjusted the phrasing.
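
To make the intent discussed in this thread concrete, here is a minimal sketch of a retry loop with that property, for the case where `timeoutMS` is not configured. It is not the pseudocode from client-backpressure.md. `CommandError`, `run_with_retries`, `backoff_delay`, `BASE_BACKOFF_S` and `MAX_BACKOFF_S` are hypothetical names, and the backoff formula (exponential with full jitter) and its constants are assumptions, since this thread does not state them. The sketch also assumes that "retried at most MAX_ATTEMPTS times" means up to five retries after the initial attempt.

```python
import random
import time

MAX_ATTEMPTS = 5       # from the quoted requirement (default=5)
BASE_BACKOFF_S = 0.1   # assumption: not stated in this thread
MAX_BACKOFF_S = 10.0   # assumption: not stated in this thread


class CommandError(Exception):
    """Hypothetical error type carrying server error labels."""

    def __init__(self, labels):
        super().__init__(", ".join(sorted(labels)))
        self.labels = set(labels)


def backoff_delay(overload_failures, rng=random.random):
    """Exponential backoff with full jitter (the formula is an assumption)."""
    cap = min(MAX_BACKOFF_S, BASE_BACKOFF_S * 2 ** (overload_failures - 1))
    return rng() * cap


def run_with_retries(attempt_fn, is_retryable, sleep=time.sleep):
    """Call attempt_fn until it succeeds or the retry budget is spent.

    Without an overload error, a retryable error is retried once. After
    any attempt fails with SystemOverloadedError, the budget becomes
    MAX_ATTEMPTS retries, whichever policy causes the later retries.
    """
    retries = 0
    overload_failures = 0
    while True:
        try:
            return attempt_fn()
        except CommandError as err:
            overloaded = "SystemOverloadedError" in err.labels
            if overloaded:
                overload_failures += 1
            max_retries = MAX_ATTEMPTS if overload_failures else 1
            if not is_retryable(err) or retries >= max_retries:
                raise
            if overloaded:
                # Only overload errors wait before the next attempt.
                sleep(backoff_delay(overload_failures))
            retries += 1
```

Note how a single overload failure raises the budget for the rest of the operation: a later error that is retryable only under the read or write retry policy can then be retried more than once, which is the consequence this comment asks to be spelled out.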

stIncMale · 2026-01-09T12:16:01Z

source/client-backpressure/client-backpressure.md

+        - The value of `MAX_RETRIES` is 5 and non-configurable.
+        - This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within
+            the timeout to avoid retry storms.
+    3. (CSOT-only): `timeoutMS` has not expired.


CSOT rules are much more complex than simple "timeoutMS has not expired". They involve taking into account some other timeouts configurable by an application, refreshing timeoutMS, etc. Therefore, let's say here something along the lines of "(Only if timeoutMS is configured): There is still time for a retry attempt according to the "Client Side Operations Timeout" specification", where "Client Side Operations Timeout" specification is a hyperlink to the spec.

Sure, done.

stIncMale · 2026-01-09T12:28:10Z

source/transactions/transactions.md

This specification has the below requirements that currently are in conflict with client-backpressure.md.

commitTransaction is a retryable write command. Drivers MUST retry once after commitTransaction fails with a retryable error, including a handshake network error, according to the Retryable Writes Specification, regardless of whether retryWrites is set on the MongoClient or not.

and, when talking about abortTransaction:

If the operation times out or fails with a non-retryable error, drivers MUST ignore all errors from the abortTransaction command. Errors from abortTransaction are meaningless to the application because they cannot do anything to recover from the error.

retryable-writes.md and retryable-reads.md work around such conflicts by saying "Unless otherwise noted, the changes in this specification refer only to the retryability behaviors summarized above.", and letting client-backpressure.md specify the interactions. transactions.md does not use the same approach (I don't know if it should), and instead has its own "Interaction with Client Backpressure" section.

We should resolve these conflicts.

stIncMale · 2026-01-09T12:47:41Z

source/transactions/transactions.md

+In addition, drivers MUST NOT add the RetryableWriteError label to any error that occurs during a write command within a
+transaction (excepting commitTransation and abortTransaction), even when retryWrites has been enabled on the
+MongoClient, unless the server response is a retryable backpressure error.


Is it actually true that when retrying a command that failed with a retryable backpressure error within a transaction, the driver is to add the RetryableWriteError label to that error? Note that

retryable-writes.md does not have a similar update to its requirements about adding the RetryableWriteError label, and currently says "drivers MUST NOT add the RetryableWriteError label to any error that occurs during a write command within a transaction (excepting commitTransation and abortTransaction), even when retryWrites has been set to true on the MongoClient." (given the current structure and wording of the specifications, we should not change this requirement in retryable-writes.md; if the change is needed, it should be in client-backpressure.md)

nor does the client-backpressure.md say anything about RetryableWriteError.

P.S. I fail to understand what value the driver brings to applications by adding the undocumented RetryableWriteError label to errors. retryable-writes.md, including its section "Why does the driver only add the RetryableWriteError label to errors that occur on a MongoClient with retryWrites set to true?", does not shed any light on this.

I don't follow - the spec says "drivers MUST NOT" add the label. Can you clarify?

baileympearson added 8 commits December 1, 2025 13:44

initial commit

588f1f2

new files

e467f5b

add tests for handshake changes

d55fdb9

add generated tests

8e74b41

test fixes and add prose test

072b453

- add prose test - add assertions on the number of retries for maxAttempts tests - don't run clientBulkWrite tests on <8.0 servers

fix run on requirements

52e2a35

fix run on requirements?

391c951

fix CI

92501c0

baileympearson commented Dec 2, 2025

source/logging/logging.md

baileympearson commented Dec 2, 2025

source/client-backpressure/client-backpressure.md

baileympearson marked this pull request as ready for review December 2, 2025 18:59

baileympearson requested review from a team as code owners December 2, 2025 18:59

baileympearson requested review from jmikola and jyemin and removed request for a team December 2, 2025 18:59

blink1073 reviewed Dec 3, 2025

source/retryable-reads/retryable-reads.md

source/retryable-writes/retryable-writes.md

source/client-backpressure/client-backpressure.md

baileympearson added 3 commits December 3, 2025 11:36

comments

0fdef39

Fix broken unified tests

82acab8

fix UTR linting failures

b3a7b6c

blink1073 mentioned this pull request Dec 3, 2025

PYTHON-5528 & PYTHON-5651 Add exponential backoff to operation retry loop for server overloaded errors mongodb/mongo-python-driver#2635

Draft

11 tasks

remove broken deleteMany() from unified tests

60a87b8

add backwards compat section

399a56b

jyemin requested changes Dec 10, 2025

Jibola requested changes Dec 10, 2025

source/client-backpressure/tests/README.md

source/client-backpressure/client-backpressure.md

source/client-backpressure/client-backpressure.md

stIncMale reviewed Dec 11, 2025

source/retryable-writes/retryable-writes.md

stIncMale reviewed Dec 25, 2025

source/client-backpressure/client-backpressure.md

stIncMale requested changes Dec 27, 2025

stIncMale reviewed Dec 27, 2025

stIncMale reviewed Dec 28, 2025

source/client-backpressure/client-backpressure.md

stIncMale reviewed Dec 28, 2025

source/client-backpressure/client-backpressure.md

stIncMale reviewed Dec 28, 2025

source/transactions/transactions.md

stIncMale reviewed Dec 28, 2025

source/transactions/transactions.md

stIncMale reviewed Dec 28, 2025

source/transactions/transactions.md

stIncMale reviewed Dec 28, 2025

baileympearson commented Jan 5, 2026

first round of comments addressed

eac90fc

baileympearson requested a review from sanych-sun January 6, 2026 00:36

second round of comments addressed

0533acc

baileympearson requested a review from a team as a code owner January 6, 2026 00:59

baileympearson requested review from NoahStapp and stIncMale January 6, 2026 01:00

jyemin removed their request for review January 6, 2026 14:57

NoahStapp approved these changes Jan 6, 2026

papafe reviewed Jan 6, 2026

source/client-backpressure/client-backpressure.md

papafe reviewed Jan 6, 2026

source/client-backpressure/client-backpressure.md

jmikola approved these changes Jan 6, 2026

baileympearson and others added 3 commits January 7, 2026 13:54

Update source/client-backpressure/client-backpressure.md

629aec9

Co-authored-by: Ferdinando Papale <[email protected]>

Update source/client-backpressure/client-backpressure.md

9e4ad70

Co-authored-by: Ferdinando Papale <[email protected]>

jermery's comments - formatting of yml files and prose test description

d4d0b38

stIncMale requested changes Jan 9, 2026

Valentin's comments

530e727

baileympearson requested review from papafe and stIncMale January 9, 2026 22:05

		- Only retryable errors with the `SystemOverloadedError` label apply backoff and jitter.
		- All retryable errors apply backoff if they also contain a `SystemOverloadedError` label. This includes:

		- Any retryable error is retried at most MAX_ATTEMPTS (default=5) times, if any attempts has failed with a
		`SystemOverloadedError`.


		#### Backpressure Error

		An error considered retryable by the [Client Backpressure Specification](../client-backpressure/client-backpressure.md).

		description: 'client.listDatabases retries using operation loop'
		operations:

