@@ -284,6 +284,15 @@ Endpoint. The pool has the following properties:
- **Rate-limited:** A Pool MUST limit the number of [Connections](#connection) being
[established](#establishing-a-connection-internal-implementation) concurrently via the **maxConnecting**
[pool option](#connection-pool-options).
- **Backoff-capable:** A Pool MUST be able to enter backoff mode. A Pool will automatically enter backoff mode when a
  connection checkout fails under conditions that indicate server overload. The rules for entering backoff mode are as
  follows:

  - A network error or network timeout during the TCP handshake or the `hello` message for a new connection MUST
    trigger the backoff state.
  - Other pending connections MUST NOT be canceled.
  - In the case of multiple pending connections, the backoff attempt number MUST only be incremented once. This can be
    done by recording the state prior to attempting the connection.
Comment on lines +288 to +291

**Contributor:**
I was talking with @ShaneHarvey about this (related comments on the drivers ticket). Shane's understanding is that we decided to include all timeout errors, regardless of where they originated, during connection establishment. Does that match your understanding, Steve?

And related: the design says:

> After a connection establishment failure the pool enters the PoolBackoff state.

We should update the design with whatever the outcome of this thread is.

**Member Author:**
Yeah, that's a good callout, but it can't be from auth, since the auth spec explicitly calls out the timeout behavior. I'm assuming all drivers can distinguish between hello and auth since they are separate commands. I'll update it to say that if the driver can distinguish between TCP connect/DNS and the TLS handshake, then it MUST do so.

**Contributor:**
Which timeout behavior in the auth spec are you referring to? I searched for timeout and only saw stuff about what timeout values to use, but not how to handle network timeouts. Maybe I'm looking in the wrong place though.

> I'm assuming all drivers can distinguish between hello and auth since they are separate commands.

If we decide to omit network errors during authentication, I think that's a fine assumption.

**Member Author:**
Ah, the specs must have gotten out of sync. I'm referring to:

https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.md#why-mark-a-server-unknown-after-an-auth-error

> "The Authentication spec requires that when authentication fails on a server, the driver MUST clear the server's connection pool"

  While the Pool is in backoff, it exhibits the following behaviors:

  - **maxConnecting** MUST be set to 1.
  - The Pool MUST wait for the backoff duration before another connection attempt.
  - A successful heartbeat MUST NOT change the state of the pool.
  - A failed heartbeat MUST clear the pool.
  - A subsequent failed connection MUST increase the backoff attempt.
  - A successful connection MUST return the Pool to the "ready" state.
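
The excerpt above does not prescribe how the backoff duration grows with the attempt number; it only requires that the
duration be observed and that jitter makes it non-deterministic. As a minimal sketch, assuming exponential base-2
growth with "full jitter" (the base delay, cap, and jitter strategy below are illustrative assumptions, not
requirements of this spec):

```typescript
// Illustrative only: BASE_DELAY_MS and MAX_DELAY_MS are assumed values,
// not mandated by the specification.
const BASE_DELAY_MS = 100;
const MAX_DELAY_MS = 10_000;

function backoffDurationMS(attempt: number): number {
  // Exponential growth in the attempt number, capped at MAX_DELAY_MS.
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  // "Full jitter": draw uniformly from [0, ceiling). This random draw is
  // why the duration reported in PoolBackoffEvent is non-deterministic.
  return Math.random() * ceiling;
}
```

Any strategy that grows with the attempt number and randomizes the delay would produce the non-determinism that the
event's `attempt` field accounts for.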

```typescript
interface ConnectionPool {
@@ -314,12 +323,17 @@ interface ConnectionPool {
* - "ready": The healthy state of the pool. It can service checkOut requests and create
* connections in the background. The pool can be set to this state via the
* ready() method.
*
* - "backoff": The pool is in the backoff state. maxConnecting is set to 1 and the pool's backoff period
* must be observed before attempting another connection. A subsequent failed connection
* attempt increases the backoff duration. The pool can be set to this state via the
* backoff() method.
*
* - "closed": The pool is destroyed. No more Connections may ever be checked out nor any
* created in the background. The pool can be set to this state via the close()
* method. The pool cannot transition to any other state after being closed.
*/
state: "paused" | "ready" | "closed";
state: "paused" | "ready" | "backoff" | "closed";

// Any of the following connection counts may be computed rather than
// actually stored on the pool.
@@ -360,6 +374,11 @@ interface ConnectionPool {
*/
clear(interruptInUseConnections: Optional<Boolean>): void;

/**
* Enter backoff mode, or increase the backoff attempt if already in backoff mode. Marks the pool as "backoff".
*/
backoff(): void;

/**
* Mark the pool as "ready", allowing checkOuts to resume and connections to be created in the background.
* A pool can only transition from "paused" to "ready". A "closed" pool
@@ -829,6 +848,34 @@ interface PoolClearedEvent {
interruptInUseConnections: Optional<Boolean>;
}

/**
* Emitted when a Connection Pool is in backoff
*/
interface PoolBackoffEvent {
/**
* The ServerAddress of the Endpoint the pool is attempting to connect to.
*/
address: string;

/**
* The backoff attempt number.
*
* The incrementing backoff attempt number. This is included because
* the backoff duration is non-deterministic due to jitter.
*/
attempt: int64;

/**
* The duration the pool will not allow new connection establishments.
*
* A driver MAY choose the type idiomatic to the driver.
* If the type chosen does not convey units, e.g., `int64`,
* then the driver MAY include units in the name, e.g., `durationMS`.
*/
duration: Duration;
}


/**
* Emitted when a Connection Pool is closed
*/
@@ -1074,6 +1121,21 @@ placeholders as appropriate:

> Connection pool for {{serverHost}}:{{serverPort}} cleared for serviceId {{serviceId}}

#### Pool Backoff Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

| Key        | Suggested Type     | Value                                 |
| ---------- | ------------------ | ------------------------------------- |
| message    | String             | "Connection pool in backoff"          |
| attempt    | Int                | The backoff attempt number.           |
| durationMS | Int32/Int64/Double | The backoff duration in milliseconds. |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in
placeholders as appropriate:

> Connection pool for {{serverHost}}:{{serverPort}} in backoff. Attempt: {{attempt}}. Duration: {{durationMS}} ms
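
For illustration, with hypothetical values substituted for the placeholders, a driver might log:

> Connection pool for localhost:27017 in backoff. Attempt: 3. Duration: 850 ms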

#### Pool Closed Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:
@@ -1375,6 +1437,8 @@ to close and remove from its pool a [Connection](#connection) which has unread e

## Changelog

- 2025-XX-YY: Introduce "backoff" state.

- 2025-01-22: Clarify durationMS in logs may be Int32/Int64/Double.

- 2024-11-27: Relaxed the WaitQueue fairness requirement.
@@ -74,6 +74,7 @@ Valid Unit Test Operations are the following:
- `interruptInUseConnections`: Determines whether "in use" connections should also be interrupted
- `pool.close()`: call `close` on Pool
- `pool.ready()`: call `ready` on Pool
- `pool.backoff()`: call `backoff` on Pool
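
As a minimal sketch, a unit test exercising this operation might look like the following (hypothetical: it assumes
`backoff` follows the same operation-name pattern as `ready` and `close`, and that a `ConnectionPoolBackoff` event is
emitted):

```yaml
version: 1
style: unit
description: backoff operation transitions the pool to the backoff state
operations:
  - name: ready
  - name: backoff
events:
  - type: ConnectionPoolBackoff
```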

## Integration Test Format

@@ -0,0 +1,39 @@
version: 1
style: integration
description: pool enters backoff on connection close
runOn:
  - minServerVersion: 4.9.0
failPoint:
  configureFailPoint: failCommand
  mode:
    times: 1
  data:
    failCommands:
      - isMaster
      - hello
    closeConnection: true
poolOptions:
  minPoolSize: 0
operations:
  - name: ready
  - name: start
    target: thread1
  - name: checkOut
    thread: thread1
  - name: waitForEvent
    event: ConnectionCreated
    count: 1
  - name: waitForEvent
    event: ConnectionCheckOutFailed
    count: 1
events:
  - type: ConnectionCheckOutStarted
  - type: ConnectionCreated
  - type: ConnectionClosed
  - type: ConnectionPoolBackoff
  - type: ConnectionCheckOutFailed
ignore:
  - ConnectionCheckedIn
  - ConnectionCheckedOut
  - ConnectionPoolCreated
  - ConnectionPoolReady

@@ -2,39 +2,37 @@ version: 1
style: integration
**Contributor:**
We can't modify spec tests like this anymore, because doing so will break drivers that use submodules to track spec tests and haven't implemented backoff yet; if those drivers then skip these tests, they lose coverage.

**Member Author:**
Hmm, this is a tricky one, because we're changing the behavior of the driver. So I guess we need a new runOnRequirement that is something like supportsPoolBackoff?

**Contributor:**
Yeah, something like that would work.

**Contributor** (@baileympearson, Oct 31, 2025):
Ugh. And we should also do the inverse for the tests we're leaving alone: runOnRequirement of !supportsPoolBackoff. (I just spent some time debugging a failing SDAM test on my branch only to realize it was supposed to fail with my changes).

**Member Author:**
I added poolBackoff but have not yet updated the existing tests that were changed in this PR.
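
For illustration, such a requirement might look like the following in a test's `runOn` (the field name `poolBackoff`
comes from this thread and may still change):

```yaml
runOn:
  - minServerVersion: 4.9.0
    # Hypothetical per this thread: only run the test on drivers that
    # implement pool backoff.
    poolBackoff: true
```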

description: error during minPoolSize population clears pool
runOn:
  -
    # required for appName in fail point
    minServerVersion: "4.9.0"
  - minServerVersion: 4.9.0
failPoint:
  configureFailPoint: failCommand
  # high amount to ensure not interfered with by monitor checks.
  mode: { times: 50 }
  mode: alwaysOn
  data:
    failCommands: ["isMaster","hello"]
    closeConnection: true
    appName: "poolCreateMinSizeErrorTest"
    failCommands:
      - isMaster
      - hello
    errorCode: 18
    appName: poolCreateMinSizeErrorTest
poolOptions:
  minPoolSize: 1
  backgroundThreadIntervalMS: 50
  appName: "poolCreateMinSizeErrorTest"
  appName: poolCreateMinSizeErrorTest
operations:
  - name: ready
  - name: waitForEvent
    event: ConnectionPoolCleared
    count: 1
  # ensure pool doesn't start making new connections
  - name: wait
    ms: 200
events:
  - type: ConnectionPoolReady
    address: 42
  - type: ConnectionCreated
    address: 42
  - type: ConnectionPoolCleared
    address: 42
  - type: ConnectionClosed
    address: 42
    connectionId: 42
    reason: error
  - type: ConnectionPoolCleared
    address: 42
ignore:
  - ConnectionPoolCreated