lease: fix race condition between KeepAlive and lease expiry #21389

Open
liyishuai wants to merge 2 commits into etcd-io:main from liyishuai:fix-issue-14758-lease-race

@liyishuai liyishuai commented Feb 27, 2026

Problem

A race condition exists between KeepAlive (Renew) and lease expiry revocation.

When a lease expires, the server's revokeExpiredLeases goroutine proposes a LeaseRevoke through Raft. The apply goroutine then executes Revoke(), which proceeds in two steps:

  1. Delete all keys attached to the lease
  2. Remove the lease from leaseMap

A concurrent KeepAlive request handled by a separate gRPC goroutine can slip in between these two steps. It finds the lease still in leaseMap and successfully renews it, returning a positive TTL to the client — even though the attached keys are already deleted. The client is misled into believing the lease and its keys are still alive.

The race window is small in practice, but real. The integration test reproduces it deterministically using two gofail failpoints that widen the window.
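The interleaving described above can be modeled in a few lines of Go. This is a minimal sketch, not etcd's code: the `lessor` and `lease` types are simplified stand-ins, the `keys` map replaces the real key-value store, and the `afterDeleteKeys` hook is a hypothetical substitute for a gofail failpoint. It runs a renew at exactly the point between the two steps of the revoke, so the outcome is deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

// lease and lessor are simplified stand-ins, not etcd's real types.
type lease struct {
	ttl int64
}

type lessor struct {
	mu       sync.Mutex
	leaseMap map[int64]*lease
	keys     map[int64][]string // keys attached to each lease (stand-in for the KV store)

	// afterDeleteKeys is a hypothetical hook standing in for a failpoint.
	// It runs between the two steps of revoke, with the lock released.
	afterDeleteKeys func()
}

// revoke mirrors the two-step shape from the description:
// delete the attached keys first, then remove the lease from leaseMap.
func (le *lessor) revoke(id int64) {
	le.mu.Lock()
	delete(le.keys, id) // step 1: delete attached keys
	le.mu.Unlock()

	if le.afterDeleteKeys != nil {
		le.afterDeleteKeys() // the race window
	}

	le.mu.Lock()
	delete(le.leaseMap, id) // step 2: remove the lease
	le.mu.Unlock()
}

// renew reports the TTL if the lease is still present in leaseMap.
func (le *lessor) renew(id int64) (int64, bool) {
	le.mu.Lock()
	defer le.mu.Unlock()
	l, ok := le.leaseMap[id]
	if !ok {
		return 0, false
	}
	return l.ttl, true
}

// keyCount returns how many keys are still attached to the lease.
func (le *lessor) keyCount(id int64) int {
	le.mu.Lock()
	defer le.mu.Unlock()
	return len(le.keys[id])
}

// demoRace runs a renew inside the window and reports what it observed.
func demoRace() (ttl int64, found bool, keysLeft int) {
	le := &lessor{
		leaseMap: map[int64]*lease{1: {ttl: 5}},
		keys:     map[int64][]string{1: {"foo"}},
	}
	le.afterDeleteKeys = func() {
		ttl, found = le.renew(1)
		keysLeft = le.keyCount(1)
	}
	le.revoke(1)
	return ttl, found, keysLeft
}

func main() {
	ttl, found, keysLeft := demoRace()
	// prints "renewed=true ttl=5 keysLeft=0"
	fmt.Printf("renewed=%v ttl=%d keysLeft=%d\n", found, ttl, keysLeft)
}
```

The renew succeeds with a positive TTL although no keys remain, which is the misleading result reported in the issue.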

Fixes #14758

Solution

Close lease.revokec at the start of Revoke(), before releasing the lock to delete keys. In Renew(), after acquiring the lock, check whether revokec is closed. If it is, return ErrLeaseNotFound immediately instead of refreshing the lease.

This ensures any Renew() that races with Revoke() observes the revocation signal and reports the lease as gone, rather than returning a positive TTL with the keys already deleted.
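The shape of the fix can be sketched on a simplified model. This is an illustration under assumptions, not the actual diff: `lessor`, `lease`, `errLeaseNotFound` and the `afterDeleteKeys` hook are hypothetical stand-ins for etcd's types, `ErrLeaseNotFound` and a failpoint, and the guard against closing the channel twice is a choice made for this sketch only. What it shows is closing `revokec` under the lock before any key is deleted, and a non-blocking check of that channel in renew.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errLeaseNotFound is a stand-in for etcd's ErrLeaseNotFound.
var errLeaseNotFound = errors.New("lease not found")

// lease and lessor are simplified stand-ins, not etcd's real types.
type lease struct {
	ttl     int64
	revokec chan struct{} // closed once revocation has started
}

type lessor struct {
	mu              sync.Mutex
	leaseMap        map[int64]*lease
	keys            map[int64][]string
	afterDeleteKeys func() // hypothetical hook standing in for a failpoint
}

func (le *lessor) revoke(id int64) {
	le.mu.Lock()
	l, ok := le.leaseMap[id]
	if !ok {
		le.mu.Unlock()
		return
	}
	select {
	case <-l.revokec:
		// Sketch-only guard: another revoke already started.
		le.mu.Unlock()
		return
	default:
	}
	close(l.revokec) // signal "revocation has started" while holding the lock
	delete(le.keys, id)
	le.mu.Unlock()

	if le.afterDeleteKeys != nil {
		le.afterDeleteKeys() // the former race window
	}

	le.mu.Lock()
	delete(le.leaseMap, id)
	le.mu.Unlock()
}

func (le *lessor) renew(id int64) (int64, error) {
	le.mu.Lock()
	defer le.mu.Unlock()
	l, ok := le.leaseMap[id]
	if !ok {
		return -1, errLeaseNotFound
	}
	// Non-blocking check: a receive on a closed channel succeeds immediately.
	select {
	case <-l.revokec:
		return -1, errLeaseNotFound
	default:
	}
	return l.ttl, nil
}

// demoFix runs a renew inside the window and returns its result.
func demoFix() (ttl int64, err error) {
	le := &lessor{
		leaseMap: map[int64]*lease{1: {ttl: 5, revokec: make(chan struct{})}},
		keys:     map[int64][]string{1: {"foo"}},
	}
	le.afterDeleteKeys = func() { ttl, err = le.renew(1) }
	le.revoke(1)
	return ttl, err
}

func main() {
	ttl, err := demoFix()
	// prints "ttl=-1 err=lease not found"
	fmt.Printf("ttl=%d err=%v\n", ttl, err)
}
```

Because the close and the check both happen while the mutex is held, a renew that acquires the lock after the revoke has started always sees the closed channel.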

AI Disclosure

AI tools (Claude Opus 4.6 via Claude Code) were used in preparing this PR. This is disclosed per the AI guidance. All changes have been reviewed and verified by the human author.

The prompt used: "Fix etcd issue #14758: a race condition where KeepAlive returns a positive TTL after lease keys have been deleted during revocation. Add an integration test reproducing the race, then implement a fix using the revokec channel to signal revocation early in Revoke() and check it in Renew()."

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liyishuai
Once this PR has been reviewed and has the lgtm label, please assign spzala for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @liyishuai. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@liyishuai liyishuai force-pushed the fix-issue-14758-lease-race branch 3 times, most recently from 45ecb2c to 7fac93b Compare February 27, 2026 03:59
@liyishuai liyishuai changed the title from "fix(lease): reject KeepAlive during lease revocation" to "lease: fix KeepAlive returning positive TTL during concurrent revocation" Feb 27, 2026
@liyishuai liyishuai changed the title from "lease: fix KeepAlive returning positive TTL during concurrent revocation" to "lease: fix race condition between KeepAlive and lease expiry" Feb 27, 2026
@liyishuai liyishuai force-pushed the fix-issue-14758-lease-race branch from 7fac93b to 27e0919 Compare February 27, 2026 04:23
@liyishuai liyishuai marked this pull request as draft February 27, 2026 05:41
@liyishuai liyishuai force-pushed the fix-issue-14758-lease-race branch 3 times, most recently from 3e00844 to f7d5313 Compare February 27, 2026 09:54
@liyishuai liyishuai marked this pull request as ready for review February 27, 2026 09:55
@liyishuai liyishuai force-pushed the fix-issue-14758-lease-race branch from f7d5313 to fb7a207 Compare February 27, 2026 09:58
@liyishuai liyishuai force-pushed the fix-issue-14758-lease-race branch 2 times, most recently from 94af966 to 96b0ad3 Compare February 27, 2026 10:20
liyishuai and others added 2 commits February 27, 2026 18:25
Add TestV3LeaseKeysDeletedBeforeExpiry to test the invariant: if
KeepAlive returns TTL > 0, the attached keys must still exist within
the lease validity window.

The test uses two failpoints (beforeCheckpointInLeaseRenew and
afterLeaseRevokeDeleteKeys) to widen the race window between Renew
and Revoke. It lets the lease expire while a KeepAlive is delayed,
then checks whether the invariant holds.

Also adds the afterLeaseRevokeDeleteKeys gofail failpoint to
lessor.Revoke(), placed between key deletion and leaseMap removal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>

Close revokec at the start of Revoke() (while holding le.mu) instead
of deferring it to function exit. This signals "revocation has started"
before keys are deleted.

In Renew(), add a non-blocking check on revokec after re-acquiring
le.mu and confirming the lease exists in leaseMap. If revokec is
closed, Renew returns ErrLeaseNotFound instead of refreshing the lease.

Both operations happen under le.mu, so they are properly ordered: once
Revoke closes revokec, any concurrent Renew will see it and refuse to
refresh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
@liyishuai liyishuai force-pushed the fix-issue-14758-lease-race branch from 96b0ad3 to 4dbf457 Compare February 27, 2026 10:25
Development

Successfully merging this pull request may close these issues.

Attached keys deleted before lease expired

2 participants