lease: fix race condition between KeepAlive and lease expiry#21389
lease: fix race condition between KeepAlive and lease expiry#21389liyishuai wants to merge 2 commits intoetcd-io:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: liyishuai The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @liyishuai. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
45ecb2c to
7fac93b
Compare
7fac93b to
27e0919
Compare
3e00844 to
f7d5313
Compare
f7d5313 to
fb7a207
Compare
94af966 to
96b0ad3
Compare
Add TestV3LeaseKeysDeletedBeforeExpiry to test the invariant: if KeepAlive returns TTL > 0, the attached keys must still exist within the lease validity window. The test uses two failpoints (beforeCheckpointInLeaseRenew and afterLeaseRevokeDeleteKeys) to widen the race window between Renew and Revoke. It lets the lease expire while a KeepAlive is delayed, then checks whether the invariant holds. Also adds the afterLeaseRevokeDeleteKeys gofail failpoint to lessor.Revoke(), placed between key deletion and leaseMap removal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
Close revokec at the start of Revoke() (while holding le.mu) instead of deferring it to function exit. This signals "revocation has started" before keys are deleted. In Renew(), add a non-blocking check on revokec after re-acquiring le.mu and confirming the lease exists in leaseMap. If revokec is closed, Renew returns ErrLeaseNotFound instead of refreshing the lease. Both operations happen under le.mu, so they are properly ordered: once Revoke closes revokec, any concurrent Renew will see it and refuse to refresh. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
96b0ad3 to
4dbf457
Compare
Problem
A race condition exists between
KeepAlive(Renew) and lease expiry revocation.When a lease expires, the server's
revokeExpiredLeasesgoroutine proposes aLeaseRevokethrough Raft. The apply goroutine then executesRevoke(), which proceeds in two steps:leaseMapA concurrent
KeepAliverequest handled by a separate gRPC goroutine can slip in between these two steps. It finds the lease still inleaseMapand successfully renews it, returning a positive TTL to the client — even though the attached keys are already deleted. The client is misled into believing the lease and its keys are still alive.The race window is small in practice, but real. The integration test reproduces it deterministically using two gofail failpoints that widen the window.
Fixes #14758
Solution
Close
lease.revokecat the start ofRevoke(), before releasing the lock to delete keys. InRenew(), after acquiring the lock, check whetherrevokecis closed. If it is, returnErrLeaseNotFoundimmediately instead of refreshing the lease.This ensures any
Renew()that races withRevoke()observes the revocation signal and reports the lease as gone, rather than returning a positive TTL with the keys already deleted.AI Disclosure
AI tools (Claude Opus 4.6 via Claude Code) were used in preparing this PR. This is disclosed per the AI guidance. All changes have been reviewed and verified by the human author.
The prompt used: "Fix etcd issue #14758: a race condition where KeepAlive returns a positive TTL after lease keys have been deleted during revocation. Add an integration test reproducing the race, then implement a fix using the revokec channel to signal revocation early in Revoke() and check it in Renew()."