Skip to content

Fail lock acquisition if service already holds one.#24782

Merged
patrickmann merged 6 commits intomasterfrom
refactor/fail-lock-acquisition-if-factory-holds-one
Mar 12, 2026
Merged

Fail lock acquisition if service already holds one.#24782
patrickmann merged 6 commits intomasterfrom
refactor/fail-lock-acquisition-if-factory-holds-one

Conversation

@dennisoelkers
Copy link
Member

Description

Motivation and Context

The commit adds a safety check to prevent the RefreshingLockService from acquiring a new lock when it already holds an existing one. Before scheduling a new lock, the code now checks if the current lock is not null, and if so, throws an IllegalStateException indicating that acquiring a new lock would cause the existing lock to be lost.

This is a defensive programming improvement that prevents potential bugs where a service might accidentally overwrite an active lock, which could lead to lock management issues in the cluster.

/nocl Internal refactoring.

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactoring (non-breaking change)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have requested a documentation update.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.

@dennisoelkers dennisoelkers requested a review from Copilot January 22, 2026 10:57
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a defensive check to prevent the RefreshingLockService from accidentally overwriting an existing lock. When attempting to schedule a new lock, the service now validates that it doesn't already hold a lock, throwing an IllegalStateException if it does. This prevents potential lock management bugs in the cluster where an active lock could be lost.

Changes:

  • Added a null check for existing locks before scheduling a new lock
  • Throws IllegalStateException with descriptive message if a lock is already held

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +88 to +90
if (lock != null) {
throw new IllegalStateException("Unable to acquire new lock, already holding lock that would get lost: " + lock);
}
Copy link

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new lock validation logic should be covered by a test case that verifies the IllegalStateException is thrown when attempting to schedule a lock while already holding one.

Copilot uses AI. Check for mistakes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot: Please implement a test like that.

Copy link
Contributor

Copilot AI commented Jan 26, 2026

@dennisoelkers I've opened a new pull request, #24830, to work on those changes. Once the pull request is ready, I'll request review from you.

…tion (#24830)

* Initial plan

* Add tests for lock acquisition failure when already holding a lock

Co-authored-by: dennisoelkers <41929+dennisoelkers@users.noreply.github.com>

* Use @ExtendWith and extract mock setup to beforeEach

Co-authored-by: dennisoelkers <41929+dennisoelkers@users.noreply.github.com>

* Remove accidentally committed test output log

Co-authored-by: dennisoelkers <41929+dennisoelkers@users.noreply.github.com>

* Move locks to field initializers and mocks to test methods

Co-authored-by: dennisoelkers <41929+dennisoelkers@users.noreply.github.com>

* Remove test_output.log from .gitignore

Co-authored-by: dennisoelkers <41929+dennisoelkers@users.noreply.github.com>

* Use real single-threaded executor instead of mock

Co-authored-by: dennisoelkers <41929+dennisoelkers@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: dennisoelkers <41929+dennisoelkers@users.noreply.github.com>
@dennisoelkers dennisoelkers marked this pull request as ready for review March 10, 2026 12:27
@dennisoelkers dennisoelkers requested a review from bernd March 10, 2026 12:27
Copy link
Contributor

@patrickmann patrickmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock leak issue seems legit:

Bug: Lock leak when the guard triggers

The check is in the wrong place. By the time scheduleLock() is called, the new lock has already been acquired from lockService.lock(). When the IllegalStateException fires, the newly acquired lock is never released — it's leaked.

// In acquireAndKeepLock:
Optional<Lock> optionalLock = lockService.lock(resource, maxConcurrency); // Lock acquired!
if (optionalLock.isEmpty()) {
    throw new AlreadyLockedException(...);
}
scheduleLock(optionalLock.get()); // IllegalStateException thrown here → lock leaked

The second lock is now held in the underlying LockService (e.g. in MongoDB) but nobody will ever release or refresh it. This could block other nodes from acquiring that resource.

Fix: Move the guard to the beginning of both acquireAndKeepLock methods, before calling lockService.lock():

public void acquireAndKeepLock(String resource, int maxConcurrency) throws AlreadyLockedException {
    if (lock != null) {
        throw new IllegalStateException("Unable to acquire new lock, already holding lock that would get lost: " + lock);
    }
    Optional<Lock> optionalLock = lockService.lock(resource, maxConcurrency);
    // ...
}

Or alternatively, catch the exception and release the new lock before rethrowing.

Test gap

The tests verify the exception is thrown but don't assert that the second lock is properly cleaned up (released). Given the bug above, adding such an assertion (e.g. verify(lockService).unlock(secondLock)) would actually fail and expose the leak.

Also, the mock setup for the second lock (when(lockService.lock(eq("second-resource"), ...))) is unnecessary if the check is moved before the lockService.lock() call, since it would never be reached.

Minor notes

  • The lock field is not volatile and access is unsynchronized — this is pre-existing and out of scope, but worth noting since the new guard reads lock without synchronization.

Copy link
Contributor

@patrickmann patrickmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Since this is a defensive fix there's no way to test it in action; unit tests are fine.

@patrickmann patrickmann merged commit df530c9 into master Mar 12, 2026
23 checks passed
@patrickmann patrickmann deleted the refactor/fail-lock-acquisition-if-factory-holds-one branch March 12, 2026 10:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants