Skip to content

Handling transient cases when reading a secret#1298

Open
aperez-worklytics wants to merge 7 commits into
rc-v0.6.7from
connection_setup_fixes
Open

Handling transient cases when reading a secret#1298
aperez-worklytics wants to merge 7 commits into
rc-v0.6.7from
connection_setup_fixes

Conversation

@aperez-worklytics

Copy link
Copy Markdown
Contributor

Trying to fix issue that happening in AWS when there are requests after a few and then the instance starts returning CONNECTION_SETUP.

I think the issue could come about how the ParameterStore / SecretsManagerSecretStore in AWS. Its client is marked as Singleton so the instance is the same across all the lifetime of the function. As a difference with GCP, where even SecretManagerConfigService is marked as singleton, the GCP client is created on each request:

try (SecretManagerServiceClient client = SecretManagerServiceClient.create()) {

What I think it is happening in AWS is that after a while (it should match less han an hour of running time) is that the internal token authentication between PS / SMS it is expired, waiting some requests until this is obtained a new fresh one. That matches what I see in the caller side where after some requests returning CONNECTION_SETUP during 1-3 seconds, then the function is working well after that.

So what this PR is doing is mostly:

  • trying to enforce reload on cache
  • detect access denied vs other error, considering it as transient, after couple of retries it will return 503 if still not working

Ideas are welcomed here

Fixes

Suddenly CONNECTION_SETUP error is killing the reader

Features

paste links to issues/tasks in project management

Logistics

paste links to issues/tasks in project management

Change implications

  • dependencies added/changed? yes (explain) / no
  • something important to note in future release notes?
    • NOTE in CHANGELOG.md anything that will show up in terraform plan/apply that isn't
      obviously a no-op?
    • breaking changes? if in module/example that is NOT marked alpha, requires major version
      change

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to reduce AWS “CONNECTION_SETUP” bursts caused by transient secret/config backend failures by distinguishing “missing value” from “temporary backend outage”, retaining previously cached values, and surfacing persistent transient failures as HTTP 503.

Changes:

  • Introduces TransientConfigException and updates AWS SSM/SecretsManager readers to throw it for non–access-denied failures.
  • Updates CachingConfigServiceDecorator to use refresh-based caching and to retain prior cached values when reloads fail transiently (plus retry logic on cold loads).
  • Adds tests for transient-reload behavior and maps exhausted transient failures to 503 Service Unavailable in ApiDataRequestHandler.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
java/impl/aws/src/main/java/co/worklytics/psoxy/aws/SecretsManagerSecretStore.java Distinguishes access-denied (optional) vs transient failures (throws TransientConfigException).
java/impl/aws/src/main/java/co/worklytics/psoxy/aws/ParameterStoreConfigService.java Distinguishes access-denied (optional) vs transient SSM failures (throws TransientConfigException).
java/impl/aws/src/main/java/co/worklytics/psoxy/aws/AwsExceptionUtils.java Adds helper to detect access-denied/forbidden AWS error codes.
java/core/src/main/java/co/worklytics/psoxy/gateway/TransientConfigException.java Adds a dedicated exception type for transient config backend failures.
java/core/src/main/java/co/worklytics/psoxy/gateway/impl/CachingConfigServiceDecorator.java Implements refresh-after-write caching, transient retry on cold load, and stale-value retention on transient reload failure.
java/core/src/test/java/co/worklytics/psoxy/gateway/impl/CachingConfigServiceDecoratorTest.java Adds tests covering stale retention, explicit transient exceptions, and cold-start retry behavior.
java/core/src/main/java/co/worklytics/psoxy/gateway/impl/ApiDataRequestHandler.java Returns HTTP 503 when config loading fails transiently after internal retries.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 76 to +78
@Override
public String load(ConfigProperty key) {
return delegate.getConfigPropertyAsOptional(key).orElse(NEGATIVE_VALUE);
public String load(@NonNull ConfigProperty key) {
TransientConfigException lastException = null;
Comment on lines +105 to +113
// Fallback heuristic for backends that still swallow exceptions
// (e.g. GCP SecretManagerConfigService): if the value was valid
// before but now comes back empty, assume transient and retain.
if (NEGATIVE_VALUE.equals(newValue) && !NEGATIVE_VALUE.equals(oldValue)) {
log.log(Level.WARNING,
"Backend returned empty for config property {0} which was previously set; assuming transient failure and retaining cached value",
key.name());
return Futures.immediateFuture(oldValue);
}
Comment on lines +117 to 118
} catch (DecryptionFailureException e) {
log.log(Level.SEVERE, "failed to read secret due to decryption error; check lambda's exec role perms for secret " + id);

@eschultink eschultink left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, seems reasonable ... appreciate the tests - as yes, I'm worried about edge cases here but OK.

minor changes requested.

.maximumSize(100)
.expireAfterWrite(defaultTtl.getSeconds(), TimeUnit.SECONDS)
.ticker(ticker)
.refreshAfterWrite(defaultTtl.getSeconds(), TimeUnit.SECONDS)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree this is better.

new Object[]{attempt + 1, MAX_TRANSIENT_RETRIES, key.name()});
}
}
throw Objects.requireNonNull(lastException);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pointelss, right?

public String load(@NonNull ConfigProperty key) {
TransientConfigException lastException = null;
for (int attempt = 0; attempt < MAX_TRANSIENT_RETRIES; attempt++) {
if (attempt > 0) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't really get why don't get rid of this condition, move the sleep to end of loop instead of beginning.

@aperez-worklytics aperez-worklytics left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, feedback applied

Base automatically changed from rc-v0.6.5 to main June 17, 2026 16:19
@eschultink eschultink changed the base branch from main to rc-v0.6.6 June 17, 2026 16:30
Base automatically changed from rc-v0.6.6 to main June 18, 2026 20:25
@eschultink eschultink changed the base branch from main to rc-v0.6.7 June 18, 2026 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants