Handling transient cases when reading a secret#1298
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR aims to reduce AWS “CONNECTION_SETUP” bursts caused by transient secret/config backend failures by distinguishing “missing value” from “temporary backend outage”, retaining previously cached values, and surfacing persistent transient failures as HTTP 503.
Changes:
- Introduces
TransientConfigExceptionand updates AWS SSM/SecretsManager readers to throw it for non–access-denied failures. - Updates
CachingConfigServiceDecoratorto use refresh-based caching and to retain prior cached values when reloads fail transiently (plus retry logic on cold loads). - Adds tests for transient-reload behavior and maps exhausted transient failures to
503 Service UnavailableinApiDataRequestHandler.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| java/impl/aws/src/main/java/co/worklytics/psoxy/aws/SecretsManagerSecretStore.java | Distinguishes access-denied (optional) vs transient failures (throws TransientConfigException). |
| java/impl/aws/src/main/java/co/worklytics/psoxy/aws/ParameterStoreConfigService.java | Distinguishes access-denied (optional) vs transient SSM failures (throws TransientConfigException). |
| java/impl/aws/src/main/java/co/worklytics/psoxy/aws/AwsExceptionUtils.java | Adds helper to detect access-denied/forbidden AWS error codes. |
| java/core/src/main/java/co/worklytics/psoxy/gateway/TransientConfigException.java | Adds a dedicated exception type for transient config backend failures. |
| java/core/src/main/java/co/worklytics/psoxy/gateway/impl/CachingConfigServiceDecorator.java | Implements refresh-after-write caching, transient retry on cold load, and stale-value retention on transient reload failure. |
| java/core/src/test/java/co/worklytics/psoxy/gateway/impl/CachingConfigServiceDecoratorTest.java | Adds tests covering stale retention, explicit transient exceptions, and cold-start retry behavior. |
| java/core/src/main/java/co/worklytics/psoxy/gateway/impl/ApiDataRequestHandler.java | Returns HTTP 503 when config loading fails transiently after internal retries. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @Override | ||
| public String load(ConfigProperty key) { | ||
| return delegate.getConfigPropertyAsOptional(key).orElse(NEGATIVE_VALUE); | ||
| public String load(@NonNull ConfigProperty key) { | ||
| TransientConfigException lastException = null; |
| // Fallback heuristic for backends that still swallow exceptions | ||
| // (e.g. GCP SecretManagerConfigService): if the value was valid | ||
| // before but now comes back empty, assume transient and retain. | ||
| if (NEGATIVE_VALUE.equals(newValue) && !NEGATIVE_VALUE.equals(oldValue)) { | ||
| log.log(Level.WARNING, | ||
| "Backend returned empty for config property {0} which was previously set; assuming transient failure and retaining cached value", | ||
| key.name()); | ||
| return Futures.immediateFuture(oldValue); | ||
| } |
| } catch (DecryptionFailureException e) { | ||
| log.log(Level.SEVERE, "failed to read secret due to decryption error; check lambda's exec role perms for secret " + id); |
eschultink
left a comment
There was a problem hiding this comment.
OK, seems reasonable ... appreciate the tests - as yes, I'm worried about edge cases here but OK.
minor changes requested.
| .maximumSize(100) | ||
| .expireAfterWrite(defaultTtl.getSeconds(), TimeUnit.SECONDS) | ||
| .ticker(ticker) | ||
| .refreshAfterWrite(defaultTtl.getSeconds(), TimeUnit.SECONDS) |
| new Object[]{attempt + 1, MAX_TRANSIENT_RETRIES, key.name()}); | ||
| } | ||
| } | ||
| throw Objects.requireNonNull(lastException); |
| public String load(@NonNull ConfigProperty key) { | ||
| TransientConfigException lastException = null; | ||
| for (int attempt = 0; attempt < MAX_TRANSIENT_RETRIES; attempt++) { | ||
| if (attempt > 0) { |
There was a problem hiding this comment.
don't really get why don't get rid of this condition, move the sleep to end of loop instead of beginning.
aperez-worklytics
left a comment
There was a problem hiding this comment.
Ok, feedback applied
Trying to fix issue that happening in AWS when there are requests after a few and then the instance starts returning CONNECTION_SETUP.
I think the issue could come about how the
ParameterStore/SecretsManagerSecretStorein AWS. Its client is marked asSingletonso the instance is the same across all the lifetime of the function. As a difference with GCP, where evenSecretManagerConfigServiceis marked as singleton, the GCP client is created on each request:psoxy/java/impl/gcp/src/main/java/co/worklytics/psoxy/SecretManagerConfigService.java
Line 93 in 82542de
What I think it is happening in AWS is that after a while (it should match less han an hour of running time) is that the internal token authentication between PS / SMS it is expired, waiting some requests until this is obtained a new fresh one. That matches what I see in the caller side where after some requests returning CONNECTION_SETUP during 1-3 seconds, then the function is working well after that.
So what this PR is doing is mostly:
Ideas are welcomed here
Fixes
Suddenly CONNECTION_SETUP error is killing the reader
Features
Logistics
Change implications
CHANGELOG.mdanything that will show up interraform plan/applythat isn'tobviously a no-op?
alpha, requires major versionchange