JWT token refresh failure on transient 502 is not retried, permanently breaking Platform connectivity #6967
Description
Bug report
Expected behavior and actual behavior
Expected: When Nextflow's JWT access token expires during a long-running pipeline launched from the Seqera Platform UI, the token refresh via POST /oauth/access_token should be resilient to transient HTTP errors (502, 503, 504, network timeouts). If the refresh endpoint is temporarily unreachable, Nextflow should retry the refresh before giving up.
Actual: A single transient failure (e.g. 502 Bad Gateway) during the token refresh attempt permanently fails the refresh with no retry. All subsequent heartbeat and progress requests fail with 401, and the pipeline transitions to "unknown" status in Platform. The pipeline continues running but Platform loses visibility into it.
The issue is in HxTokenManager.doRefreshToken() (lib-httpx). On any non-200 response or exception, it returns false immediately with no retry logic:
// HxTokenManager.doRefreshToken()
if (response.statusCode() == 200) {
    final var result = handleRefreshResponse(response);
    return result;
} else {
    log.warn("Token refresh failed with status {}: {}", response.statusCode(), response.body());
    return false; // No retry: transient errors are treated as permanent
}

The same gap exists in the TowerXAuth.refreshToken() path used for HTTP file access:
// TowerXAuth.groovy:87
if( resp.statusCode() != 200 )
    return false

This is particularly problematic because:
- JWT access tokens have a default TTL of 1 hour, so any pipeline running longer than that will need a refresh
- The refresh happens on the Tower-thread, so a single failed refresh kills all Platform communication for that pipeline
- There is no distinction between transient errors (e.g. 502, server temporarily down) and permanent errors (e.g. 401, invalid refresh token)
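The missing distinction in the last point can be sketched as a small classifier. This is only an illustration: the class name RefreshStatus and its Outcome enum are hypothetical and not part of lib-httpx, and treating every status other than 200, 502, 503 and 504 as permanent is an assumption of this sketch rather than confirmed behavior.

```java
/** Hypothetical helper; the names and groupings are illustrative, not lib-httpx API. */
final class RefreshStatus {
    enum Outcome { SUCCESS, TRANSIENT, PERMANENT }

    /** Maps the HTTP status of a refresh response to a retry decision. */
    static Outcome classify(int statusCode) {
        if (statusCode == 200) {
            return Outcome.SUCCESS;
        }
        // Gateway and availability errors named in this report: worth retrying.
        if (statusCode == 502 || statusCode == 503 || statusCode == 504) {
            return Outcome.TRANSIENT;
        }
        // Assumption: anything else (including 401/403) is not retried.
        return Outcome.PERMANENT;
    }
}
```

With such a helper, a 502 from the proxy would map to TRANSIENT while a 401 from the token endpoint would map to PERMANENT, so only the former would trigger another attempt.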
Steps to reproduce the problem
- Deploy Seqera Platform with the default JWT access token expiration (3600s / 1 hour)
- Launch a pipeline from the Platform UI that runs for longer than 1 hour
- During the pipeline run, briefly make the Platform backend unreachable (e.g. restart the backend pod, or simulate a 502 from the reverse proxy) at a moment when the JWT access token has already expired
- Observe that Nextflow's Tower-thread logs a 401 error and never recovers, even after the Platform backend is healthy again
The timing window is: the access token must have expired AND the refresh attempt must coincide with the transient outage. In environments with frequent pod scaling (e.g. Kubernetes HPA scaling the Platform backend up and down), this becomes likely over time.
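Because that window is hard to hit on a real deployment, a local stub can stand in for the briefly unhealthy backend when testing client-side retry behavior. The sketch below uses the JDK's built-in com.sun.net.httpserver.HttpServer to answer the first few calls to /oauth/access_token with 502 and later calls with 200. The class name FlakyRefreshServer and the dummy JSON body are assumptions made for illustration; they are not part of Nextflow or Seqera Platform, and the real response format is not reproduced here.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical local stand-in for a refresh endpoint behind a briefly failing proxy. */
final class FlakyRefreshServer {

    /** Starts a server on an ephemeral port; the first `failures` requests get a 502. */
    static HttpServer start(int failures) throws IOException {
        AtomicInteger seen = new AtomicInteger();
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/oauth/access_token", exchange -> {
            boolean fail = seen.getAndIncrement() < failures;
            // Placeholder bodies: an nginx-style error page, then a dummy token payload.
            String text = fail
                    ? "<html><body><h1>502 Bad Gateway</h1></body></html>"
                    : "{\"access_token\":\"dummy\"}";
            byte[] body = text.getBytes(StandardCharsets.UTF_8);
            exchange.getRequestBody().readAllBytes(); // drain the request before replying
            exchange.sendResponseHeaders(fail ? 502 : 200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Calling FlakyRefreshServer.start(1) and sending two POST requests to the returned server's port should yield a 502 followed by a 200, which is the sequence a retrying refresh would need to survive. The caller is responsible for invoking stop(0) on the returned server.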
Program output
[Tower-thread] WARN io.seqera.tower.plugin.TowerClient - Unexpected HTTP response.
Failed to send message to https://platform.example.com/api -- received
- status code : 401
- response msg: Unauthorized Seqera Platform API access -- Make sure you have specified the correct access token
- error cause : <html><head><title>502 Bad Gateway</title></head><body>
<center><h1>502 Bad Gateway</h1></center><hr><center>nginx</center></body></html>
Note the contradictory status code (401) and response body (502 nginx page). The 401 is from the original expired-token request; the 502 HTML is from the failed token refresh attempt against the temporarily-unreachable backend.
Environment
- Nextflow version: 26.03.0-edge (also affects earlier versions using lib-httpx with JWT refresh)
- Java version: OpenJDK 21
- Operating system: Linux (Kubernetes/Fargate, but reproducible on any OS)
- Bash version: N/A (not shell-dependent)
Additional context
Affected code paths:
- HxTokenManager.doRefreshToken() in io.seqera:lib-httpx:2.1.0: the primary path used by TowerClient for heartbeats and progress updates. No retry on non-200 or exception.
- TowerXAuth.refreshToken() in plugins/nf-tower: the path used by XFileSystemProvider for HTTP file access. Same issue: returns false on non-200 with no retry.
Suggested fix:
Add retry with exponential backoff to the token refresh path, consistent with how HxClient already retries on 429/500/502/503/504 for normal requests. Something like:
// In HxTokenManager.doRefreshToken()
int maxRetries = 3;
long delay = 1000; // initial backoff: 1 second
for (int attempt = 0; attempt <= maxRetries; attempt++) {
    try {
        var response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 200) {
            return handleRefreshResponse(response);
        }
        if (response.statusCode() == 401 || response.statusCode() == 403) {
            // Permanent failure: the refresh token is invalid, don't retry
            log.warn("Token refresh rejected ({}): {}", response.statusCode(), response.body());
            return false;
        }
        // Transient failure: fall through to the backoff below
        log.warn("Token refresh attempt {}/{} failed ({})",
                attempt + 1, maxRetries + 1, response.statusCode());
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt and stop retrying
        return false;
    } catch (Exception e) {
        log.warn("Token refresh attempt {}/{} failed: {}", attempt + 1, maxRetries + 1, e.getMessage());
    }
    if (attempt < maxRetries) {
        log.warn("Retrying token refresh in {}ms", delay);
        try {
            Thread.sleep(delay);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        delay = Math.min(delay * 2, 10_000);
    }
}
return false;

Key design point: 401/403 during refresh should NOT be retried (the refresh token itself is invalid), but 502/503/504/timeouts should be retried since they indicate transient infrastructure issues.
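To make the cost of the suggested retry loop concrete, the backoff schedule it implies can be worked out separately. The sketch below is a standalone illustration (BackoffSchedule is a hypothetical name, not an existing class) that lists the sleeps between attempts for a given retry count, initial delay and cap, using the same doubling rule as the loop above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that lists the delays slept between refresh attempts. */
final class BackoffSchedule {

    /** One delay per retry: starts at initialMs, doubles each time, never exceeds capMs. */
    static List<Long> delays(int maxRetries, long initialMs, long capMs) {
        List<Long> out = new ArrayList<>();
        long delay = initialMs;
        for (int retry = 0; retry < maxRetries; retry++) {
            out.add(delay);
            delay = Math.min(delay * 2, capMs);
        }
        return out;
    }

    /** Total time spent sleeping if every attempt fails. */
    static long totalMs(List<Long> delays) {
        long sum = 0;
        for (long d : delays) {
            sum += d;
        }
        return sum;
    }
}
```

With the values from the suggestion (3 retries, 1000 ms initial delay, 10 000 ms cap), the delays are 1000, 2000 and 4000 ms, so a refresh that never succeeds adds 7 seconds of sleeping on top of the request time for the four attempts. The cap only takes effect from the fifth retry onward.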