JWT token refresh failure on transient 502 is not retried, permanently breaking Platform connectivity #6967

@robsyme

Bug report

Expected behavior and actual behavior

Expected: When Nextflow's JWT access token expires during a long-running pipeline launched from the Seqera Platform UI, the token refresh via POST /oauth/access_token should be resilient to transient HTTP errors (502, 503, 504, network timeouts). If the refresh endpoint is temporarily unreachable, Nextflow should retry the refresh before giving up.

Actual: A single transient failure (e.g. 502 Bad Gateway) during the token refresh attempt permanently fails the refresh with no retry. All subsequent heartbeat and progress requests fail with 401, and the pipeline transitions to "unknown" status in Platform. The pipeline continues running but Platform loses visibility into it.

The issue is in HxTokenManager.doRefreshToken() (lib-httpx). On any non-200 response or exception, it returns false immediately with no retry logic:

// HxTokenManager.doRefreshToken()
if (response.statusCode() == 200) {
    final var result = handleRefreshResponse(response);
    return result;
} else {
    log.warn("Token refresh failed with status {}: {}", response.statusCode(), response.body());
    return false;  // No retry — transient errors treated as permanent
}

The same gap exists in the TowerXAuth.refreshToken() path used for HTTP file access:

// TowerXAuth.groovy:87
if( resp.statusCode() != 200 )
    return false

This is particularly problematic because:

  1. JWT access tokens have a default TTL of 1 hour, so any pipeline running longer than that will need a refresh
  2. The refresh happens on the Tower-thread, so a single failed refresh kills all Platform communication for that pipeline
  3. There is no distinction between transient errors (502 — server temporarily down) and permanent errors (401 — invalid refresh token)
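The missing distinction in point 3 could be captured by a small classifier such as the sketch below. `RefreshStatus`, `Outcome`, and `classify` are hypothetical names for illustration, not part of lib-httpx or nf-tower. The set of retryable statuses is an assumption borrowed from the list this report says HxClient already retries for normal requests. Anything unrecognised is treated as permanent here, which is a stricter choice than retrying every non-200.

```java
import java.util.Set;

/** Hypothetical helper (not part of lib-httpx): classifies a token refresh HTTP status. */
final class RefreshStatus {

    enum Outcome { SUCCESS, TRANSIENT, PERMANENT }

    // Assumption: the same statuses HxClient is said to retry for normal requests.
    private static final Set<Integer> RETRYABLE_STATUSES = Set.of(429, 500, 502, 503, 504);

    static Outcome classify(int statusCode) {
        if (statusCode == 200) {
            return Outcome.SUCCESS;
        }
        if (RETRYABLE_STATUSES.contains(statusCode)) {
            return Outcome.TRANSIENT;   // infrastructure hiccup: worth retrying
        }
        // 401/403 (invalid refresh token) and any unrecognised status: do not retry
        return Outcome.PERMANENT;
    }

    public static void main(String[] args) {
        for (int code : new int[] {200, 401, 403, 502, 503, 504}) {
            System.out.println(code + " -> " + classify(code));
        }
    }
}
```

Network timeouts and connection errors surface as exceptions rather than status codes, so they would need to be handled separately from this mapping.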

Steps to reproduce the problem

  1. Deploy Seqera Platform with the default JWT access token expiration (3600s / 1 hour)
  2. Launch a pipeline from the Platform UI that runs for longer than 1 hour
  3. During the pipeline run, briefly make the Platform backend unreachable (e.g. restart the backend pod, or simulate a 502 from the reverse proxy) at a moment when the JWT access token has already expired
  4. Observe that Nextflow's Tower-thread logs a 401 error and never recovers, even after the Platform backend is healthy again

The failure requires two conditions to coincide: the access token has already expired, and the refresh attempt lands during the transient outage. In environments with frequent pod scaling (e.g. Kubernetes HPA scaling the Platform backend up and down), hitting this window becomes likely over time.
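Because the window is hard to hit against a real deployment, a local stub may help when testing a fix. The sketch below uses the JDK's built-in `com.sun.net.httpserver.HttpServer` to answer the first few calls to `/oauth/access_token` with a 502 and later calls with a 200. `FlakyRefreshStub` is a hypothetical name. The request and response bodies are placeholders, not the real Platform payloads, so this only exercises status-code handling.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical test stub: simulates a refresh endpoint that is briefly unavailable. */
final class FlakyRefreshStub {

    /** Starts a local server that answers 502 for the first `failures` refresh calls, then 200. */
    static HttpServer start(int failures) throws IOException {
        AtomicInteger calls = new AtomicInteger();
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/oauth/access_token", exchange -> {
            boolean fail = calls.getAndIncrement() < failures;
            // Placeholder bodies only; the real Platform responses are not reproduced here.
            String text = fail ? "<html>502 Bad Gateway</html>" : "{\"access_token\":\"stub\"}";
            byte[] body = text.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(fail ? 502 : 200, body.length);
            try (var out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    /** Sends one POST to the stub and returns the HTTP status code. */
    static int post(HttpServer server) throws IOException, InterruptedException {
        int port = server.getAddress().getPort();
        URI uri = URI.create("http://127.0.0.1:" + port + "/oauth/access_token");
        HttpRequest request = HttpRequest.newBuilder(uri)
                .POST(HttpRequest.BodyPublishers.ofString("grant_type=refresh_token")) // placeholder body
                .build();
        HttpClient client = HttpClient.newBuilder().version(HttpClient.Version.HTTP_1_1).build();
        return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = start(2);
        try {
            for (int i = 0; i < 3; i++) {
                System.out.println(post(server));
            }
        } finally {
            server.stop(0);
        }
    }
}
```

A refresh implementation that retries should succeed against `start(2)` on its third attempt, while the current single-attempt behavior would give up after the first 502.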

Program output

[Tower-thread] WARN  io.seqera.tower.plugin.TowerClient - Unexpected HTTP response.
Failed to send message to https://platform.example.com/api -- received
  - status code : 401
  - response msg: Unauthorized Seqera Platform API access -- Make sure you have specified the correct access token
  - error cause : <html><head><title>502 Bad Gateway</title></head><body>
    <center><h1>502 Bad Gateway</h1></center><hr><center>nginx</center></body></html>

Note the contradictory status code (401) and response body (502 nginx page). The 401 is from the original expired-token request; the 502 HTML is from the failed token refresh attempt against the temporarily-unreachable backend.

Environment

  • Nextflow version: 26.03.0-edge (also affects earlier versions using lib-httpx with JWT refresh)
  • Java version: OpenJDK 21
  • Operating system: Linux (Kubernetes/Fargate, but reproducible on any OS)
  • Bash version: N/A (not shell-dependent)

Additional context

Affected code paths:

  1. HxTokenManager.doRefreshToken() in io.seqera:lib-httpx:2.1.0 — the primary path used by TowerClient for heartbeats and progress updates. No retry on non-200 or exception.
  2. TowerXAuth.refreshToken() in plugins/nf-tower — the path used by XFileSystemProvider for HTTP file access. Same issue: returns false on non-200 with no retry.

Suggested fix:

Add retry with exponential backoff to the token refresh path, consistent with how HxClient already retries on 429/500/502/503/504 for normal requests. Something like:

// In HxTokenManager.doRefreshToken()
final int maxRetries = 3;                 // retries after the first attempt
final int maxAttempts = maxRetries + 1;   // so up to 4 attempts in total
long delay = 1000;                        // initial backoff: 1 second
for (int attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
        var response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 200) {
            return handleRefreshResponse(response);
        }
        if (response.statusCode() == 401 || response.statusCode() == 403) {
            // Permanent failure: the refresh token is invalid, don't retry
            log.warn("Token refresh rejected ({}): {}", response.statusCode(), response.body());
            return false;
        }
        // Transient failure: fall through to the backoff below
        log.warn("Token refresh attempt {}/{} failed ({})", attempt, maxAttempts, response.statusCode());
    } catch (InterruptedException e) {
        // Restore the interrupt flag so the calling thread can shut down promptly
        Thread.currentThread().interrupt();
        return false;
    } catch (Exception e) {
        log.warn("Token refresh attempt {}/{} failed: {}", attempt, maxAttempts, e.getMessage());
    }
    if (attempt < maxAttempts) {
        try {
            Thread.sleep(delay);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        delay = Math.min(delay * 2, 10_000);  // exponential backoff, capped at 10 seconds
    }
}
return false;

Key design point: 401/403 during refresh should NOT be retried (the refresh token itself is invalid), but 502/503/504/timeouts should be retried since they indicate transient infrastructure issues.
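Since the refresh runs on the Tower-thread, the time spent sleeping between retries is also time during which no heartbeats are sent, so it is worth checking how long the proposed schedule can block. The sketch below computes the delays for the suggested parameters (3 retries, 1 second initial delay, 10 second cap). `BackoffSchedule` is a hypothetical helper, and the totals exclude the duration of the HTTP requests themselves.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: computes the sleep schedule of a capped exponential backoff. */
final class BackoffSchedule {

    /** Delays (ms) slept between attempts: doubled each time and capped at capMs. */
    static List<Long> delays(int maxRetries, long initialMs, long capMs) {
        List<Long> result = new ArrayList<>();
        long delay = initialMs;
        for (int retry = 0; retry < maxRetries; retry++) {
            result.add(delay);
            delay = Math.min(delay * 2, capMs);
        }
        return result;
    }

    /** Worst-case total time (ms) spent sleeping if every attempt fails. */
    static long totalMs(List<Long> delays) {
        return delays.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        List<Long> schedule = delays(3, 1000, 10_000);
        // prints "[1000, 2000, 4000] total=7000ms"
        System.out.println(schedule + " total=" + totalMs(schedule) + "ms");
    }
}
```

With the suggested values the worst case adds 7 seconds of sleep before the refresh gives up. A backend restart that takes longer than that would still end in a permanent failure, so the retry count and cap may need tuning to the expected outage length.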
