JWT token refresh failure on transient 502 is not retried, permanently breaking Platform connectivity #6967

## Bug report

### Expected behavior and actual behavior

**Expected:** When Nextflow's JWT access token expires during a long-running pipeline launched from the Seqera Platform UI, the token refresh via `POST /oauth/access_token` should be resilient to transient HTTP errors (502, 503, 504, network timeouts). If the refresh endpoint is temporarily unreachable, Nextflow should retry the refresh before giving up.

**Actual:** A single transient failure (e.g. 502 Bad Gateway) during the token refresh attempt permanently fails the refresh with no retry. All subsequent heartbeat and progress requests fail with 401, and the pipeline transitions to "unknown" status in Platform. The pipeline continues running but Platform loses visibility into it.

The issue is in `HxTokenManager.doRefreshToken()` (lib-httpx). On any non-200 response or exception, it returns `false` immediately with no retry logic:

```java
// HxTokenManager.doRefreshToken()
if (response.statusCode() == 200) {
    final var result = handleRefreshResponse(response);
    return result;
} else {
    log.warn("Token refresh failed with status {}: {}", response.statusCode(), response.body());
    return false;  // No retry — transient errors treated as permanent
}
```

The same gap exists in the `TowerXAuth.refreshToken()` path used for HTTP file access:

```groovy
// TowerXAuth.groovy:87
if( resp.statusCode() != 200 )
    return false
```

This is particularly problematic because:
1. JWT access tokens have a default TTL of 1 hour, so any pipeline running longer than that will need a refresh
2. The refresh happens on the `Tower-thread`, so a single failed refresh kills all Platform communication for that pipeline
3. There is no distinction between transient errors (502 — server temporarily down) and permanent errors (401 — invalid refresh token)
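The missing distinction in point 3 can be sketched as a small status classifier. This is only an illustration: `RefreshStatusClassifier` and `Outcome` are hypothetical names, not part of lib-httpx, and the transient set is assumed to be the same 429/500/502/503/504 list that `HxClient` is described as retrying for normal requests.

```java
import java.util.Set;

// Hypothetical helper, not part of lib-httpx: sorts refresh responses into
// "worked", "try again" and "give up".
final class RefreshStatusClassifier {
    enum Outcome { SUCCESS, TRANSIENT, PERMANENT }

    // Assumption: same retryable set the issue attributes to HxClient
    private static final Set<Integer> TRANSIENT_CODES = Set.of(429, 500, 502, 503, 504);

    static Outcome classify(int statusCode) {
        if (statusCode == 200) return Outcome.SUCCESS;
        if (TRANSIENT_CODES.contains(statusCode)) return Outcome.TRANSIENT;
        // Everything else (401, 403, 400, ...) is treated as not worth retrying
        return Outcome.PERMANENT;
    }

    public static void main(String[] args) {
        for (int code : new int[] {200, 401, 502}) {
            System.out.println(code + " -> " + classify(code));
        }
    }
}
```

Treating unknown statuses as permanent is the conservative choice here; the opposite default (retry anything that is not 401/403) is also defensible and is what the suggested fix further down does.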

### Steps to reproduce the problem

1. Deploy Seqera Platform with the default JWT access token expiration (3600s / 1 hour)
2. Launch a pipeline from the Platform UI that runs for longer than 1 hour
3. During the pipeline run, briefly make the Platform backend unreachable (e.g. restart the backend pod, or simulate a 502 from the reverse proxy) at a moment when the JWT access token has already expired
4. Observe that Nextflow's `Tower-thread` logs a 401 error and never recovers, even after the Platform backend is healthy again

Two conditions must coincide to trigger the bug: the access token has already expired, and the refresh attempt lands during the transient outage. In environments with frequent pod scaling (e.g. Kubernetes HPA scaling the Platform backend up and down), this becomes increasingly likely the longer a pipeline runs.
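Because the real reproduction depends on timing, it may be easier to reason about with a deterministic stand-in. The sketch below is not Nextflow code: `StubEndpoint` and `refreshOnce` are hypothetical names that only mimic the single-attempt behaviour described above, with a stubbed refresh endpoint that answers 502 once and 200 afterwards.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical simulation of the failure mode; no real HTTP is involved.
final class RefreshOutageSimulation {
    /** Stub refresh endpoint: replays the scripted status codes, then answers 200. */
    static final class StubEndpoint {
        private final Deque<Integer> scripted;
        StubEndpoint(List<Integer> codes) { this.scripted = new ArrayDeque<>(codes); }
        int post() { return scripted.isEmpty() ? 200 : scripted.poll(); }
    }

    /** Mirrors the reported behaviour: one attempt, any non-200 means failure. */
    static boolean refreshOnce(StubEndpoint endpoint) {
        return endpoint.post() == 200;
    }

    public static void main(String[] args) {
        StubEndpoint endpoint = new StubEndpoint(List.of(502)); // one transient 502
        System.out.println("refresh during outage:  " + refreshOnce(endpoint));
        System.out.println("refresh after recovery: " + refreshOnce(endpoint));
    }
}
```

The first call fails and the second would have succeeded, which is the core of the report: the outage is over almost immediately, but nothing ever makes that second call.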

### Program output

```
[Tower-thread] WARN  io.seqera.tower.plugin.TowerClient - Unexpected HTTP response.
Failed to send message to https://platform.example.com/api -- received
  - status code : 401
  - response msg: Unauthorized Seqera Platform API access -- Make sure you have specified the correct access token
  - error cause : <html><head><title>502 Bad Gateway</title></head><body>
    <center><h1>502 Bad Gateway</h1></center><hr><center>nginx</center></body></html>
```

Note the contradictory status code (401) and response body (502 nginx page). The 401 is from the original expired-token request; the 502 HTML is from the failed token refresh attempt against the temporarily-unreachable backend.

### Environment

* Nextflow version: 26.03.0-edge (also affects earlier versions using lib-httpx with JWT refresh)
* Java version: OpenJDK 21
* Operating system: Linux (Kubernetes/Fargate, but reproducible on any OS)
* Bash version: N/A (not shell-dependent)

### Additional context

**Affected code paths:**

1. `HxTokenManager.doRefreshToken()` in `io.seqera:lib-httpx:2.1.0` — the primary path used by `TowerClient` for heartbeats and progress updates. No retry on non-200 or exception.
2. `TowerXAuth.refreshToken()` in `plugins/nf-tower` — the path used by `XFileSystemProvider` for HTTP file access. Same issue: returns `false` on non-200 with no retry.

**Suggested fix:**

Add retry with exponential backoff to the token refresh path, consistent with how `HxClient` already retries on 429/500/502/503/504 for normal requests. Something like:

```java
// In HxTokenManager.doRefreshToken()
// Sketch only: needs java.io.IOException imported and assumes the enclosing
// method is allowed to propagate InterruptedException.
final int maxAttempts = 4;  // 1 initial attempt + 3 retries
long delay = 1000;          // initial backoff in ms
for (int attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
        var response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 200) {
            return handleRefreshResponse(response);
        }
        if (response.statusCode() == 401 || response.statusCode() == 403) {
            // Permanent failure: refresh token is invalid, don't retry
            log.warn("Token refresh rejected ({}): {}", response.statusCode(), response.body());
            return false;
        }
        // Any other status is treated as transient and retried
        log.warn("Token refresh attempt {}/{} failed ({})",
            attempt, maxAttempts, response.statusCode());
    } catch (IOException e) {
        // Network errors and timeouts (HttpTimeoutException is an IOException)
        log.warn("Token refresh attempt {}/{} failed: {}", attempt, maxAttempts, e.getMessage());
    }
    if (attempt < maxAttempts) {
        log.warn("Retrying token refresh in {}ms", delay);
        Thread.sleep(delay);  // InterruptedException propagates instead of being swallowed
        delay = Math.min(delay * 2, 10_000);
    }
}
return false;
```

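Key design point: 401/403 during refresh should NOT be retried (the refresh token itself is invalid), but 502/503/504 and timeouts should be retried since they indicate transient infrastructure issues. Note that the snippet above retries every other non-200 status as well; restricting retries to the 429/500/502/503/504 set would match `HxClient` more closely.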
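To judge whether three retries are enough, it helps to see how long the proposed schedule actually waits. The helper below is a hypothetical illustration (`BackoffSchedule` is not an existing class) that reproduces the doubling-with-cap arithmetic from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes the delays slept between refresh attempts.
final class BackoffSchedule {
    /** Delays in ms: starts at initialMs, doubles each time, capped at maxDelayMs. */
    static List<Long> delays(long initialMs, long maxDelayMs, int retries) {
        List<Long> out = new ArrayList<>();
        long delay = initialMs;
        for (int i = 0; i < retries; i++) {
            out.add(delay);
            delay = Math.min(delay * 2, maxDelayMs);
        }
        return out;
    }

    /** Total time spent sleeping across all retries. */
    static long totalWaitMs(List<Long> delays) {
        return delays.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        List<Long> d = delays(1000, 10_000, 3);
        System.out.println(d + " total=" + totalWaitMs(d) + "ms");
        // prints: [1000, 2000, 4000] total=7000ms
    }
}
```

With the suggested values the refresh gives up after roughly 7 seconds of waiting (plus request time). A backend pod restart can plausibly take longer than that, so the retry count and cap may be worth making configurable.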
