polling for challenge ready and certs with timeout and retry-after #104

Open
wants to merge 7 commits into
base: main

Conversation

christianhoelzl
Contributor

This pull request adds two polling methods to the Order API, one for challenge ready and one for the final certificate, both with timeouts.
The polling behavior differs from Order::poll(): in addition to the initial exponential polling delay, it takes a total timeout and a maximum polling interval. This aims at usability: the total timeout is easier to understand, and the maximum polling interval makes the API call more responsive (especially for long timeouts).
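
To illustrate how the three parameters interact, here is a minimal sketch of the resulting sleep schedule. The helper name `polling_delays` and the doubling factor are assumptions for illustration only; the code in this PR may compute its intervals differently.

```rust
use std::time::Duration;

/// Hypothetical helper: lists the sleep intervals a poller would use.
/// Delays start at `initial`, double on every step (assumed factor),
/// are clamped to `max_interval`, and stop as soon as the next sleep
/// would push the accumulated time past `total_timeout`.
fn polling_delays(
    initial: Duration,
    max_interval: Duration,
    total_timeout: Duration,
) -> Vec<Duration> {
    let mut delays = Vec::new();
    let mut elapsed = Duration::ZERO;
    let mut next = initial.min(max_interval);
    while !next.is_zero() && elapsed + next <= total_timeout {
        delays.push(next);
        elapsed += next;
        // Grow exponentially, but never beyond the maximum interval.
        next = (next * 2).min(max_interval);
    }
    delays
}
```

With an initial delay of 250 ms, a maximum interval of 1 s and a total timeout of 3 s, this yields sleeps of 250 ms, 500 ms, 1 s and 1 s. The cap keeps long timeouts responsive because the interval stops growing once it reaches the maximum.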

@djc
Owner

djc commented Apr 29, 2025

I agree we should make this stuff easier but I don't think this exact change is the way to go about it.

First off, I'm not a fan of the duplication in this PR, so I think we should deduplicate most of the code (which should be fairly straightforward).

Second, maybe we should have a bit of a discussion about the optimal way to represent the timeout parameters in the API. I agree that including the total timeout might make sense. I was thinking it might be useful to have a type; this could represent the current API:

use std::time::Duration;

struct Backoff {
    initial_delay: Duration,
    tries: usize,
}

And also provide some constants, like:

const DEFAULT_READY: Backoff = Backoff { initial_delay: Duration::from_millis(250), tries: 5 };

As for how we define backoffs, I think there are multiple entry points:

  • Current approach: require an initial delay and a number of attempts, constant backoff factor of 2. We found the initial delay to be important when trying to minimize the median latency of the entire certificate provisioning process.
  • This PR: working backwards from a total timeout, with a backoff factor (Duration::mul_f32()). Making the total timeout more clearly defined is helpful, but you need to separately define the initial delay if you also want to optimize it.
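
As a rough illustration of the first entry point, the sketch below derives the delay schedule and the worst-case total wait from an initial delay and a number of tries with a fixed factor of 2. The methods `delays` and `total` are hypothetical and not part of instant-acme; the sketch also assumes that every attempt is preceded by one sleep.

```rust
use std::time::Duration;

#[derive(Clone, Copy, Debug)]
struct Backoff {
    initial_delay: Duration,
    tries: usize,
}

impl Backoff {
    /// Delay before each attempt, doubling every time (constant factor 2).
    /// Saturating arithmetic avoids a panic for very large `tries`.
    fn delays(self) -> impl Iterator<Item = Duration> {
        (0..self.tries).map(move |attempt| {
            self.initial_delay
                .saturating_mul(2u32.saturating_pow(attempt as u32))
        })
    }

    /// Worst-case time spent sleeping if every attempt is used.
    fn total(self) -> Duration {
        self.delays().sum()
    }
}
```

For an initial delay of 250 ms and 5 tries, the delays are 250 ms, 500 ms, 1 s, 2 s and 4 s, so the implied total timeout is 7.75 s. This shows why the total is hard to read off the parameters directly.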

Maybe we could look at how some popular crates do it, but I don't think the cost/benefit of depending on an extra crate for this in instant-acme makes sense.

Also, the API should be such that downstreams can still implement their own, so the method we implement is just an ease-of-use feature and doesn't lock the caller into it -- that's how/why we can be a little opinionated.

@cpu
Collaborator

cpu commented Apr 29, 2025

I would also suggest we should consider implementing support for Retry-After header processing to let the server inform the polling interval since it has the best understanding of when the request is likely to succeed.

@djc
Owner

djc commented Apr 29, 2025

> I would also suggest we should consider implementing support for Retry-After header processing to let the server inform the polling interval since it has the best understanding of when the request is likely to succeed.

Definitely! Do ACME servers commonly support that?

@cpu
Collaborator

cpu commented Apr 29, 2025

> Do ACME servers commonly support that?

Pebble and Boulder (Let's Encrypt) do for sure. I'm honestly not sure what the rest of the server-side ecosystem looks like.

@djc
Owner

djc commented Apr 29, 2025

> Do ACME servers commonly support that?
>
> Pebble and Boulder (Let's Encrypt) do for sure. I'm honestly not sure what the rest of the server-side ecosystem looks like.

That's enough for me. 👍

@christianhoelzl
Contributor Author

I'll do some research on whether other ACME servers support the Retry-After header and come back with this info.

At least in https://datatracker.ietf.org/doc/html/rfc8555#section-8.2 it is described as a MUST for the challenge resource:

> The server MUST provide information about its retry state to the
> client via the "error" field in the challenge and the Retry-After
> HTTP header field in response to requests to the challenge resource.

@djc
Owner

djc commented Apr 30, 2025

> I'll do some research on whether other ACME servers support the Retry-After header and come back with this info.
>
> At least in https://datatracker.ietf.org/doc/html/rfc8555#section-8.2 it is described as a MUST for the challenge resource:
>
> The server MUST provide information about its retry state to the
> client via the "error" field in the challenge and the Retry-After
> HTTP header field in response to requests to the challenge resource.

That seems about enough research to me -- I wouldn't worry about checking other servers.

@christianhoelzl
Contributor Author

christianhoelzl commented May 5, 2025

Here is a summary of some testing and research:

ACME server side testing

  • Smallstep CA (version 0.28.3) does not support the Retry-After header
  • Pebble CA (version 2.7.0) does not support the Retry-After header
  • I searched for Retry-After by adding a dbg!(&rsp.parts); to the Order::refresh method; here is an example output:
&rsp.parts = Parts {
    status: 200,
    version: HTTP/2.0,
    headers: {
        "cache-control": "public, max-age=0, no-cache",
        "content-type": "application/json; charset=utf-8",
        "link": "<https://[::1]:5601/dir>;rel=\"index\"",
        "replay-nonce": "f9g3yJqkb6L5PqZGH2vDAg",
        "content-length": "686",
        "date": "Mon, 05 May 2025 12:27:24 GMT",
    },
}

Retry-after header

  • The value of the header can be seconds or http-date as described here
  • AFAIK the http-date is rarely used
  • Parsing of http-date could be achieved by adding the httpdate dependency
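
A minimal sketch of handling the seconds form with the standard library only. The function name `parse_retry_after_seconds` is hypothetical; the http-date form is deliberately not handled here and is reported as `None`, so a caller would need a separate parser (for example the httpdate crate mentioned above) for that case.

```rust
use std::time::Duration;

/// Hypothetical helper: interprets a Retry-After header value given in seconds.
/// Returns `None` for the http-date form and for malformed values.
fn parse_retry_after_seconds(value: &str) -> Option<Duration> {
    let trimmed = value.trim();
    // The seconds form is a non-negative decimal integer, so reject
    // anything that is not purely ASCII digits (signs, dates, empty input).
    if trimmed.is_empty() || !trimmed.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    // Values too large for u64 are treated as malformed.
    trimmed.parse::<u64>().ok().map(Duration::from_secs)
}
```

A caller could fall back to its own backoff schedule whenever this returns `None`, which also covers servers that do not send the header at all.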

ACME4J Client

My conclusion is that the acme4j algorithm is fine, but it should be extended to handle the http-date format (and to return an existing ACME error code). An initial sleep of the exponential polling algorithm could be added outside the polling call. Using Retry-After is an improvement in terms of rate limiting compared to the current polling.

@cpu
Collaborator

cpu commented May 5, 2025

> Pebble CA (version 2.7.0) does not support retry-after header

Ah interesting. I thought there was support for this but it looks like Pebble sets it for ARI, but not in other contexts.

@christianhoelzl
Contributor Author

> Pebble sets it for ARI

Just a side comment: the Retry-After value 6*time.Hour/time.Second in this link looks as if a division by zero might be possible.

@christianhoelzl
Contributor Author

christianhoelzl commented May 6, 2025

I updated both methods in the pull request to respect Retry-After and added the httpdate dependency. For sure not the final solution, but it gives an impression of what it looks like with Retry-After support. This wait_ready is similar to acme4j.

@christianhoelzl
Contributor Author

I want to continue the API discussion.

There are three methods on the public Order API for polling in the current commit.

| Method | Visibility | Polling | Configuration | Comment |
|---|---|---|---|---|
| poll | public | exponential | initial tmo & tries | optimized for latency |
| wait_ready | public | timeout & retry_after | total tmo | this PR |
| wait_certificate | public | timeout & retry_after | total tmo | this PR |

I see two options to continue with the API:

  1. Introduce a new enum PollingStrategy with both polling options and use it as a parameter for poll and wait_certificate. wait_ready becomes obsolete. Unfortunately this is a breaking API change to poll, and I do not know whether that is an option for a minor version change to 0.8. This is close to your first comment on the PR.
  2. Same as 1, but keep the poll method as it is, so the current API is not broken.
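
To make option 1 more concrete, here is a rough sketch of what such an enum could look like. The variant names, the fields and the `next_delay` method are all assumptions for discussion and are not taken from the PR; in particular, the sketch assumes the timeout variant sleeps for the server's Retry-After hint when present and for the maximum interval otherwise.

```rust
use std::time::Duration;

/// Hypothetical shape for the proposed enum; names and fields are illustrative.
enum PollingStrategy {
    /// Current behavior: initial delay, doubled on every attempt.
    Exponential { initial_delay: Duration, tries: u32 },
    /// Total timeout, honoring a server-provided Retry-After hint.
    Timeout { timeout: Duration, max_interval: Duration },
}

impl PollingStrategy {
    /// How long to sleep before the next attempt, or `None` to give up.
    fn next_delay(
        &self,
        attempt: u32,
        elapsed: Duration,
        retry_after: Option<Duration>,
    ) -> Option<Duration> {
        match *self {
            PollingStrategy::Exponential { initial_delay, tries } => {
                if attempt >= tries {
                    return None;
                }
                Some(initial_delay.saturating_mul(2u32.saturating_pow(attempt)))
            }
            PollingStrategy::Timeout { timeout, max_interval } => {
                // `checked_sub` yields None once the timeout is exceeded.
                let remaining = timeout.checked_sub(elapsed)?;
                if remaining.is_zero() {
                    return None;
                }
                // Prefer the server hint, but never sleep past the deadline.
                Some(retry_after.unwrap_or(max_interval).min(remaining))
            }
        }
    }
}
```

Keeping the decision in one method would let poll and wait_certificate share a single loop, and downstream callers could still implement their own polling on top of refresh.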

How to proceed from here? I prefer option 1 as the next iteration in this PR.

@djc
Owner

djc commented May 12, 2025

We already broke compatibility on main, so I'm happy to look at your implementation of option 1.

Collaborator

@cpu cpu left a comment

This looks pretty reasonable to me. I agree with djc that the breaking change you propose makes sense.

There was a Pebble failure in CI (a rejected nonce) but I think it's unrelated and went away after a kick (I think there's a place we don't handle bad nonce retries correctly?).

@christianhoelzl
Contributor Author

The breaking API change is implemented as option 1, and a Pebble test has been added.

@christianhoelzl christianhoelzl changed the title from "polling for challenge ready and certs with timeout" to "polling for challenge ready and certs with timeout and retry-after" May 13, 2025
@christianhoelzl
Contributor Author

> There was a Pebble failure in CI (a rejected nonce) but I think it's unrelated and went away after a kick (I think there's a place we don't handle bad nonce retries correctly?).

I found one issue (fixed in #105)

3 participants