helios/deploy flaked setting up external DNS routes? #9206

@iximeow

in CI for #9204: https://github.com/oxidecomputer/omicron/pull/9204/checks?check_run_id=52638510286

or buildomat: https://buildomat.eng.oxide.computer/wg/0/details/01K7FEEGS766GRVD9CVQN335B2/Y6RoWuCOWY9T52pA6RzMDnb6PwDV7bNwme0VY4fo1QMJQg0E/01K7FEF1PJQFT98MRGX5C1JHNG

the immediate issue is here, but is a bit downstream of what went wrong:
https://buildomat.eng.oxide.computer/wg/0/details/01K7FEEGS766GRVD9CVQN335B2/Y6RoWuCOWY9T52pA6RzMDnb6PwDV7bNwme0VY4fo1QMJQg0E/01K7FEF1PJQFT98MRGX5C1JHNG#S1369

Excerpt from the log showing the failure:

1367  2025-10-13T19:55:00.509Z  2025-10-13 19:55:00.497596603 UTC: attempting to log into API 
1368  2025-10-13T19:55:15.541Z  2025-10-13 19:55:15.529168673 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local)
1369  2025-10-13T19:55:16.546Z  Error: logging in
1370  2025-10-13T19:55:16.546Z  
1371  2025-10-13T19:55:16.546Z  Caused by: 
1372  2025-10-13T19:55:16.822Z      timed out after 609.319694462s

So on the surface, we failed to log into the API after ten minutes. This looks a lot like #6772: as other folks found there, the deploy script is just reporting that Nexus isn't reachable after ten minutes.
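To illustrate why the `timed out after 609.319694462s` line says so little about the root cause, here is a minimal sketch of a retry-until-deadline loop of the kind that produces such a message. This is an assumption about the general shape, not the actual deploy script code: `retry_until` and its parameters are hypothetical names introduced for illustration.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper (not from omicron): call `attempt` until it succeeds
/// or `timeout` has elapsed. On timeout, return the elapsed time together
/// with the last error observed, so the caller can report both.
fn retry_until<T, E>(
    timeout: Duration,
    interval: Duration,
    mut attempt: impl FnMut() -> Result<T, E>,
) -> Result<T, (Duration, E)> {
    let start = Instant::now();
    loop {
        match attempt() {
            Ok(value) => return Ok(value),
            Err(err) => {
                let elapsed = start.elapsed();
                // Give up once the deadline has passed, keeping the last error.
                if elapsed >= timeout {
                    return Err((elapsed, err));
                }
            }
        }
        // Wait before the next attempt.
        sleep(interval);
    }
}
```

With a loop like this, every failed attempt looks the same from the outside ("error sending request"), so the final error only says that the endpoint never became reachable within the deadline, not which part of the control plane failed to come up.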

Simultaneously, one of the external DNS zones failed to get a route set up: https://buildomat.eng.oxide.computer/wg/0/artefact/01K7FEEGS766GRVD9CVQN335B2/Y6RoWuCOWY9T52pA6RzMDnb6PwDV7bNwme0VY4fo1QMJQg0E/01K7FEF1PJQFT98MRGX5C1JHNG/01K7FJD9ASSQVHKDNH9QG5QZGP/oxide-opte-interface-setup:default.log?format=x-bunyan

zone-setup: failed to ensure OPTE gateway route on interface opte2 with gateway 172.30.1.1 and IP 172.30.1.5

Caused by:
Command [/usr/sbin/route add -host 172.30.1.1 172.30.1.5 -interface -ifp opte2] executed and failed with status: exit status: 128  stdout: add host 172.30.1.1: gateway 172.30.1.5: Network is unreachable

Since the other external DNS zone seems fine (19386011-2bd0-4d7f-bcf9-f34d6cc33633, with logs 19386011-2bd0-4d7f-bcf9-f34d6cc33633/root/var/svc/log/oxide-external_dns:default.log), I'm not really sure how this ends up with the CLI not being able to talk to the partially-up control plane. I also can't see why adding that route would fail. It's all local??

A re-run did succeed: https://github.com/oxidecomputer/omicron/runs/52650447795

Labels: Test Flake
