docs(adr): remote credential injection design for #773 (eng-reviewed)#822
docs(adr): remote credential injection design for #773 (eng-reviewed)#822kaiweijw wants to merge 12 commits into
Conversation
|
Follow-up commit What changed
Eng-review audit trail (moved from the ADR per item 6)The ADR was eng-reviewed via
3 scope expansions were folded during review: Phase 3.5 (multi-node fan-out), Phase 4.5 (code-integrity infrastructure including the new standalone-page subsection), and an explicit Test Strategy section. Eng review verdict: CLEAR — ready to implement. Final ADR is 472 lines. Reviewers: read the file end-to-end and push back on anything that looks wrong before implementation starts. |
…s-model audit Two background agents (codex/gpt-5.5 and opencode/glm-5.1) independently reviewed PR #822 and surfaced findings the /plan-eng-review missed. Cross-model agreement on the P0; the other 13 items are confirmed against actual code. P0 — Error code collision (both agents): The ADR reserved 8004-8009 for remote credential injection but backend/src/errors/mod.rs:374-375 already assigns 8004 to NodeCredentialMissing and 8005 to WsProxyDownstream. Implementing the ADR as written would have either duplicated public error codes or required a breaking renumber of live errors. Shift the RCI range to 8006-8011 with the following assignments: 8006 PendingCredentialDecryptFailed 8007 PendingCredentialVersionUnsupported 8008 PendingCredentialCiphertextTooLarge 8009 PendingCredentialPubkeyAwaiting (renamed from PubkeyNotPosted to reflect the polling-with-backoff semantics, not a terminal "not posted" state) 8010 PendingCredentialNodeOffline 8011 PendingCredentialQueueFull Test rows and failure mode table updated to reference the new numbers. P1 — T1 threat-model overclaim (codex): G6 and the T1 row claim protection against a fully-compromised NyxID server via Phase 4.5. But the standalone HTML, SRI hashes, fingerprint display, and verify button are all served by NyxID itself. A fully-compromised server can substitute them in lockstep. Downgrade T1 wording to make the operational dependency explicit: Phase 4.5 provides "detection assuming the admin verifies the fingerprint out-of-band against the signed manifest at a separate origin." Without admin verification, T1 degrades to T2 in practice. P2 — supported_features vs per-pending pubkey-readiness contradiction (codex): Section said "No timing guesses, no polling fallback" while error 8007 said "Frontend polls or falls back." Split into two distinct signals: feature detection (sync, cached, via supported_features set on the Node record) and per-pending pubkey readiness (async, polling with exponential backoff up to 30s, error code 8009 PendingCredentialPubkeyAwaiting). Feature detection is fast; polling is bounded to nodes that have already advertised crypto_v1. P2 — /ciphertext sync semantics underspecified (codex): Sequence diagram showed 4xx on node decrypt failure but didn't specify how the HTTP handler waits for the node. Reuse the existing send_credential_update_and_wait pattern at node_ws_manager.rs:1558 (allocate request_id, register oneshot waiter, send WSS frame, await ack with timeout). New section "Sync vs async response semantics" specifies the status-code mapping for consume / decrypt_failed / timeout / offline. P2 — Frame name correction (codex): Sequence diagram used pending_credential_available (singular, with metadata). Actual code at ws_client.rs:433 sends pending_credentials_available (plural, no metadata) as a nudge; node pulls details via existing GET /api/v1/node-agent/pending-credentials. Updated the diagram to reuse the existing nudge/pull pattern with pending_credential_pubkey added as a new WSS frame the node posts per pending credential lacking a sealed privkey. P2 — Ciphertext encoding for JSON WSS (opencode): ciphertext: Vec<u8> with no encoding spec — JSON WSS frames need base64url. Added explicit doc-comment in the CryptoBundle struct noting the base64url-string-on-the-wire / BSON-binary-in-MongoDB serde-adapter pattern. P2 — supported_features doesn't exist yet (opencode): Phase 1 work; ADR was writing about it as if it already existed. Added "(new field, added in Phase 1)" annotation. P2 — AGENTS.md actual-state correction (opencode): The AGENTS.md bullet "Error codes 8000-8003 are reserved" is itself out of date. ADR now acknowledges this explicitly and tells implementer to update AGENTS.md to reflect 8000-8005 (existing) plus 8006-8011 (new RCI range). P3 cleanups: - findAndModify -> find_one_and_update (idiomatic for the Rust driver) - srl_* -> sri_* typo in test names (3 occurrences) - ChronoAIProject/issues/ -> ChronoAIProject/NyxID/issues/ in reference links - Per-pending rate limit mechanism specified (in-memory DashMap<PendingId, RateLimitState>, TTL-ephemeral, distinct from global RATE_LIMIT_PER_SECOND middleware) - Standalone HTML weight target loosened from ~5 KB to ~30 KB gzipped (more realistic given @noble/curves x25519 alone is ~6 KB minified) with lazy-load fallback recommendation when SubtleCrypto is available natively - Multi-node fan-out: explicit behavior when expires_at fires during partial_decrypted (logical credential goes to expired, previously-accepted node credentials are NOT rolled back per idempotency principle) No protocol changes. No threat-model changes (only one wording clarification for T1's operational dependency). All edits are internal-consistency, encoding/transport spec sharpening, and naming/reference corrections. Calibration note: /plan-eng-review missed the P0 because it trusted the AGENTS.md "8000-8003 reserved" line at face value instead of grepping backend/src/errors/mod.rs directly. Cross-model independent review with code-level verification caught it. Worth adopting as a default check before locking error-code ranges in future ADRs.
|
Follow-up commit How this got caughtTwo background agents (codex/gpt-5.5 xhigh, opencode/glm-5.1) ran review passes against this PR independently. Both surfaced the same P0 plus a long tail of P2/P3 items that my earlier P0 — Error code collisionThe ADR reserved Both Other findings (all confirmed against code, fixed)
Calibration
Diff stats
|
Design document for #773 (org-admin remote credential injection without SSH-to-node). Status: proposed, eng-reviewed via /plan-eng-review with 9 issues raised and resolved, 0 unresolved, 0 critical gaps. Protocol shape: browser-driven end-to-end encrypted relay using X25519 ECDH + HKDF-SHA256 + XChaCha20-Poly1305. NyxID server is strictly a passthrough — never derives keys, never sees plaintext. Node generates an ephemeral X25519 keypair per pending credential, sealed at rest by the node's long-lived auth key (survives restart, dropped on consume). Eng review folded three scope expansions into the design: - Multi-node fan-out with all-N semantics (Phase 3.5) - Frontend code-integrity infrastructure: SRI hashes, signed release channel, admin verification UX (Phase 4.5) - Explicit Test Strategy section enumerating 25 mandatory tests by layer (node crypto, backend endpoints, frontend, cross-language interop fixtures, e2e, audit, regression, eval) Hardened design decisions captured: - Persistent sealed privkey on node (survives restart) instead of in-memory-only - Manual-accept gate is the default; auto-accept requires a second opt-in (preserves separation-of-duties for existing orgs) - First-push-wins atomicity + per-pending rate limit (1 successful POST, 3 failed in 60s, 5min lockout) - 16 KB ciphertext cap, error codes 8004-8009 reserved - CryptoBundle as a typed sub-document on NodePendingCredential - `supported_features` set on Node for version detection Closes the design phase of #773. Implementation tracked in subsequent PRs against the phases enumerated in the doc. Tracking: #769.
… codes The "Design Review Feedback & Best Practices" section added in 9b132a7 introduced refinements that didn't fully align with the existing spec elsewhere in the document. Reconcile: - Phase 3.5 (Multi-node fan-out) step 6 previously said "Decrypt failure on any one node bricks the logical credential" — which directly contradicted Best Practice §2's retry-only-failed-nodes semantics. Rewrite steps 5 and 6 to describe the retry path explicitly: partial_decrypted is a recoverable state, frontend re-runs Phase 3 against the failed subset, consumed is reached when all N have accepted. - The test row e2e_multi_node_partial_failure was renamed to e2e_multi_node_partial_failure_then_retry and updated to assert the recovery path including idempotency of previously-accepted nodes. - Error code 8009 (previously "(reserved)") is now formally assigned to PendingCredentialQueueFull — the natural use case for the 5-per-node ciphertext queue cap described in Best Practice §3. - Best Practice §1: replace the hardcoded "1 hour" TTL with a reference to NodePendingCredential.expires_at so the local sweep stays aligned with the server-side TTL if the server default ever changes. - Best Practice §4: pin the mechanism to MFA confirmation explicitly, since NyxID already has the /auth/mfa/verify endpoint. Note why multi-admin approval is not used (no existing approval-queue subsystem; would be a separate project). All five edits are internal-consistency / clarification only — no protocol or threat-model changes.
Seven non-blocking review notes from Grok addressed: 1. Privkey sealing — replace the vague "crypto/aes.rs style" reference with a concrete cross-reference to the existing cli/src/node/secret_backend.rs::SecretBackend trait, naming the sibling methods (store_auth_token, store_signing_secret, store_credential_value) so the Phase 2 implementer has no ambiguity about which mechanism to reuse. 2. WSS frame classification — add a one-paragraph clarification that the new pending_credential_pubkey / pending_credential_ciphertext frames are node-control protocol traffic (same class as node_metrics, proxy_request, etc.) and bypass the ws_frame_injections rules on DownstreamService / UserService. Implementation note points to node_ws.rs vs ws_frame_injector.rs. 3. Standalone credential-accept page — add a new subsection to Phase 4.5 specifying the page is served as minimal standalone HTML (strict CSP, no main SPA bundle, ~5 KB form shim) to reduce the substitution surface, and that the page's HTML hash is itself part of the signed release manifest. 4. Phase 4.5 runbook items — new table enumerating the infra questions that must be answered before the phase starts: GPG signing key holder, rotation procedure, releases.nyxid.dev hosting, manifest publish pipeline, signature scheme, expiry. Implementer documents the chosen answers in docs/RELEASE_INTEGRITY.md. 5. AGENTS.md vs CLAUDE.md correction — Grok caught a factual error: the error-code listing lives at AGENTS.md:85, not in CLAUDE.md. Phase 1 update target corrected. 6. Strip GSTACK REVIEW REPORT table — remove the process artifact from the committed ADR (eng-review summary moves to PR description and commit history). Keeps the ADR a pure design doc for future readers. 7. Browser crypto floor — sharpen the open question to call for real admin-population analytics before Phase 3, with a concrete metric format and a lazy-load suggestion if SubtleCrypto coverage is overwhelmingly high. No protocol or threat-model changes. All edits are clarifications, operational hardening, and doc hygiene.
…s-model audit Two background agents (codex/gpt-5.5 and opencode/glm-5.1) independently reviewed PR #822 and surfaced findings the /plan-eng-review missed. Cross-model agreement on the P0; the other 13 items are confirmed against actual code. P0 — Error code collision (both agents): The ADR reserved 8004-8009 for remote credential injection but backend/src/errors/mod.rs:374-375 already assigns 8004 to NodeCredentialMissing and 8005 to WsProxyDownstream. Implementing the ADR as written would have either duplicated public error codes or required a breaking renumber of live errors. Shift the RCI range to 8006-8011 with the following assignments: 8006 PendingCredentialDecryptFailed 8007 PendingCredentialVersionUnsupported 8008 PendingCredentialCiphertextTooLarge 8009 PendingCredentialPubkeyAwaiting (renamed from PubkeyNotPosted to reflect the polling-with-backoff semantics, not a terminal "not posted" state) 8010 PendingCredentialNodeOffline 8011 PendingCredentialQueueFull Test rows and failure mode table updated to reference the new numbers. P1 — T1 threat-model overclaim (codex): G6 and the T1 row claim protection against a fully-compromised NyxID server via Phase 4.5. But the standalone HTML, SRI hashes, fingerprint display, and verify button are all served by NyxID itself. A fully-compromised server can substitute them in lockstep. Downgrade T1 wording to make the operational dependency explicit: Phase 4.5 provides "detection assuming the admin verifies the fingerprint out-of-band against the signed manifest at a separate origin." Without admin verification, T1 degrades to T2 in practice. P2 — supported_features vs per-pending pubkey-readiness contradiction (codex): Section said "No timing guesses, no polling fallback" while error 8007 said "Frontend polls or falls back." Split into two distinct signals: feature detection (sync, cached, via supported_features set on the Node record) and per-pending pubkey readiness (async, polling with exponential backoff up to 30s, error code 8009 PendingCredentialPubkeyAwaiting). Feature detection is fast; polling is bounded to nodes that have already advertised crypto_v1. P2 — /ciphertext sync semantics underspecified (codex): Sequence diagram showed 4xx on node decrypt failure but didn't specify how the HTTP handler waits for the node. Reuse the existing send_credential_update_and_wait pattern at node_ws_manager.rs:1558 (allocate request_id, register oneshot waiter, send WSS frame, await ack with timeout). New section "Sync vs async response semantics" specifies the status-code mapping for consume / decrypt_failed / timeout / offline. P2 — Frame name correction (codex): Sequence diagram used pending_credential_available (singular, with metadata). Actual code at ws_client.rs:433 sends pending_credentials_available (plural, no metadata) as a nudge; node pulls details via existing GET /api/v1/node-agent/pending-credentials. Updated the diagram to reuse the existing nudge/pull pattern with pending_credential_pubkey added as a new WSS frame the node posts per pending credential lacking a sealed privkey. P2 — Ciphertext encoding for JSON WSS (opencode): ciphertext: Vec<u8> with no encoding spec — JSON WSS frames need base64url. Added explicit doc-comment in the CryptoBundle struct noting the base64url-string-on-the-wire / BSON-binary-in-MongoDB serde-adapter pattern. P2 — supported_features doesn't exist yet (opencode): Phase 1 work; ADR was writing about it as if it already existed. Added "(new field, added in Phase 1)" annotation. P2 — AGENTS.md actual-state correction (opencode): The AGENTS.md bullet "Error codes 8000-8003 are reserved" is itself out of date. ADR now acknowledges this explicitly and tells implementer to update AGENTS.md to reflect 8000-8005 (existing) plus 8006-8011 (new RCI range). P3 cleanups: - findAndModify -> find_one_and_update (idiomatic for the Rust driver) - srl_* -> sri_* typo in test names (3 occurrences) - ChronoAIProject/issues/ -> ChronoAIProject/NyxID/issues/ in reference links - Per-pending rate limit mechanism specified (in-memory DashMap<PendingId, RateLimitState>, TTL-ephemeral, distinct from global RATE_LIMIT_PER_SECOND middleware) - Standalone HTML weight target loosened from ~5 KB to ~30 KB gzipped (more realistic given @noble/curves x25519 alone is ~6 KB minified) with lazy-load fallback recommendation when SubtleCrypto is available natively - Multi-node fan-out: explicit behavior when expires_at fires during partial_decrypted (logical credential goes to expired, previously-accepted node credentials are NOT rolled back per idempotency principle) No protocol changes. No threat-model changes (only one wording clarification for T1's operational dependency). All edits are internal-consistency, encoding/transport spec sharpening, and naming/reference corrections. Calibration note: /plan-eng-review missed the P0 because it trusted the AGENTS.md "8000-8003 reserved" line at face value instead of grepping backend/src/errors/mod.rs directly. Cross-model independent review with code-level verification caught it. Worth adopting as a default check before locking error-code ranges in future ADRs.
67fe444 to
8ddea80
Compare
…+ Grok) Three agents independently reviewed PR #822. Grok gave LGTM after exhaustive code cross-referencing; Codex and Opencode surfaced 14 specific items (2 P1, 5 P2, 5 P3, 2 informational). All addressed: P1 — Status field missing on model (Codex): CryptoBundle had no state tracking for the lifecycle the protocol describes. Add RemoteCryptoState enum (PubkeyPosted, CiphertextReceived, CiphertextQueued, Consumed, DecryptedPendingConfirmation, DecryptFailed, Expired) plus remote_state field on NodePendingCredential. For multi-node fan-out, add FanOutNodeState subdocument with per-node crypto + state. P1 — Offline handling contradicts sync POST (Codex): Error code 8010 said "queued" but sync section said 503. Resolved by supporting both: node online = sync wait (200/4xx/504 timeout); node offline at POST time = store ciphertext with 15-min queue TTL, return 202 with remote_state CiphertextQueued, forward on reconnect. The two paths are distinguished because the server knows node connectivity before attempting the WSS send. P1 — T1 + auto-accept tension (Opencode): Auto-accept means admin doesn't pause to verify fingerprints. Added security tradeoff note and recommendation for strict orgs. P2 — Feature flag visibility (Codex): Frontend checked crypto_v1 but couldn't tell if enable_remote_credential_injection was actually true. Now the detection section specifies three UI states: crypto-capable + enabled, crypto-capable + disabled (with enablement hint), and legacy. P2 — G6 + Phase 4.5 overclaim (Codex): G6 now says "Detection of malicious code-substitution... assuming admin independently verifies." Phase 4.5 intro aligned to match. P2 — 202 response novelty note (Opencode): Added RFC 7231 reference + explicit frontend handling note. P2 — Auto-accept migration warning (Opencode): One-time banner UX when feature first enabled. P3 — Primitives table encoding (Codex): "raw bytes" → "base64url" for ciphertext in JSON transports. P3 — WSS classification frame name (Codex): Singular pending_credential_available → plural pending_credentials_available. P3 — Failure table pubkey-awaiting (Codex): "legacy node" → "crypto_v1 node, async pubkey delay" with correct error code 8009 and polling semantics. P3 — Queue TTL vs metadata TTL (Opencode): Added one-sentence clarification of independence. P3 — Phase 3 effort estimate (Opencode): Bumped from 3d/4h to 4d/6h given expanded push form scope. Plus 4 new tests: RemoteCryptoState enum serde roundtrip, offline node returns 202 queued, queued ciphertext forwarded on reconnect, queued ciphertext expired after TTL. ADR is now 601 lines. Grok's LGTM + Codex/Opencode's corrections together give strong cross-model confidence in the design.
The CLI should support the entire remote credential injection flow
natively — not just "push intent then switch to browser." The admin's
locally-installed binary is immune to T1 code-substitution by design
(no NyxID-served JS to substitute), making it the strongest trust
anchor for security-conscious orgs.
New `nyxid node-credential inject` subcommand:
- Interactive mode: push + poll pubkey + prompt secret + encrypt +
post ciphertext + wait for result — one command, full flow
- Non-interactive mode: --secret-env reads from env var for CI /
cron / rotation scripts
- Org support: --org flag follows existing UUID/slug/name convention
- Fallback: existing `push` command now prints BOTH browser URL and
`inject --pending <id>` CLI command
Shared crypto crate: Phase 2 (node decrypt) and Phase 5 (CLI encrypt)
use the same x25519-dalek + chacha20poly1305 + hkdf + sha2 stack.
Extract into nyxid-crypto workspace crate (~400 LOC) following the
existing nyxid-cloud-auth precedent. Exposes encrypt() and decrypt()
with AAD context binding. Browser @noble/* and Rust crate must produce
identical output — interop fixtures verify this.
Phase 5 effort bumped from 2d/3h to 3d/4h to reflect the full inject
command + shared crate extraction. Test strategy adds 10 new tests:
5 for the shared crate, 5 for the inject command, plus 4 CLI-specific
e2e tests covering interactive, --secret-env, --org, and the push
fallback that prints both paths.
Security note: for orgs that care about T1, the CLI path is the
recommended approach — no SRI/fingerprint verification needed because
the binary itself is the trust anchor.
When an admin works alongside an AI coding agent (Claude Code, Codex, OpenClaw, etc.), the agent has full terminal visibility. A masked "Enter secret value:" prompt in interactive mode would still expose the secret through the terminal session the agent can read. New --browser flag on nyxid node-credential inject: 1. CLI creates the pending credential via the API 2. Opens the default browser to the standalone accept page 3. Admin enters the secret in the browser (agent cannot see it) 4. Browser encrypts + submits (Phase 3 e2e crypto) 5. CLI polls pending state until consumed/failed/expired 6. CLI prints result — secret never appears in terminal transcript Follows the existing CLI wizard pattern from CLI_WIZARD_V3.md, already used for OAuth device-code flows. The inject command now has three modes: - Interactive (default): terminal prompt, direct human use - --secret-env VAR: env var, CI / automation - --browser: wizard, AI-agent-safe The push fallback now prints all three continuation paths (browser URL, inject --pending, inject --pending --browser) so the admin can pick. Added 3 CLI tests (wizard opens URL + polls, secret not in transcript, push prints all paths) and 2 e2e tests (browser wizard flow, push prints all paths).
Round 3 review by Codex (gpt-5.5), Opencode (glm-5.1), and Grok
(grok-build). Grok gave LGTM again after delta review. Codex found
1 P1 + 2 P2 + 3 P3. Opencode found 2 P2 + 2 P3. All 9 addressed:
P1 — CLI T1 claim overstated (Codex):
CLI is immune to code-substitution but NOT to data-substitution.
The node pubkey fetched from NyxID can be replaced by a compromised
server (classic MITM on the key exchange). AEAD AAD binds metadata
but not the pubkey's origin. Downgrade the claim to "immune to T1
code-substitution; still vulnerable to T1 data-substitution like
the browser path." Add an operational note about out-of-band pubkey
fingerprint verification and a future-work pointer to node-signed
pubkeys via TOFU-pinned Ed25519 identity.
P2 — AAD too narrow (Codex):
AAD now includes injection_method, field_name, and target_url in
addition to the existing node_id, pending_id, slug, version. T5
threat row updated to reflect metadata-tampering protection. A
malicious relay that alters where the secret goes after submission
now causes an AEAD tag failure on the node.
P2 — PartialDecrypted missing from enum (Codex):
Added to RemoteCryptoState for the multi-node fan-out aggregate
state. Per-node states are in FanOutNodeState; the logical pending
now has an explicit state for "some nodes accepted, some failed."
P2 — 504 CiphertextReceived polling guidance (Opencode):
After a sync-wait timeout (504), the admin should poll with
exponential backoff watching for state transitions. Surface a
"node may be unresponsive" warning after 2 minutes.
P2 — CiphertextQueued in state diagram (Opencode):
State transition diagram now shows the offline → queued → reconnect
→ consumed/expired branch alongside the online path.
P3 — Consumed state tightened (Codex):
Comment now says "terminal success: decrypted AND stored." The
intermediate operator-confirm state is DecryptedPendingConfirmation,
not Consumed. Prevents frontend/cleanup confusion.
P3 — Interop test wording (Codex):
"identical ciphertext" → "mutually decryptable." With random nonces
identical plaintext will not produce identical ciphertext (by
design). Interop fixtures use fixed test vectors (pinned keys +
nonce) for byte-level verification.
P3 — node config set command (Codex):
The ADR referenced a `nyxid node config set` CLI subcommand that
doesn't exist. Updated to point at the web UI node settings page
and PATCH /api/v1/nodes/{id} as the enablement paths, with a note
that Phase 1 should add the CLI subcommand or the API + web UI are
the only paths.
P3 — Feature-gate decrypt() in nyxid-crypto (Opencode):
Backend depends on nyxid-crypto for CiphertextEnvelope + validation
only. decrypt() is behind cfg(feature = "decrypt") so the backend
binary cannot call it. Only nyxid-cli enables the feature.
…routes
Round 4 cross-model review (Codex + Opencode + Grok). Grok LGTM again.
Codex and Opencode found 6 text-consistency items (no protocol changes):
1. Top-level context + G2 now explicitly scope the confidentiality
claim to T2 (passive compromise) and reference the T1 pubkey-MITM
caveat. No more "strictly preserves" language without qualification.
2. T1 data-substitution caveat now lists the full 7-field AAD (was
listing only 4 — a stale reference from before the AAD expansion).
3. New section documenting why HKDF info (4 fields) and AEAD AAD
(7 fields) intentionally differ: HKDF for domain separation, AAD
for integrity. Includes the canonical wire format (length-prefixed
u16_be fields) and the Option<String> sentinel convention (None →
empty string → u16_be(0)) to prevent downgrade attacks.
4. require_operator_confirm_for_remote reframed as separation-of-duties
control, not T1 confidentiality defense. The gate runs after secret
submission — too late to prevent pubkey-MITM exfiltration.
5. API routes unified: CLI section now uses
/nodes/{node_id}/credentials/pending/{pending_id} (matching the
sequence diagram and existing routes.rs at lines 545-546), not the
shorter /credentials/pending/{id} which doesn't exist.
6. PartialDecrypted documented as fan-out aggregate-only state with
explicit MUST NOT appear note for FanOutNodeState.remote_state.
State diagram note added.
…ended in v1' Codex repeatedly flagged (rounds 3-5) that the threat table presented T1 and T3 as "defended" while the protocol text admitted the pubkey- substitution MITM defeats both. Opencode and Grok said ship, but Codex's structural point is valid: a security reviewer reads the threat table first, and "defended" there contradicts the caveats deeper in the doc. Split the table into two sections: "Adversaries defended in v1" — T2, T4, T5, T6, T7, T8. These are fully defended by the e2e crypto, AAD binding, atomicity, forward secrecy, rate limiting, and sealed privkey persistence. T2 (passive read) is now explicitly marked as the primary confidentiality guarantee. "Acknowledged threats — NOT defended in v1" — T1 and T3. Both share the same key-exchange MITM weakness. Each row now explains WHY it's not defended (pubkey substitution), what mitigation exists TODAY (out-of-band fingerprint verification), and what the future fix is (node-signed ephemeral pubkeys via TOFU-pinned Ed25519 identity). G6 narrowed from "detection of malicious code-substitution" to explicitly "JS code-substitution only, not pubkey substitution."
📊 当前状态 — ADR 独立安全审计已派(不需要人介入)
🤖 controller status banner ⟦AI:AUTO-LOOP⟧ |
🤖 ADR #822 Remote Credential Injection 独立安全/设计审计TL;DR
原始审计 artifact(English)# ADR PR #822 Remote Credential Injection - Independent Security/Design Audit
Audit date: 2026-06-04
Scope: ADR from `git show origin/docs/issue-773-remote-credential-injection:docs/REMOTE_CREDENTIAL_INJECTION.md`, issue #773, PR #822, and current NyxID checkout.
Mode: read-only source audit; no source code changes.
## Executive Verdict
The ADR is directionally sound for its main confidentiality invariant under the stated passive-compromise model: the remote path places encryption in the admin browser/CLI and decryption on the node, while the server stores/forwards `admin_pubkey`, `nonce`, and ciphertext only (`ADR §"Goals"` lines 23-27, `ADR §"Data model"` lines 181-203, `ADR §"Shared crypto crate"` line 620). The document is also materially more honest than earlier versions: T1 and T3 are explicitly moved to "NOT defended in v1" because the server can substitute the node pubkey (`ADR §"Threat model"` lines 51-56, `ADR §"T1 data-substitution caveat"` line 97).
I found no design path where NyxID intentionally receives plaintext or derives the shared key under T2, assuming the backend keeps the `decrypt` feature disabled as specified (`ADR §"Shared crypto crate"` line 620). The current code also supports the baseline claim that today's pending credential flow stores metadata only: `NodePendingCredential` has no secret field (`backend/src/models/node_pending_credential.rs:26-47`), admin push logs only metadata (`backend/src/handlers/node_admin.rs:589-599`), and node consume/decline audit events are metadata-only (`backend/src/handlers/node_agent.rs:93-107`, `backend/src/handlers/node_agent.rs:127-145`).
However, I do not consider the ADR fully implementation-ready until the offline queue TTL is represented in the data model. The ADR requires a 15-minute ciphertext queue TTL independent from `NodePendingCredential.expires_at`, but the proposed model has no single-node queued-at / queued-expires-at field to enforce that requirement (`ADR §"Data model"` lines 181-255, `ADR §"Sync vs async response semantics"` line 377, `ADR §"Design Review Feedback & Best Practices"` lines 707-709).
## Findings Table
| ID | Severity | Finding | Evidence | Recommendation |
|---|---|---|---|---|
| F1 | blocking | Offline ciphertext queue TTL is required but not representable in the proposed single-node data model. The ADR says queued ciphertext uses a shortened 15-minute TTL independent from metadata `expires_at`, but `CryptoBundle` has only `version`, pubkeys, nonce, and ciphertext, while `NodePendingCredential` only adds `crypto`, `remote_state`, and optional `fan_out_nodes`. The current model likewise has no `updated_at` or queue-expiry field. Without a persisted queue timestamp/expiry, a sweeper cannot distinguish "queued 15 minutes ago" from "pending metadata still valid for 1 hour." | `ADR §"Data model"` lines 181-203 and 237-255; `ADR §"Sync vs async response semantics"` line 377; `ADR §"Design Review Feedback & Best Practices"` lines 707-709; `backend/src/models/node_pending_credential.rs:26-47` | Add explicit queue lifecycle fields, e.g. `ciphertext_queued_at` and `ciphertext_expires_at` on single-node pending credentials and per fan-out node, using the required BSON chrono serde helpers. Define the cleanup query/index and reconnect-forward rule against those fields. |
| F2 | high | Multi-node fan-out target selection is underspecified and conflicts with current `NodeRoute` semantics if implementers reuse `fallback_node_ids`. The ADR says fan-out applies to services with `fallback_node_ids`, but the current routing service only returns viable online/WS-connected nodes as primary/fallbacks. That is correct for proxy failover, but wrong as a credential fan-out source because offline nodes are exactly the nodes that need queued ciphertext. | `ADR §"Multi-node fan-out"` lines 349-359; `backend/src/services/node_routing_service.rs:21-24`, `backend/src/services/node_routing_service.rs:154-165`, `backend/src/services/node_routing_service.rs:129-149` | Specify a separate fan-out target resolver: all intended active node bindings for the service/owner, with per-node ACL checks, not the viability-filtered `NodeRoute`. Then apply online sync delivery or offline queue per target. |
| F3 | medium | T7 overstates pending ID entropy. The ADR says pending IDs have approximately 256 bits of entropy, but current pending IDs are UUID v4 strings generated by `Uuid::new_v4()`, which provide about 122 random bits. That is still practically unguessable, but the ADR's evidence is inaccurate. | `ADR §"Threat model"` line 48; `backend/src/services/node_pending_credential_service.rs:60-62` | Correct the ADR to "~122 bits via UUID v4" or explicitly switch RCI pending IDs to 32-byte random identifiers if 256-bit entropy is a hard requirement. |
| F4 | medium | Base64url encoding is not pinned to padded vs unpadded form. The ADR says "base64url" for keys, nonce, and ciphertext, while the current codebase uses both URL-safe no-padding encodings for JWT-like material and standard base64 for WSS binary payloads. Interop fixtures will catch this late, but implementers need the canonical rule up front. | `ADR §"Cryptographic primitives"` line 170; `ADR §"Data model"` lines 196-200; `cli/src/auth.rs:7`, `cli/src/auth.rs:47`; `backend/src/handlers/node_ws.rs:149`; `cli/src/node/ws_client.rs:1088` | Specify `base64::engine::general_purpose::URL_SAFE_NO_PAD` for all RCI JSON fields, or specify an accepting decoder plus canonical encoder. Include exact decoded lengths: 32-byte pubkeys, 24-byte nonce, `<= 16 * 1024` ciphertext+tag. |
| F5 | medium | Crypto context canonicalization needs to be centralized. The ADR binds `injection_method` in AAD but does not define the canonical string source. Current pending credential APIs use kebab-case (`query-param`, `path-prefix`), while the live WSS `credential_update` apply path uses underscore tokens (`query_param`, `path_prefix`). The remote path can be correct, but only if all encrypt/decrypt sides use the same canonical pending-credential tokens. | `ADR §"HKDF info vs AEAD AAD"` lines 174-177; `backend/src/models/node_pending_credential.rs:16-22`; `cli/src/cli.rs:1555-1561`; `cli/src/node/ws_client.rs:3007-3010`, `cli/src/node/ws_client.rs:3036-3037`, `cli/src/node/ws_client.rs:3073-3074` | Put `RciAadContext` / `RciKdfInfo` builders in the shared crypto crate or a shared protocol module. Use explicit static domain labels such as `nyxid:rci:v1:kdf` and `nyxid:rci:v1:aad`, and define `injection_method` as the pending API's kebab-case token or an enum discriminant. |
| F6 | medium | The 16 KB ciphertext cap is specified for POST, but not explicitly for WSS forward, queued replay, or fan-out aggregate bodies. Current WSS writer paths only serialize and `try_send`; the ADR's tests only name the HTTP over-size rejection. A future bug or queue replay path could forward an oversized stored blob unless the same constant is checked at every boundary. | `ADR §"Data model"` line 194; `ADR §"Flow"` line 319; `ADR §"Error codes"` line 494; `ADR §"Test Strategy"` line 545; `backend/src/services/node_ws_manager.rs:1537-1540`, `backend/src/services/node_ws_manager.rs:1583-1591` | Define one `MAX_CIPHERTEXT_SIZE` and require checks on HTTP decode, MongoDB update, WSS frame construction, queued replay, and each fan-out element. Add tests for queued replay and fan-out total/per-node limits. |
| F7 | medium | Cancel/decline/expire key eviction is required but the control-plane message is not specified. The ADR requires sealed privkey eviction on consume/decline/cancel/expire and mentions missed cancel WSS frames, but the current admin cancel endpoint only deactivates MongoDB state and writes audit metadata; it does not notify the node. | `ADR §"Test Strategy"` lines 514-517; `ADR §"Design Review Feedback & Best Practices"` lines 699-701; `backend/src/handlers/node_admin.rs:637-664`; `backend/src/services/node_pending_credential_service.rs:127-150` | Add explicit node-control frames such as `pending_credential_canceled` / `pending_credential_expired`, or specify a polling diff protocol that causes deletion. Keep local sweep as a fallback, not the only cancel cleanup path. |
| F8 | medium | The operator-confirm path creates a plaintext-on-node memory queue, but the ADR only says "volatile ready-to-accept queue with TTL" and does not define TTL length, cancellation behavior, or zeroization. This is on the trusted node, not on NyxID, so it does not break the core server invariant; it is still key lifecycle surface that should be deterministic. | `ADR §"Flow"` lines 332-335; `ADR §"Accept gate"` lines 418-420; current node secret handling uses explicit secret backends at `cli/src/node/secret_backend.rs:121-152` | Specify the queue TTL, upper-bound it by `NodePendingCredential.expires_at`, use `Zeroizing` storage for plaintext buffers, and state that cancel/decline/expire/drop paths zeroize the queue entry before removal. |
| F9 | low | The SRI section has one self-contained overstatement. It says NyxID cannot silently substitute JS without changing the HTML's SRI attribute loaded over TLS. Under the ADR's own T1 model, a fully compromised server can also change that HTML; detection depends on the separate signed manifest and out-of-band admin verification. Other sections say this correctly, so this is a wording fix, not a protocol flaw. | `ADR §"Code-integrity infrastructure"` line 430; `ADR §"SRI hashes on crypto JS bundles"` line 443; `ADR §"Signed release channel"` lines 447-452; `ADR §"Admin verification UX"` lines 456-461 | Qualify the SRI paragraph: "SRI blocks stale/tampered bundles unless the HTML is also substituted; T1 detection requires comparing the HTML/bundle hash to the signed release manifest at the separate origin." |
| F10 | nit | Error-code documentation drift exists in both `AGENTS.md` and `CLAUDE.md`, but the ADR only calls out `AGENTS.md`. The code already assigns 8000-8005, while both docs still say 8000-8003. | `ADR §"Error codes"` lines 488-499; `backend/src/errors/mod.rs:374-379`; `AGENTS.md:85`; `CLAUDE.md:85` | Update both docs or make one explicitly generated/authoritative. The ADR's 8006-8011 allocation itself does not collide with current code. |
## Core Invariant Review
Under T2/passive compromise, the ADR's plaintext boundary is credible. The browser/CLI encrypts before POST (`ADR §"Browser-based push"` lines 85-91, `ADR §"CLI path"` lines 107-114), the server stores only the `CryptoBundle` envelope (`ADR §"Data model"` lines 181-203), and the backend is explicitly forbidden from enabling `decrypt` in the shared crate (`ADR §"Shared crypto crate"` line 620). Offline queuing stores ciphertext server-side, not plaintext (`ADR §"Sync vs async response semantics"` line 377).
Under T1/active compromise, the ADR is now honest: the server can replace the node pubkey and MITM the exchange (`ADR §"Threat model"` lines 51-56, `ADR §"T1 data-substitution caveat"` line 97). Phase 4.5 detects JS substitution only when the admin verifies the signed manifest out-of-band (`ADR §"Goals"` line 27, `ADR §"Code-integrity infrastructure"` line 430). That is a correct v1 boundary.
I found no current code path that contradicts the legacy invariant. The existing admin `node-credential push` posts metadata only (`cli/src/commands/node_credential.rs:23-32`) and prints "Do not send the secret value" (`cli/src/commands/node_credential.rs:48-54`). The node-side accept command prompts/stores locally, then marks consumed through the node-agent API (`cli/src/node/agent.rs:466-505`). The existing model stores pending metadata only (`backend/src/models/node_pending_credential.rs:26-47`).
## Crypto Envelope Review
The primitive choices are reasonable: X25519, HKDF-SHA256, and XChaCha20-Poly1305 are appropriate for an ephemeral ECDH envelope (`ADR §"Cryptographic primitives"` lines 162-170). Random 24-byte nonces for XChaCha20-Poly1305 are suitable if generated per encryption and covered by interop/unit tests (`ADR §"Test Strategy"` lines 519-527).
The main crypto-design improvements are not primitive changes; they are canonicalization controls. The KDF/AAD encoding should include fixed protocol labels and be generated by a shared builder, not handwritten in frontend, CLI, and node code (`ADR §"HKDF info vs AEAD AAD"` lines 174-177). Base64url must be pinned to padded or unpadded form (`ADR §"Cryptographic primitives"` line 170). These are medium-severity interop and future-proofing findings, not reasons to abandon the design.
## Race, DoS, And Queue Review
The first-push-wins plan is correct in shape: `find_one_and_update` with a null-ciphertext guard is the right MongoDB primitive (`ADR §"Race protection"` line 363). The strict sync ACK requirement also matches existing code: `send_credential_update_and_wait` allocates a request ID, registers a oneshot waiter, sends over WSS, and maps timeout/drop/negative ACK to node-failure errors (`backend/src/services/node_ws_manager.rs:1551-1633`).
The DoS/resource gaps are in queue representation and cap placement. The 15-minute queue TTL is not representable in the proposed single-node model (F1). The 16 KB cap should be repeated at all ingress and egress boundaries, not only HTTP POST (F6). The per-pending in-memory rate limiter is acceptable as a local throttle, but in multi-instance deployments it is not a global guarantee; the current node agent already acknowledges multi-instance WSS worker split behavior (`cli/src/node/ws_client.rs:99-101`). If NyxID deploys multiple backend workers, Phase 1 should document whether the rate limiter is best-effort per worker or move it to MongoDB/Redis.
## Backward Compatibility Review
The ADR preserves legacy `crypto: None` behavior (`ADR §"Backward compatibility & version detection"` lines 383-403, `ADR §"Accept gate"` lines 416-420). That aligns with the current split between laptop-side `nyxid node-credential push` and node-side `nyxid node credentials accept`: user-side push creates metadata only (`cli/src/commands/node_credential.rs:23-32`), node-side accept prompts for the secret and stores locally (`cli/src/node/agent.rs:466-505`, `cli/src/node/agent.rs:1331-1363`). I did not find a forced migration path in the ADR.
## Error-Code Review
The ADR's 8006-8011 range is non-overlapping with current code. Current node errors occupy 8000-8005 (`backend/src/errors/mod.rs:374-379`), and the ADR correctly reserves the next six slots (`ADR §"Error codes"` lines 486-497). The doc cleanup should update both `AGENTS.md:85` and `CLAUDE.md:85`, not only AGENTS.
## Recommended Gate Before Implementation
Fix F1 before treating the ADR as implementation-ready. F2 should be resolved before Phase 3.5 starts, but it does not block Phase 1 single-node stubs if the fan-out endpoint is explicitly deferred. F3-F8 should be folded into Phase 1/2 acceptance criteria because they are cheapest to fix at protocol-boundary time.⟦AI:AUTO-LOOP⟧ |
…ases 1–6 rollup) (#895) * feat(backend): #773 Phase 1 远程凭证注入后端协议骨架 (#873) 错误码 8006-8011 + CryptoBundle/RemoteCryptoState + 离线队列生命周期字段(审计 F1)+ redacting Debug(CLAUDE §5)+ 两个 partial 索引/sweep + node-error 真值同步。design-consensus #872 + 独立安全审计 #822 + 3-reviewer 双轮。Refs #773 * feat(crypto): #773 Phase 2 node 端 RCI crypto + nyxid-crypto crate (#877) CLI/node-owned nyxid-crypto(X25519/HKDF/XChaCha20)+ node sealed-key 生命周期 + CI dependency-graph invariant 证明 backend 零 crypto 依赖(backend LOC=0)。design-consensus #874(r4)+ 独立 verify + 2 轮 3-reviewer。backend WS 路由见 Phase 2.5 #876。Refs #773 * feat(backend): #773 Phase 2.5 RCI backend WS 帧路由(非 crypto) (#879) existing-service glue;backend 零 crypto 依赖(CI-enforced);async 202;GET pubkey/POST ciphertext/WS frame/reconnect drain;删 DecryptedPendingConfirmation;ACL 负向 + 行为测试。design-consensus #876 r6(supervisor 覆盖误判 drop)。Refs #773 * feat(audit): #773 Phase 4 — RCI metadata-only audit + observability (#881) Typed RciAudit* interface; metadata-only audit for all 18 node_credential_rci_* emission branches with read-back coverage; single error-code taxonomy source; backend stays zero-crypto (boundary gate passes); legacy crypto:None unchanged. Closes #878. Reviews: architect comment, tests approve, quality comment (advisory). ⟦AI:AUTO-LOOP⟧ * feat(frontend): #773 Phase 3 — browser WebCrypto RCI push + accept (#882) Browser @noble E2E-encrypted remote credential injection: lib/crypto (X25519+HKDF+XChaCha20-Poly1305, base64url no-pad), metadata-only RHF+Zod push forms (remote_crypto literal-true), independent credential-accept page (uncontrolled secret, browser-only encrypt, posts only CiphertextEnvelope), interop fixture mutual-decryptability. return_to open-redirect hardened via shared trusted-origin guard. Backend + nyxid-crypto zero diff. Closes #880. Reviews: architect approve (security re-review) / tests approve / quality approve. ⟦AI:AUTO-LOOP⟧ * feat(node): #773 Phase 3.5 — multi-node credential fan-out (#884) Embedded fan_out_nodes + revision-guard atomic N-ciphertext submit; separate offline-aware CredentialFanOutTargetResolver with per-node ACL (no NodeRoute coupling); all-N/partial_decrypted/retry-failed/partial->expired; F6 per-element+aggregate+MAX_FAN_OUT_TARGETS caps; metadata-only fan-out audit via typed rci_audit_service; route-level success + boundary coverage. Backend zero-crypto (boundary exit 0); single-node Phase 3 + legacy crypto:None byte-unchanged. Closes #883. Reviews: architect approve / tests approve / quality comment. ⟦AI:AUTO-LOOP⟧ * feat(security): #773 Phase 4.5 — code-integrity (standalone accept page + SRI + admin verify) (#887) Standalone hardened credential-accept page via zero-crypto backend handler (strict CSP + SHA-384 SRI, replaces Phase 3 SPA accept); admin fingerprint-verify UX; releases.json manifest + byte-identical browser/Node fingerprint canonicalization; vite plugin preserves data-nyx-integrity-role + backend 503 fail-closed; release workflow disabled=>warn / enabled+complete=>sign / enabled+incomplete=>fail-fast (no fabricated infra); docs/RELEASE_INTEGRITY.md 6-item runbook + reachability; service-layer policy; Zod-validated runtime-config. Backend zero-crypto; single-node+fan-out(retry)+legacy crypto:None byte-unchanged; integrity_verification audit-only. Closes #885. Reviews: architect approve / tests approve / quality comment. ⟦AI:AUTO-LOOP⟧ * feat(cli): #773 Phase 5 — CLI inject (nyxid node-credential inject) (#891) Admin-side full-e2e remote credential injection via the local nyxid binary (T1 code-substitution-immune): push remote_crypto -> poll pubkey -> encrypt locally (reusing the Phase 2 shared RciContext, byte-identical to node decrypt) -> post ciphertext. interactive/--secret-env/--browser modes; strict single-format pubkey fingerprint helper shared with the node-agent console for out-of-band verification; Zeroizing secret never logged; --browser reads no secret; inject --pending metadata-only-inits legacy pendings (no backend crypto); push default unchanged; fan-out inject deferred. Backend stays ZERO crypto (boundary exit 0). Closes #889. Reviews: architect approve / tests approve / quality approve. ⟦AI:AUTO-LOOP⟧ * docs(cli): #773 Phase 6 — RCI discovery hint rewrites (#894) Text-only CLI hint rewrites in node-credential + service commands so agents/admins discover the full RCI flow (push metadata-only -> inject/--browser/--secret-env/accept page, out-of-band --verify-fingerprint, org opt-out, error codes 8006-8011) with honest detection-only/T1 framing. Shared RciCliHintLines renderer; tests assert exact merged commands + reject phantom flags. No behavior/flag/crypto change; backend/nyxid-crypto/frontend untouched; boundary exit 0. Closes #892. Reviews: architect approve / tests approve / quality comment. FINAL phase of the #773 remote-credential-injection feature. ⟦AI:AUTO-LOOP⟧ * fix(rci): resolve clippy lints under current stable for #773 rollup (#895) CI runs clippy against unpinned latest-stable (rustc 1.96); these lints were clean under the older stable the phase PRs ran with. All in RCI code: - node_credential.rs: collapse nested if-let via .filter() (production poll loop) - credential_accept.rs: let-else -> ? in SRI script-tag parser (production) - node_credential.rs tests: #[allow(clippy::await_holding_lock)] on 4 env_lock single-thread tests (env mutation serialized; held across await by design) - node_pending_credential_service.rs tests: &[x.clone()] -> std::slice::from_ref(&x) Verified: cargo +stable clippy --workspace --all-targets -- -D warnings (1.96) clean; backend zero-crypto boundary unaffected (test/lint-only + 2 trivial prod rewrites). * style(rci): cargo fmt the slice::from_ref test call wrapping (#895) Pure rustfmt formatting of the 3 insert_fan_out_pending test calls touched by the prior clippy fix; no logic change. * docs(skill): #773 document remote credential injection in nyxid skill (#895) The 8 RCI phases shipped `nyxid node-credential inject` (browser/laptop admin-side end-to-end encryption) but never updated the agent-facing skill. - references/nodes.md: new 'Remote credential injection' subsection — inject (interactive/--secret-env/--browser), out-of-band fingerprint verification, secret-never-on-argv, browser accept page (SRI/signed manifest; detection-not-prevention), web-UI multi-node fan-out, and error codes 8006-8011. - SKILL.md: add inject / remote-injection triggers to the nodes reference map row; add an end-to-end-encryption bullet to Security and Privacy. Docs only; no version bump (release process owns skill versioning).
Summary
Design document for #773 — letting org admins supply secrets to node-managed services from any device with a browser, with no SSH-to-node required, while strictly preserving the existing "secret never on NyxID server" invariant.
Status: proposed, eng-reviewed via
/plan-eng-reviewwith 9 issues raised and resolved, 0 unresolved, 0 critical gaps.What the ADR specifies
Browser-driven end-to-end encrypted relay:
(node_id, pending_credential_id, slug, version)binding contextfallback_node_ids— same protocol per node, all-N semantics, per-node partial-state breakdownEng review summary
5 architecture issues + 2 code quality + 2 performance, all resolved. Highlights of the resolutions baked into the ADR:
supported_featuresadvertisement on WSS handshake (definitive feature detection, not timeout polling)CryptoBundletyped sub-document onNodePendingCredentialinstead of four loose fieldsSee the doc's "GSTACK REVIEW REPORT" section at the end for the full record.
Implementation phases
accept-remotesubcommand for headless rotation)service show/node-credential pushWorktree parallelization mapped in the doc — Phase 1 alone first, then 2 + 3 + 4 in parallel lanes, then 3.5, then 4.5 + 5 + 6 in parallel.
What's NOT in scope (for the implementation that follows this ADR)
Test plan for this PR
This PR is a single Markdown file. There's no code to test — but the ADR itself is the contract for everything that ships under #773. Reviewer's job: read the doc end-to-end, push back on anything that looks wrong, approve the protocol shape so implementation can start.
Related
nyxid node credentialsexists on addon-side runtime, not user-side CLI #775 (PR fix(cli): clarify node credential hints (#775) #805), [Feature] node-credential push: allow internal target_url for node-side reachability #770 (PR fix(backend): allow node-local credential target urls (#770) #815), [Feature] Surface stale-node binding when proxy returns node_offline #771 (PR fix(services): surface stale node binding in service show/list (#771) #819), [UX] Explain credential_type=node_managed inline in service show / list #772 + [Feature] node show should expose connection source IP / last-seen-from for admin debugging #774 (PR feat(cli): inline hints for service show + node show (#772, #774) #821)