docs(adr): remote credential injection design for #773 (eng-reviewed) by kaiweijw · Pull Request #822 · ChronoAIProject/NyxID

kaiweijw · 2026-05-22T07:01:32Z

Summary

Design document for #773 — letting org admins supply secrets to node-managed services from any device with a browser, with no SSH-to-node required, while strictly preserving the existing "secret never on NyxID server" invariant.

Status: proposed, eng-reviewed via /plan-eng-review with 9 issues raised and resolved, 0 unresolved, 0 critical gaps.

What the ADR specifies

Browser-driven end-to-end encrypted relay:

X25519 ephemeral keypairs per pending credential (node-side persisted, sealed by long-lived auth key — survives restarts, dropped on consume)
HKDF-SHA256 key derivation with (node_id, pending_credential_id, slug, version) binding context
XChaCha20-Poly1305 AEAD with AAD binding to prevent cross-credential replay
NyxID is strictly a passthrough relay — never sees plaintext, never derives the shared key
Frontend code-integrity (Phase 4.5): SRI hashes on crypto bundles, signed release channel at a separate domain, admin verification UX with fingerprint check
Multi-node fan-out (Phase 3.5) for services with fallback_node_ids — same protocol per node, all-N semantics, per-node partial-state breakdown
Manual-accept gate is the default — preserves today's separation-of-duties; auto-accept is a second opt-in

Eng review summary

5 architecture issues + 2 code quality + 2 performance, all resolved. Highlights of the resolutions baked into the ADR:

Persistent sealed privkey on node (survives restart) instead of in-memory-only
T1 ("malicious server") kept broad with Phase 4.5 code-integrity controls instead of narrowing the threat model
Per-node opt-in for the new flow; manual-accept default
16 KB ciphertext cap rejected at both HTTP layer and WSS forward
supported_features advertisement on WSS handshake (definitive feature detection, not timeout polling)
CryptoBundle typed sub-document on NodePendingCredential instead of four loose fields
Reserved error codes 8004–8009
First-push-wins atomicity + per-pending rate limit (1 successful / 3 failed / 60s window / 5min lockout)
Explicit Test Strategy section: 25 tests enumerated by layer, including cross-language interop fixtures
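
The per-pending rate limit listed above (1 successful push, 3 failed within 60s, 5min lockout) can be sketched as a small state machine. This is only an illustration under assumptions: the type names `PendingRateLimiter`, `RateLimitState`, and `Decision` are hypothetical, a plain `HashMap` stands in for whatever concurrent map the implementation uses, and the caller passes `now` explicitly so the logic is testable.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const MAX_FAILED: usize = 3; // failed pushes tolerated per window
const WINDOW: Duration = Duration::from_secs(60);
const LOCKOUT: Duration = Duration::from_secs(5 * 60);

#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow,
    AlreadyConsumed, // first-push-wins: one successful push ends the flow
    LockedOut,
}

#[derive(Default)]
struct RateLimitState {
    consumed: bool,
    failures: Vec<Instant>,
    locked_until: Option<Instant>,
}

/// Hypothetical in-memory limiter keyed by pending credential id.
#[derive(Default)]
pub struct PendingRateLimiter {
    states: HashMap<String, RateLimitState>,
}

impl PendingRateLimiter {
    /// Decide whether a push for `pending_id` may proceed at time `now`.
    pub fn check(&mut self, pending_id: &str, now: Instant) -> Decision {
        let st = self.states.entry(pending_id.to_string()).or_default();
        if st.consumed {
            return Decision::AlreadyConsumed;
        }
        match st.locked_until {
            Some(until) if now < until => Decision::LockedOut,
            _ => Decision::Allow,
        }
    }

    /// Record the outcome of a push attempt.
    pub fn record(&mut self, pending_id: &str, success: bool, now: Instant) {
        let st = self.states.entry(pending_id.to_string()).or_default();
        if success {
            st.consumed = true;
            return;
        }
        // Keep only failures inside the sliding window, then add this one.
        st.failures.retain(|t| now.duration_since(*t) < WINDOW);
        st.failures.push(now);
        if st.failures.len() >= MAX_FAILED {
            st.locked_until = Some(now + LOCKOUT);
            st.failures.clear();
        }
    }
}
```

A handler would call `check` before forwarding a ciphertext and `record` once the node acknowledges or rejects it.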

See the doc's "GSTACK REVIEW REPORT" section at the end for the full record.

Implementation phases

Phase	Scope	Effort (human / CC)
1	Backend protocol stubs + error codes	1d / 2h
2	Node crypto + sealed privkey persistence	3d / 4h
3	Frontend UI + browser crypto	3d / 4h
3.5	Multi-node fan-out (per-node ephemeral pubkeys, all-N semantics)	2d / 4h
4	Audit + observability (metadata-only events)	1d / 2h
4.5	Code-integrity infra: SRI + signed release channel + verification UX	1w / 1d
5	CLI parity (`accept-remote` subcommand for headless rotation)	2d / 3h
6	Hint rewrites in `service show` / `node-credential push`	<1d / 30m

Worktree parallelization mapped in the doc — Phase 1 alone first, then 2 + 3 + 4 in parallel lanes, then 3.5, then 4.5 + 5 + 6 in parallel.

What's NOT in scope (for the implementation that follows this ADR)

HSM-backed node keypairs
Mobile native client (mobile browser works)
Replacing the legacy CLI two-party flow (kept additive)
Verifying decrypted secrets against downstream services pre-accept

Test plan for this PR

This PR is a single Markdown file. There's no code to test — but the ADR itself is the contract for everything that ships under #773. Reviewer's job: read the doc end-to-end, push back on anything that looks wrong, approve the protocol shape so implementation can start.

Threat model T1–T8 is honest and matches what the protocol actually defends
Protocol primitives (X25519, HKDF-SHA256, XChaCha20-Poly1305) are appropriate
Multi-node fan-out semantics make sense for ChronoAI's actual multi-node setup
Phase 4.5 (code-integrity) is realistic given the team's release-pipeline state
Test Strategy enumeration is sufficient for an implementer to start without ambiguity
Open Questions in the doc don't block ADR sign-off (they're implementation-phase questions)

#	Note	Resolution
1	Long-lived sealing key cross-ref	Replaced "crypto/aes.rs style" with concrete reference to `cli/src/node/secret_backend.rs::SecretBackend` and its sibling methods (`store_auth_token`, `store_signing_secret`, `store_credential_value`). No ambiguity for the Phase 2 implementer.
2	WSS frame classification	New §"WSS frame classification" clarifies the new frames are node-control protocol (same class as `node_metrics`, `proxy_request`) and bypass `ws_frame_injections` rules. Implementer points at `node_ws.rs` not `ws_frame_injector.rs`.
3	Standalone credential-accept page	New Phase 4.5 subsection: strict CSP, no main SPA bundle, ~5 KB form shim, page HTML hash itself in the signed manifest.
4	`releases.nyxid.dev` runbook items	New table enumerating 6 infra questions to answer pre-Phase-4.5 (GPG key holder, rotation, hosting, manifest pipeline, signature scheme, expiry). Implementer documents chosen answers in `docs/RELEASE_INTEGRITY.md`.
5	AGENTS.md vs CLAUDE.md	Factual error caught. Error code listing actually lives at `AGENTS.md:85`, not CLAUDE.md. Phase 1 update target corrected.
6	Strip GSTACK REVIEW REPORT	Removed from committed ADR. Eng-review audit trail preserved here in the PR (commit `9e4414b` body + this thread) and in git history. ADR is now pure design content.
7	Browser crypto floor	Sharpened Open Question: sample real admin-population analytics before Phase 3, report `X% native / Y% noble-polyfill`, consider lazy-loading polyfill on miss.

Eng-review audit trail (moved from the ADR per item 6)

The ADR was eng-reviewed via /plan-eng-review on this branch: 9 issues raised, 9 resolved, 0 unresolved, 0 critical gaps. Findings + decisions are captured in the commit history:

9e4414b docs(adr): remote credential injection design for #773 (eng-reviewed) — initial draft with all 9 resolutions baked in
9b132a7 docs(adr): add review feedback and best practices to remote credential injection ADR — reviewer additions (local sweep, fan-out idempotency, queue limits, MFA gate)
d10f7d6 docs(adr): reconcile review-feedback section with phase 3.5 and error codes — internal consistency pass
85b4837 docs(adr): apply Grok review notes to remote credential injection ADR — this commit

Three scope expansions were folded in during review: Phase 3.5 (multi-node fan-out), Phase 4.5 (code-integrity infrastructure including the new standalone-page subsection), and an explicit Test Strategy section. Eng review verdict: CLEAR, ready to implement.

Final ADR is 472 lines. Reviewers: read the file end-to-end and push back on anything that looks wrong before implementation starts.

…s-model audit

Two background agents (codex/gpt-5.5 and opencode/glm-5.1) independently reviewed PR #822 and surfaced findings the /plan-eng-review missed. Cross-model agreement on the P0; the other 13 items are confirmed against actual code.

P0: Error code collision (both agents). The ADR reserved 8004-8009 for remote credential injection but backend/src/errors/mod.rs:374-375 already assigns 8004 to NodeCredentialMissing and 8005 to WsProxyDownstream. Implementing the ADR as written would have either duplicated public error codes or required a breaking renumber of live errors. Shift the RCI range to 8006-8011 with the following assignments:
- 8006 PendingCredentialDecryptFailed
- 8007 PendingCredentialVersionUnsupported
- 8008 PendingCredentialCiphertextTooLarge
- 8009 PendingCredentialPubkeyAwaiting (renamed from PubkeyNotPosted to reflect the polling-with-backoff semantics, not a terminal "not posted" state)
- 8010 PendingCredentialNodeOffline
- 8011 PendingCredentialQueueFull

Test rows and failure mode table updated to reference the new numbers.

P1: T1 threat-model overclaim (codex). G6 and the T1 row claim protection against a fully-compromised NyxID server via Phase 4.5. But the standalone HTML, SRI hashes, fingerprint display, and verify button are all served by NyxID itself. A fully-compromised server can substitute them in lockstep. Downgrade T1 wording to make the operational dependency explicit: Phase 4.5 provides "detection assuming the admin verifies the fingerprint out-of-band against the signed manifest at a separate origin." Without admin verification, T1 degrades to T2 in practice.

P2: supported_features vs per-pending pubkey-readiness contradiction (codex). Section said "No timing guesses, no polling fallback" while error 8007 said "Frontend polls or falls back." Split into two distinct signals: feature detection (sync, cached, via supported_features set on the Node record) and per-pending pubkey readiness (async, polling with exponential backoff up to 30s, error code 8009 PendingCredentialPubkeyAwaiting). Feature detection is fast; polling is bounded to nodes that have already advertised crypto_v1.

P2: /ciphertext sync semantics underspecified (codex). Sequence diagram showed 4xx on node decrypt failure but didn't specify how the HTTP handler waits for the node. Reuse the existing send_credential_update_and_wait pattern at node_ws_manager.rs:1558 (allocate request_id, register oneshot waiter, send WSS frame, await ack with timeout). New section "Sync vs async response semantics" specifies the status-code mapping for consume / decrypt_failed / timeout / offline.

P2: Frame name correction (codex). Sequence diagram used pending_credential_available (singular, with metadata). Actual code at ws_client.rs:433 sends pending_credentials_available (plural, no metadata) as a nudge; node pulls details via existing GET /api/v1/node-agent/pending-credentials. Updated the diagram to reuse the existing nudge/pull pattern with pending_credential_pubkey added as a new WSS frame the node posts per pending credential lacking a sealed privkey.

P2: Ciphertext encoding for JSON WSS (opencode). ciphertext: Vec<u8> had no encoding spec, and JSON WSS frames need base64url. Added explicit doc-comment in the CryptoBundle struct noting the base64url-string-on-the-wire / BSON-binary-in-MongoDB serde-adapter pattern.

P2: supported_features doesn't exist yet (opencode). Phase 1 work; ADR was writing about it as if it already existed. Added "(new field, added in Phase 1)" annotation.

P2: AGENTS.md actual-state correction (opencode). The AGENTS.md bullet "Error codes 8000-8003 are reserved" is itself out of date. ADR now acknowledges this explicitly and tells implementer to update AGENTS.md to reflect 8000-8005 (existing) plus 8006-8011 (new RCI range).

P3 cleanups:
- findAndModify -> find_one_and_update (idiomatic for the Rust driver)
- srl_* -> sri_* typo in test names (3 occurrences)
- ChronoAIProject/issues/ -> ChronoAIProject/NyxID/issues/ in reference links
- Per-pending rate limit mechanism specified (in-memory DashMap<PendingId, RateLimitState>, TTL-ephemeral, distinct from global RATE_LIMIT_PER_SECOND middleware)
- Standalone HTML weight target loosened from ~5 KB to ~30 KB gzipped (more realistic given @noble/curves x25519 alone is ~6 KB minified) with lazy-load fallback recommendation when SubtleCrypto is available natively
- Multi-node fan-out: explicit behavior when expires_at fires during partial_decrypted (logical credential goes to expired, previously-accepted node credentials are NOT rolled back per idempotency principle)

No protocol changes. No threat-model changes (only one wording clarification for T1's operational dependency). All edits are internal-consistency, encoding/transport spec sharpening, and naming/reference corrections.

Calibration note: /plan-eng-review missed the P0 because it trusted the AGENTS.md "8000-8003 reserved" line at face value instead of grepping backend/src/errors/mod.rs directly. Cross-model independent review with code-level verification caught it. Worth adopting as a default check before locking error-code ranges in future ADRs.
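
The bounded polling for pubkey readiness mentioned in this commit message (exponential backoff up to 30s) might be computed along these lines. This is a sketch, not the ADR's specification: the function name `backoff_delay` and the caller-supplied initial delay are hypothetical, and it reads "up to 30s" as a cap on each individual delay, which is an assumption (it could equally be a total polling budget).

```rust
use std::time::Duration;

/// Assumed ceiling for a single polling delay.
const MAX_DELAY: Duration = Duration::from_secs(30);

/// Delay before polling attempt number `attempt` (0-based):
/// `initial * 2^attempt`, saturating at `MAX_DELAY`.
pub fn backoff_delay(attempt: u32, initial: Duration) -> Duration {
    // Guard against overflow for very large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    initial
        .checked_mul(factor)
        .map_or(MAX_DELAY, |d| d.min(MAX_DELAY))
}
```

With a one-second initial delay the schedule is 1s, 2s, 4s, 8s, 16s, then 30s for every later attempt.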

kaiweijw · 2026-05-22T10:04:26Z

Follow-up commit 67fe444 docs(adr): fix P0 error code collision + 13 review findings from cross-model audit — applies findings from an independent cross-model review.

How this got caught

Two background agents (codex/gpt-5.5 xhigh, opencode/glm-5.1) ran review passes against this PR independently. Both surfaced the same P0 plus a long tail of P2/P3 items that my earlier /plan-eng-review missed. Cross-model agreement on the P0 was a strong signal worth acting on immediately.

P0 — Error code collision

The ADR reserved 8004-8009 for remote credential injection errors. Verified against backend/src/errors/mod.rs:374-375:

NodeCredentialMissing => 8004  ← already in use
WsProxyDownstream     => 8005  ← already in use

Both AGENTS.md:85 ("Error codes 8000-8003 are reserved") and this ADR were wrong about the range being free. Shifted the RCI range to 8006-8011 with the same 6 names (also renamed PendingCredentialPubkeyNotPosted → PendingCredentialPubkeyAwaiting to reflect the polling-with-backoff semantics from finding 3). ADR now explicitly documents AGENTS.md's stale state so Phase 1 fixes both.
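
For illustration, the shifted range could be pinned in code roughly as follows. The variant names and numbers are the ones assigned in commit 67fe444; the enum name `RciError`, the `code()` method, and the `EXISTING_NODE_CODES` constant are hypothetical and do not reflect the actual layout of backend/src/errors/mod.rs.

```rust
/// Codes already in use per backend/src/errors/mod.rs:374-375
/// (NodeCredentialMissing = 8004, WsProxyDownstream = 8005).
pub const EXISTING_NODE_CODES: [u16; 2] = [8004, 8005];

/// Hypothetical enum for the remote credential injection (RCI) range.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RciError {
    PendingCredentialDecryptFailed,
    PendingCredentialVersionUnsupported,
    PendingCredentialCiphertextTooLarge,
    PendingCredentialPubkeyAwaiting,
    PendingCredentialNodeOffline,
    PendingCredentialQueueFull,
}

impl RciError {
    pub const ALL: [RciError; 6] = [
        Self::PendingCredentialDecryptFailed,
        Self::PendingCredentialVersionUnsupported,
        Self::PendingCredentialCiphertextTooLarge,
        Self::PendingCredentialPubkeyAwaiting,
        Self::PendingCredentialNodeOffline,
        Self::PendingCredentialQueueFull,
    ];

    /// Public error code for each variant (8006-8011).
    pub const fn code(self) -> u16 {
        match self {
            Self::PendingCredentialDecryptFailed => 8006,
            Self::PendingCredentialVersionUnsupported => 8007,
            Self::PendingCredentialCiphertextTooLarge => 8008,
            Self::PendingCredentialPubkeyAwaiting => 8009,
            Self::PendingCredentialNodeOffline => 8010,
            Self::PendingCredentialQueueFull => 8011,
        }
    }
}
```

A unit test asserting that no `RciError` code appears in the existing set would have caught the original collision mechanically.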

Other findings (all confirmed against code, fixed)

#	Finding	Severity	Source
1	Error code collision (above)	P0	both agents
2	T1 threat model overclaims — NyxID serves the SRI/HTML/verify UX so a malicious server can substitute it all in lockstep; Phase 4.5 only helps if admin verifies out-of-band	P1	codex
3	`supported_features` "no polling fallback" contradicts error 8007's "polls or falls back" (8007 in the original numbering; 8009 after the renumber)	P2	codex
4	`/ciphertext` sync semantics underspecified — should reuse `send_credential_update_and_wait` pattern from `node_ws_manager.rs:1558`	P2	codex
5	Frame name: ADR used `pending_credential_available` singular with metadata; actual code at `ws_client.rs:433` uses `pending_credentials_available` plural with no metadata + HTTP pull	P2	codex
6	`ciphertext: Vec<u8>` traversing JSON WSS frames needs explicit base64url encoding spec	P2	opencode
7	`supported_features` field doesn't exist yet — Phase 1 work, but ADR wrote about it as present	P2	opencode
8	`AGENTS.md` line-85 text itself is stale (8004-8005 exist beyond the documented 8000-8003 range); ADR's reference is also fragile	P2	opencode
9	`findAndModify` is the MongoDB concept; the Rust driver method is `find_one_and_update`	P3	opencode
10	`srl_` typo for `sri_` in test names (3 occurrences)	P3	codex
11	Reference links `ChronoAIProject/issues/` should be `ChronoAIProject/NyxID/issues/`	P3	codex
12	Per-pending rate limit mechanism not specified	P3	opencode
13	Standalone HTML 5 KB target is unrealistic (`@noble/curves` x25519 alone is ~6 KB); loosen to ~30 KB gzipped with lazy-load fallback	P3	opencode
14	Multi-node fan-out has no expiry story when `partial_decrypted` hits `expires_at`	P3	opencode
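
The wait pattern named in finding 4 (allocate request_id, register a oneshot waiter, send the WSS frame, await the ack with a timeout) can be sketched with the standard library alone. This is not the code at `node_ws_manager.rs:1558`; the names `Waiters`, `Ack`, and `WaitError` are hypothetical, a bounded `std::sync::mpsc` channel stands in for an async oneshot, and sending the frame itself is left out.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, SyncSender};
use std::sync::Mutex;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum Ack {
    Consumed,
    DecryptFailed,
}

#[derive(Debug, PartialEq)]
pub enum WaitError {
    Timeout,
    NodeGone,
}

/// Hypothetical registry mapping request ids to one-shot ack channels.
#[derive(Default)]
pub struct Waiters {
    next_id: Mutex<u64>,
    pending: Mutex<HashMap<u64, SyncSender<Ack>>>,
}

impl Waiters {
    /// Allocate a request id and register a waiter for its ack.
    pub fn register(&self) -> (u64, Receiver<Ack>) {
        let mut next = self.next_id.lock().unwrap();
        *next += 1;
        let id = *next;
        let (tx, rx) = mpsc::sync_channel(1);
        self.pending.lock().unwrap().insert(id, tx);
        (id, rx)
    }

    /// Called when the node's ack frame arrives; false if nobody is waiting.
    pub fn resolve(&self, request_id: u64, ack: Ack) -> bool {
        match self.pending.lock().unwrap().remove(&request_id) {
            Some(tx) => tx.send(ack).is_ok(),
            None => false,
        }
    }

    /// Block until the ack arrives or the timeout elapses, then clean up.
    pub fn wait(
        &self,
        request_id: u64,
        rx: Receiver<Ack>,
        timeout: Duration,
    ) -> Result<Ack, WaitError> {
        let out = rx.recv_timeout(timeout).map_err(|e| match e {
            RecvTimeoutError::Timeout => WaitError::Timeout,
            RecvTimeoutError::Disconnected => WaitError::NodeGone,
        });
        self.pending.lock().unwrap().remove(&request_id);
        out
    }
}
```

The HTTP handler would call `register`, send the frame carrying the returned id, then `wait`; the WSS read loop calls `resolve` when the matching ack arrives.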

Calibration

/plan-eng-review missed the P0 because it accepted AGENTS.md's "8000-8003 reserved" text without grepping the actual error-code source-of-truth in backend/src/errors/mod.rs. Both background agents caught it by reading the code first. Worth adopting "grep the code, not the docs, before locking error-code ranges" as a default check in future ADR reviews.

Diff stats

1 file changed, +64/-31 lines
ADR is now 505 lines (was 472 pre-fix)
No protocol changes, no threat-model changes (only T1 wording clarification)

Design document for #773 (org-admin remote credential injection without SSH-to-node). Status: proposed, eng-reviewed via /plan-eng-review with 9 issues raised and resolved, 0 unresolved, 0 critical gaps.

Protocol shape: browser-driven end-to-end encrypted relay using X25519 ECDH + HKDF-SHA256 + XChaCha20-Poly1305. NyxID server is strictly a passthrough: it never derives keys and never sees plaintext. Node generates an ephemeral X25519 keypair per pending credential, sealed at rest by the node's long-lived auth key (survives restart, dropped on consume).

Eng review folded three scope expansions into the design:
- Multi-node fan-out with all-N semantics (Phase 3.5)
- Frontend code-integrity infrastructure: SRI hashes, signed release channel, admin verification UX (Phase 4.5)
- Explicit Test Strategy section enumerating 25 mandatory tests by layer (node crypto, backend endpoints, frontend, cross-language interop fixtures, e2e, audit, regression, eval)

Hardened design decisions captured:
- Persistent sealed privkey on node (survives restart) instead of in-memory-only
- Manual-accept gate is the default; auto-accept requires a second opt-in (preserves separation-of-duties for existing orgs)
- First-push-wins atomicity + per-pending rate limit (1 successful POST, 3 failed in 60s, 5min lockout)
- 16 KB ciphertext cap, error codes 8004-8009 reserved
- CryptoBundle as a typed sub-document on NodePendingCredential
- `supported_features` set on Node for version detection

Closes the design phase of #773. Implementation tracked in subsequent PRs against the phases enumerated in the doc. Tracking: #769.

…l injection ADR

… codes

The "Design Review Feedback & Best Practices" section added in 9b132a7 introduced refinements that didn't fully align with the existing spec elsewhere in the document. Reconcile:
- Phase 3.5 (Multi-node fan-out) step 6 previously said "Decrypt failure on any one node bricks the logical credential", which directly contradicted Best Practice §2's retry-only-failed-nodes semantics. Rewrite steps 5 and 6 to describe the retry path explicitly: partial_decrypted is a recoverable state, frontend re-runs Phase 3 against the failed subset, consumed is reached when all N have accepted.
- The test row e2e_multi_node_partial_failure was renamed to e2e_multi_node_partial_failure_then_retry and updated to assert the recovery path including idempotency of previously-accepted nodes.
- Error code 8009 (previously "(reserved)") is now formally assigned to PendingCredentialQueueFull, the natural use case for the 5-per-node ciphertext queue cap described in Best Practice §3.
- Best Practice §1: replace the hardcoded "1 hour" TTL with a reference to NodePendingCredential.expires_at so the local sweep stays aligned with the server-side TTL if the server default ever changes.
- Best Practice §4: pin the mechanism to MFA confirmation explicitly, since NyxID already has the /auth/mfa/verify endpoint. Note why multi-admin approval is not used (no existing approval-queue subsystem; would be a separate project).

All five edits are internal-consistency / clarification only, with no protocol or threat-model changes.

Seven non-blocking review notes from Grok addressed:
1. Privkey sealing: replace the vague "crypto/aes.rs style" reference with a concrete cross-reference to the existing cli/src/node/secret_backend.rs::SecretBackend trait, naming the sibling methods (store_auth_token, store_signing_secret, store_credential_value) so the Phase 2 implementer has no ambiguity about which mechanism to reuse.
2. WSS frame classification: add a one-paragraph clarification that the new pending_credential_pubkey / pending_credential_ciphertext frames are node-control protocol traffic (same class as node_metrics, proxy_request, etc.) and bypass the ws_frame_injections rules on DownstreamService / UserService. Implementation note points to node_ws.rs vs ws_frame_injector.rs.
3. Standalone credential-accept page: add a new subsection to Phase 4.5 specifying the page is served as minimal standalone HTML (strict CSP, no main SPA bundle, ~5 KB form shim) to reduce the substitution surface, and that the page's HTML hash is itself part of the signed release manifest.
4. Phase 4.5 runbook items: new table enumerating the infra questions that must be answered before the phase starts: GPG signing key holder, rotation procedure, releases.nyxid.dev hosting, manifest publish pipeline, signature scheme, expiry. Implementer documents the chosen answers in docs/RELEASE_INTEGRITY.md.
5. AGENTS.md vs CLAUDE.md correction: Grok caught a factual error: the error-code listing lives at AGENTS.md:85, not in CLAUDE.md. Phase 1 update target corrected.
6. Strip GSTACK REVIEW REPORT table: remove the process artifact from the committed ADR (eng-review summary moves to PR description and commit history). Keeps the ADR a pure design doc for future readers.
7. Browser crypto floor: sharpen the open question to call for real admin-population analytics before Phase 3, with a concrete metric format and a lazy-load suggestion if SubtleCrypto coverage is overwhelmingly high.

No protocol or threat-model changes. All edits are clarifications, operational hardening, and doc hygiene.


…+ Grok)

Three agents independently reviewed PR #822. Grok gave LGTM after exhaustive code cross-referencing; Codex and Opencode surfaced 14 specific items (2 P1, 5 P2, 5 P3, 2 informational). All addressed:
- P1: Status field missing on model (Codex). CryptoBundle had no state tracking for the lifecycle the protocol describes. Add RemoteCryptoState enum (PubkeyPosted, CiphertextReceived, CiphertextQueued, Consumed, DecryptedPendingConfirmation, DecryptFailed, Expired) plus remote_state field on NodePendingCredential. For multi-node fan-out, add FanOutNodeState subdocument with per-node crypto + state.
- P1: Offline handling contradicts sync POST (Codex). Error code 8010 said "queued" but sync section said 503. Resolved by supporting both: node online = sync wait (200/4xx/504 timeout); node offline at POST time = store ciphertext with 15-min queue TTL, return 202 with remote_state CiphertextQueued, forward on reconnect. The two paths are distinguished because the server knows node connectivity before attempting the WSS send.
- P1: T1 + auto-accept tension (Opencode). Auto-accept means admin doesn't pause to verify fingerprints. Added security tradeoff note and recommendation for strict orgs.
- P2: Feature flag visibility (Codex). Frontend checked crypto_v1 but couldn't tell if enable_remote_credential_injection was actually true. Now the detection section specifies three UI states: crypto-capable + enabled, crypto-capable + disabled (with enablement hint), and legacy.
- P2: G6 + Phase 4.5 overclaim (Codex). G6 now says "Detection of malicious code-substitution... assuming admin independently verifies." Phase 4.5 intro aligned to match.
- P2: 202 response novelty note (Opencode). Added RFC 7231 reference + explicit frontend handling note.
- P2: Auto-accept migration warning (Opencode). One-time banner UX when feature first enabled.
- P3: Primitives table encoding (Codex). "raw bytes" → "base64url" for ciphertext in JSON transports.
- P3: WSS classification frame name (Codex). Singular pending_credential_available → plural pending_credentials_available.
- P3: Failure table pubkey-awaiting (Codex). "legacy node" → "crypto_v1 node, async pubkey delay" with correct error code 8009 and polling semantics.
- P3: Queue TTL vs metadata TTL (Opencode). Added one-sentence clarification of independence.
- P3: Phase 3 effort estimate (Opencode). Bumped from 3d/4h to 4d/6h given expanded push form scope.

Plus 4 new tests: RemoteCryptoState enum serde roundtrip, offline node returns 202 queued, queued ciphertext forwarded on reconnect, queued ciphertext expired after TTL.

ADR is now 601 lines. Grok's LGTM + Codex/Opencode's corrections together give strong cross-model confidence in the design.
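
The online/offline split for the ciphertext POST described in this commit can be illustrated as a small mapping from delivery outcome to HTTP status and `remote_state`. The variant names of `RemoteCryptoState` come from the commit message; everything else is an assumption for illustration: `Delivery`, `NodeReply`, and `post_outcome` are hypothetical names, 422 is a placeholder for the unspecified "4xx", the success arm assumes the auto-accept path, and the state after a 504 follows the later Round 3 note about polling from CiphertextReceived.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RemoteCryptoState {
    PubkeyPosted,
    CiphertextReceived,
    CiphertextQueued,
    Consumed,
    DecryptedPendingConfirmation,
    DecryptFailed,
    Expired,
}

/// Hypothetical: what the node said while the handler was waiting.
pub enum NodeReply {
    Consumed,
    DecryptFailed,
    TimedOut,
}

/// Hypothetical: the server knows connectivity before attempting the send.
pub enum Delivery {
    Offline,
    Online(NodeReply),
}

/// Map a delivery outcome to (HTTP status, resulting remote_state).
pub fn post_outcome(delivery: Delivery) -> (u16, RemoteCryptoState) {
    match delivery {
        // Node offline: ciphertext is stored with the queue TTL.
        Delivery::Offline => (202, RemoteCryptoState::CiphertextQueued),
        // Assumes auto-accept; with manual accept this would presumably be
        // DecryptedPendingConfirmation instead.
        Delivery::Online(NodeReply::Consumed) => (200, RemoteCryptoState::Consumed),
        // Placeholder status: the source only says "4xx".
        Delivery::Online(NodeReply::DecryptFailed) => (422, RemoteCryptoState::DecryptFailed),
        // Sync wait timed out; the caller polls from here.
        Delivery::Online(NodeReply::TimedOut) => (504, RemoteCryptoState::CiphertextReceived),
    }
}
```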

The CLI should support the entire remote credential injection flow natively, not just "push intent then switch to browser." The admin's locally-installed binary is immune to T1 code-substitution by design (no NyxID-served JS to substitute), making it the strongest trust anchor for security-conscious orgs.

New `nyxid node-credential inject` subcommand:
- Interactive mode: push + poll pubkey + prompt secret + encrypt + post ciphertext + wait for result (one command, full flow)
- Non-interactive mode: --secret-env reads from env var for CI / cron / rotation scripts
- Org support: --org flag follows existing UUID/slug/name convention
- Fallback: existing `push` command now prints BOTH browser URL and `inject --pending <id>` CLI command

Shared crypto crate: Phase 2 (node decrypt) and Phase 5 (CLI encrypt) use the same x25519-dalek + chacha20poly1305 + hkdf + sha2 stack. Extract into nyxid-crypto workspace crate (~400 LOC) following the existing nyxid-cloud-auth precedent. Exposes encrypt() and decrypt() with AAD context binding. Browser @noble/* and Rust crate must produce identical output; interop fixtures verify this.

Phase 5 effort bumped from 2d/3h to 3d/4h to reflect the full inject command + shared crate extraction. Test strategy adds 10 new tests: 5 for the shared crate, 5 for the inject command, plus 4 CLI-specific e2e tests covering interactive, --secret-env, --org, and the push fallback that prints both paths.

Security note: for orgs that care about T1, the CLI path is the recommended approach. No SRI/fingerprint verification is needed because the binary itself is the trust anchor.

When an admin works alongside an AI coding agent (Claude Code, Codex, OpenClaw, etc.), the agent has full terminal visibility. A masked "Enter secret value:" prompt in interactive mode would still expose the secret through the terminal session the agent can read.

New --browser flag on nyxid node-credential inject:
1. CLI creates the pending credential via the API
2. Opens the default browser to the standalone accept page
3. Admin enters the secret in the browser (agent cannot see it)
4. Browser encrypts + submits (Phase 3 e2e crypto)
5. CLI polls pending state until consumed/failed/expired
6. CLI prints result; secret never appears in terminal transcript

Follows the existing CLI wizard pattern from CLI_WIZARD_V3.md, already used for OAuth device-code flows.

The inject command now has three modes:
- Interactive (default): terminal prompt, direct human use
- --secret-env VAR: env var, CI / automation
- --browser: wizard, AI-agent-safe

The push fallback now prints all three continuation paths (browser URL, inject --pending, inject --pending --browser) so the admin can pick. Added 3 CLI tests (wizard opens URL + polls, secret not in transcript, push prints all paths) and 2 e2e tests (browser wizard flow, push prints all paths).
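
The three inject modes could be resolved from the parsed flags along these lines. This is a hedged sketch: `InjectMode` and `select_mode` are hypothetical names, and treating `--secret-env` and `--browser` as mutually exclusive is an assumption the commit message does not state.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum InjectMode {
    /// Default: terminal prompt for direct human use.
    Interactive,
    /// `--secret-env VAR`: read the secret from the named env var.
    SecretEnv(String),
    /// `--browser`: hand off secret entry to the standalone accept page.
    Browser,
}

/// Pick the mode from already-parsed flag values.
pub fn select_mode(secret_env: Option<&str>, browser: bool) -> Result<InjectMode, String> {
    match (secret_env, browser) {
        // Assumption: combining the two flags is rejected rather than ranked.
        (Some(_), true) => Err("--secret-env and --browser are mutually exclusive".to_string()),
        (Some(var), false) => Ok(InjectMode::SecretEnv(var.to_string())),
        (None, true) => Ok(InjectMode::Browser),
        (None, false) => Ok(InjectMode::Interactive),
    }
}
```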

Round 3 review by Codex (gpt-5.5), Opencode (glm-5.1), and Grok (grok-build). Grok gave LGTM again after delta review. Codex found 1 P1 + 2 P2 + 3 P3. Opencode found 2 P2 + 2 P3. All 9 addressed:

- P1 (Codex), CLI T1 claim overstated: CLI is immune to code-substitution but NOT to data-substitution. The node pubkey fetched from NyxID can be replaced by a compromised server (classic MITM on the key exchange). AEAD AAD binds metadata but not the pubkey's origin. Downgrade the claim to "immune to T1 code-substitution; still vulnerable to T1 data-substitution like the browser path." Add an operational note about out-of-band pubkey fingerprint verification and a future-work pointer to node-signed pubkeys via TOFU-pinned Ed25519 identity.
- P2 (Codex), AAD too narrow: AAD now includes injection_method, field_name, and target_url in addition to the existing node_id, pending_id, slug, version. T5 threat row updated to reflect metadata-tampering protection. A malicious relay that alters where the secret goes after submission now causes an AEAD tag failure on the node.
- P2 (Codex), PartialDecrypted missing from enum: Added to RemoteCryptoState for the multi-node fan-out aggregate state. Per-node states are in FanOutNodeState; the logical pending now has an explicit state for "some nodes accepted, some failed."
- P2 (Opencode), 504 CiphertextReceived polling guidance: After a sync-wait timeout (504), the admin should poll with exponential backoff watching for state transitions. Surface a "node may be unresponsive" warning after 2 minutes.
- P2 (Opencode), CiphertextQueued in state diagram: State transition diagram now shows the offline → queued → reconnect → consumed/expired branch alongside the online path.
- P3 (Codex), Consumed state tightened: Comment now says "terminal success: decrypted AND stored." The intermediate operator-confirm state is DecryptedPendingConfirmation, not Consumed. Prevents frontend/cleanup confusion.
- P3 (Codex), Interop test wording: "identical ciphertext" → "mutually decryptable." With random nonces, identical plaintext will not produce identical ciphertext (by design). Interop fixtures use fixed test vectors (pinned keys + nonce) for byte-level verification.
- P3 (Codex), node config set command: The ADR referenced a `nyxid node config set` CLI subcommand that doesn't exist. Updated to point at the web UI node settings page and PATCH /api/v1/nodes/{id} as the enablement paths, with a note that Phase 1 should add the CLI subcommand or the API + web UI are the only paths.
- P3 (Opencode), Feature-gate decrypt() in nyxid-crypto: Backend depends on nyxid-crypto for CiphertextEnvelope + validation only. decrypt() is behind cfg(feature = "decrypt") so the backend binary cannot call it. Only nyxid-cli enables the feature.
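The 504 polling guidance in this round (exponential backoff, plus a "node may be unresponsive" warning after 2 minutes) can be sketched as a pure schedule computation. The base delay and cap used in the usage note are illustrative assumptions, and the function names are hypothetical; only the 2-minute threshold comes from the review text.

```rust
use std::time::Duration;

/// Threshold from the review note: warn once polling has run this long.
const UNRESPONSIVE_WARNING_AFTER: Duration = Duration::from_secs(120);

/// Delay before poll number `attempt` (0-based): base * 2^attempt, capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    // checked_mul returns None on overflow; fall back to the cap in that case.
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// For each poll, return (delay before the poll, whether the warning should be shown by then).
fn poll_schedule(polls: u32, base: Duration, max: Duration) -> Vec<(Duration, bool)> {
    let mut elapsed = Duration::ZERO;
    (0..polls)
        .map(|attempt| {
            let delay = backoff_delay(attempt, base, max);
            elapsed += delay;
            (delay, elapsed >= UNRESPONSIVE_WARNING_AFTER)
        })
        .collect()
}
```

With a 1-second base and a 30-second cap, the delays run 1, 2, 4, 8, 16, 30, 30, 30 seconds, so the warning flag first turns on at the eighth poll (121 seconds of cumulative waiting).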

…routes

Round 4 cross-model review (Codex + Opencode + Grok). Grok LGTM again. Codex and Opencode found 6 text-consistency items (no protocol changes):

1. Top-level context + G2 now explicitly scope the confidentiality claim to T2 (passive compromise) and reference the T1 pubkey-MITM caveat. No more "strictly preserves" language without qualification.
2. T1 data-substitution caveat now lists the full 7-field AAD (it was listing only 4, a stale reference from before the AAD expansion).
3. New section documenting why HKDF info (4 fields) and AEAD AAD (7 fields) intentionally differ: HKDF for domain separation, AAD for integrity. Includes the canonical wire format (length-prefixed u16_be fields) and the Option<String> sentinel convention (None → empty string → u16_be(0)) to prevent downgrade attacks.
4. require_operator_confirm_for_remote reframed as a separation-of-duties control, not a T1 confidentiality defense. The gate runs after secret submission, which is too late to prevent pubkey-MITM exfiltration.
5. API routes unified: CLI section now uses /nodes/{node_id}/credentials/pending/{pending_id} (matching the sequence diagram and existing routes.rs at lines 545-546), not the shorter /credentials/pending/{id}, which doesn't exist.
6. PartialDecrypted documented as a fan-out aggregate-only state with an explicit MUST NOT appear note for FanOutNodeState.remote_state. State diagram note added.
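The canonical wire format from item 3 (length-prefixed u16_be fields, with `None` encoded as an empty string and therefore `u16_be(0)`) can be sketched as below. This is an illustration under assumptions: the struct name `RciContext`, the field order, encoding `version` as a string, and treating `field_name` and `target_url` as the optional fields are all hypothetical choices; the commit message only names the seven fields and the convention.

```rust
/// Append one field as a big-endian u16 length followed by its bytes.
fn push_field(out: &mut Vec<u8>, field: &str) -> Result<(), String> {
    if field.len() > u16::MAX as usize {
        return Err("field longer than 65535 bytes".to_string());
    }
    out.extend_from_slice(&(field.len() as u16).to_be_bytes());
    out.extend_from_slice(field.as_bytes());
    Ok(())
}

/// Sentinel convention: None -> "" -> u16_be(0), so an optional field always
/// occupies its slot and cannot be silently dropped from the encoding.
fn push_optional(out: &mut Vec<u8>, field: Option<&str>) -> Result<(), String> {
    push_field(out, field.unwrap_or(""))
}

/// Hypothetical context holding the seven AAD fields named in the review.
struct RciContext<'a> {
    node_id: &'a str,
    pending_id: &'a str,
    slug: &'a str,
    version: &'a str,
    injection_method: &'a str,
    field_name: Option<&'a str>,
    target_url: Option<&'a str>,
}

/// HKDF info: the four domain-separation fields.
fn encode_kdf_info(c: &RciContext) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    for f in [c.node_id, c.pending_id, c.slug, c.version] {
        push_field(&mut out, f)?;
    }
    Ok(out)
}

/// AEAD AAD: all seven fields, so tampering with routing metadata breaks the tag.
fn encode_aad(c: &RciContext) -> Result<Vec<u8>, String> {
    let mut out = encode_kdf_info(c)?;
    push_field(&mut out, c.injection_method)?;
    push_optional(&mut out, c.field_name)?;
    push_optional(&mut out, c.target_url)?;
    Ok(out)
}
```

Length prefixes keep the encoding unambiguous: without them, ("ab", "c") and ("a", "bc") would serialize to the same bytes.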

…ended in v1'

Codex repeatedly flagged (rounds 3-5) that the threat table presented T1 and T3 as "defended" while the protocol text admitted the pubkey-substitution MITM defeats both. Opencode and Grok said ship, but Codex's structural point is valid: a security reviewer reads the threat table first, and "defended" there contradicts the caveats deeper in the doc.

Split the table into two sections:

- "Adversaries defended in v1": T2, T4, T5, T6, T7, T8. These are fully defended by the e2e crypto, AAD binding, atomicity, forward secrecy, rate limiting, and sealed privkey persistence. T2 (passive read) is now explicitly marked as the primary confidentiality guarantee.
- "Acknowledged threats — NOT defended in v1": T1 and T3. Both share the same key-exchange MITM weakness. Each row now explains WHY it's not defended (pubkey substitution), what mitigation exists TODAY (out-of-band fingerprint verification), and what the future fix is (node-signed ephemeral pubkeys via TOFU-pinned Ed25519 identity).

G6 narrowed from "detection of malicious code-substitution" to explicitly "JS code-substitution only, not pubkey substitution."

kaiweijw · 2026-06-04T07:16:39Z

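📊 Current status: independent ADR security audit dispatched (no human intervention needed)

| Dimension | Value |
|---|---|
| Subject | This PR's ADR (remote credential injection E2E encryption protocol) |
| Action | An independent auditor (codex) is reading the ADR and checking it against the real NyxID code, reviewing the threat model, primitives, AAD binding, key lifecycle, the "server never sees plaintext" invariant, backward compatibility, and error-code completeness |
| Output | The audit conclusion will be posted back to this PR as a comment |
| Note | This PR is a documentation-only ADR; merging into `main` is still the maintainer's decision (this loop does not auto-merge to main) |
| Human intervention needed | ❌ No (waiting for the audit conclusion) |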

🤖 controller status banner

⟦AI:AUTO-LOOP⟧

kaiweijw · 2026-06-04T07:24:51Z

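🤖 ADR #822 Remote Credential Injection: independent security/design audit

TL;DR

What: I performed a read-only review of the ADR in PR #822 (docs(adr): remote credential injection design for #773 (eng-reviewed)), issue #773 ([Feature] Remote credential injection path for org admins (no SSH-to-node required)), the PR metadata, and the NodePendingCredential model, pending credential handlers, node WSS, SecretBackend, and error-code paths in the current codebase. The current pending model is still metadata-only, with no secret field (backend/src/models/node_pending_credential.rs:26-47).
Conclusion: The ADR's core confidentiality boundary for T2/passive compromise largely holds, and it honestly marks T1/T3 as not defended in v1 (ADR §"Threat model" lines 51-56). However, I found 10 issues, 1 of them blocking: the 15-minute TTL of the offline ciphertext queue has no persisted field in the single-node data model that could enforce it (ADR §"Data model" lines 181-255; ADR §"Design Review Feedback & Best Practices" lines 707-709).
Next steps: Fix F1 first; F2 must be fixed before Phase 3.5 at the latest; F3-F8 should be folded into the Phase 1/2 acceptance criteria (evidence and recommendations in the table below).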

| ID | Severity | Evidence | Recommendation |
|---|---|---|---|
| F1 | blocking | The ADR requires queued ciphertext to use an independent 15-minute TTL, but `CryptoBundle` / single-node `NodePendingCredential` has no `ciphertext_queued_at` or `ciphertext_expires_at` field; the current model has no usable field either (ADR §"Data model" lines 181-255; ADR §"Sync vs async response semantics" line 377; `backend/src/models/node_pending_credential.rs:26-47`). | Add queue timestamp / expiry fields, indexes, and sweeper rules; per-node fan-out state must be equally enforceable. |
| F2 | high | The ADR says multi-node fan-out targets `fallback_node_ids`, but the current `NodeRoute` returns only viable online / WS-connected nodes (ADR §"Multi-node fan-out" lines 349-359; `backend/src/services/node_routing_service.rs:21-24`, `:154-165`). | Do not treat the proxy-failover `NodeRoute` as the fan-out target set; define a resolver for all intended active bindings, then do online delivery or offline queueing for each node. |
| F3 | medium | The ADR states pending IDs have about 256 bits of entropy; the current pending ID is `Uuid::new_v4()`, about 122 random bits (ADR §"Threat model" line 48; `backend/src/services/node_pending_credential_service.rs:60-62`). | Change the document to say UUID v4, about 122 bits, or change the protocol to 32-byte random pending IDs. |
| F4 | medium | The ADR only says base64url and does not define padded / no-pad; the current codebase has both URL_SAFE_NO_PAD and STANDARD base64 usages (ADR §"Cryptographic primitives" line 170; `cli/src/auth.rs:7`, `:47`; `backend/src/handlers/node_ws.rs:149`). | Specify that RCI JSON fields use `URL_SAFE_NO_PAD`, or define acceptance rules and a canonical encoder; also validate decoded lengths. |
| F5 | medium | AAD binds `injection_method`, but the canonical token is undefined; the pending API uses `query-param` / `path-prefix`, while the live `credential_update` path uses `query_param` / `path_prefix` (ADR §"HKDF info vs AEAD AAD" lines 174-177; `cli/src/cli.rs:1555-1561`; `cli/src/node/ws_client.rs:3036-3074`). | Provide an `RciAadContext` builder in the shared protocol / crypto crate; add fixed domain labels and pin the canonical form of the injection method. |
| F6 | medium | The 16 KB cap is written for POST / model / tests, but boundary checks for WSS forward, queued replay, and the fan-out aggregate body are not explicitly written into the ADR (ADR §"Data model" line 194; ADR §"Error codes" line 494; `backend/src/services/node_ws_manager.rs:1537-1540`). | Use one `MAX_CIPHERTEXT_SIZE` for HTTP decode, Mongo update, WSS frame construction, queued replay, and each fan-out per-node item. |
| F7 | medium | The ADR requires consume/decline/cancel/expire to all evict the sealed privkey, but the cancel control-plane frame is not specified; the current admin cancel only updates the DB and writes audit (ADR §"Test Strategy" lines 514-517; `backend/src/handlers/node_admin.rs:637-664`). | Specify `pending_credential_canceled` / expired frames, or define polling diff cleanup; local sweep can only be a fallback. |
| F8 | medium | The operator-confirm path holds the decrypted secret in node memory, but the ADR only says "volatile ready-to-accept queue with TTL", with no TTL length, zeroization, or cancel behavior (ADR §"Flow" lines 332-335; ADR §"Accept gate" lines 418-420). | TTL must not exceed `expires_at`; use `Zeroizing` for plaintext buffers; zeroize explicitly on all drop/cancel/expire paths. |
| F9 | low | Read on its own, the SRI subsection overstates: a fully compromised server can replace both the HTML and the SRI; other ADR subsections already correctly require a separate signed manifest + out-of-band admin verification (ADR §"Code-integrity infrastructure" line 430; ADR §"SRI hashes" line 443; ADR §"Signed release channel" lines 447-452). | Change the SRI paragraph to "when the HTML is also replaced, T1 detection depends on signed manifest comparison". |
| F10 | nit | The ADR only names `AGENTS.md` as stale, but the same entry in `CLAUDE.md` is also stale; the code already occupies 8000-8005 (ADR §"Error codes" lines 488-499; `backend/src/errors/mod.rs:374-379`; `AGENTS.md:85`; `CLAUDE.md:85`). | Update both documents in Phase 1, or designate one as the generated / sole authoritative source. |

Original audit artifact (English)

# ADR PR #822 Remote Credential Injection - Independent Security/Design Audit

Audit date: 2026-06-04  
Scope: ADR from `git show origin/docs/issue-773-remote-credential-injection:docs/REMOTE_CREDENTIAL_INJECTION.md`, issue #773, PR #822, and current NyxID checkout.  
Mode: read-only source audit; no source code changes.

## Executive Verdict

The ADR is directionally sound for its main confidentiality invariant under the stated passive-compromise model: the remote path places encryption in the admin browser/CLI and decryption on the node, while the server stores/forwards `admin_pubkey`, `nonce`, and ciphertext only (`ADR §"Goals"` lines 23-27, `ADR §"Data model"` lines 181-203, `ADR §"Shared crypto crate"` line 620). The document is also materially more honest than earlier versions: T1 and T3 are explicitly moved to "NOT defended in v1" because the server can substitute the node pubkey (`ADR §"Threat model"` lines 51-56, `ADR §"T1 data-substitution caveat"` line 97).

I found no design path where NyxID intentionally receives plaintext or derives the shared key under T2, assuming the backend keeps the `decrypt` feature disabled as specified (`ADR §"Shared crypto crate"` line 620). The current code also supports the baseline claim that today's pending credential flow stores metadata only: `NodePendingCredential` has no secret field (`backend/src/models/node_pending_credential.rs:26-47`), admin push logs only metadata (`backend/src/handlers/node_admin.rs:589-599`), and node consume/decline audit events are metadata-only (`backend/src/handlers/node_agent.rs:93-107`, `backend/src/handlers/node_agent.rs:127-145`).

However, I do not consider the ADR fully implementation-ready until the offline queue TTL is represented in the data model. The ADR requires a 15-minute ciphertext queue TTL independent from `NodePendingCredential.expires_at`, but the proposed model has no single-node queued-at / queued-expires-at field to enforce that requirement (`ADR §"Data model"` lines 181-255, `ADR §"Sync vs async response semantics"` line 377, `ADR §"Design Review Feedback & Best Practices"` lines 707-709).

## Findings Table

| ID | Severity | Finding | Evidence | Recommendation |
|---|---|---|---|---|
| F1 | blocking | Offline ciphertext queue TTL is required but not representable in the proposed single-node data model. The ADR says queued ciphertext uses a shortened 15-minute TTL independent from metadata `expires_at`, but `CryptoBundle` has only `version`, pubkeys, nonce, and ciphertext, while `NodePendingCredential` only adds `crypto`, `remote_state`, and optional `fan_out_nodes`. The current model likewise has no `updated_at` or queue-expiry field. Without a persisted queue timestamp/expiry, a sweeper cannot distinguish "queued 15 minutes ago" from "pending metadata still valid for 1 hour." | `ADR §"Data model"` lines 181-203 and 237-255; `ADR §"Sync vs async response semantics"` line 377; `ADR §"Design Review Feedback & Best Practices"` lines 707-709; `backend/src/models/node_pending_credential.rs:26-47` | Add explicit queue lifecycle fields, e.g. `ciphertext_queued_at` and `ciphertext_expires_at` on single-node pending credentials and per fan-out node, using the required BSON chrono serde helpers. Define the cleanup query/index and reconnect-forward rule against those fields. |
| F2 | high | Multi-node fan-out target selection is underspecified and conflicts with current `NodeRoute` semantics if implementers reuse `fallback_node_ids`. The ADR says fan-out applies to services with `fallback_node_ids`, but the current routing service only returns viable online/WS-connected nodes as primary/fallbacks. That is correct for proxy failover, but wrong as a credential fan-out source because offline nodes are exactly the nodes that need queued ciphertext. | `ADR §"Multi-node fan-out"` lines 349-359; `backend/src/services/node_routing_service.rs:21-24`, `backend/src/services/node_routing_service.rs:154-165`, `backend/src/services/node_routing_service.rs:129-149` | Specify a separate fan-out target resolver: all intended active node bindings for the service/owner, with per-node ACL checks, not the viability-filtered `NodeRoute`. Then apply online sync delivery or offline queue per target. |
| F3 | medium | T7 overstates pending ID entropy. The ADR says pending IDs have approximately 256 bits of entropy, but current pending IDs are UUID v4 strings generated by `Uuid::new_v4()`, which provide about 122 random bits. That is still practically unguessable, but the ADR's evidence is inaccurate. | `ADR §"Threat model"` line 48; `backend/src/services/node_pending_credential_service.rs:60-62` | Correct the ADR to "~122 bits via UUID v4" or explicitly switch RCI pending IDs to 32-byte random identifiers if 256-bit entropy is a hard requirement. |
| F4 | medium | Base64url encoding is not pinned to padded vs unpadded form. The ADR says "base64url" for keys, nonce, and ciphertext, while the current codebase uses both URL-safe no-padding encodings for JWT-like material and standard base64 for WSS binary payloads. Interop fixtures will catch this late, but implementers need the canonical rule up front. | `ADR §"Cryptographic primitives"` line 170; `ADR §"Data model"` lines 196-200; `cli/src/auth.rs:7`, `cli/src/auth.rs:47`; `backend/src/handlers/node_ws.rs:149`; `cli/src/node/ws_client.rs:1088` | Specify `base64::engine::general_purpose::URL_SAFE_NO_PAD` for all RCI JSON fields, or specify an accepting decoder plus canonical encoder. Include exact decoded lengths: 32-byte pubkeys, 24-byte nonce, `<= 16 * 1024` ciphertext+tag. |
| F5 | medium | Crypto context canonicalization needs to be centralized. The ADR binds `injection_method` in AAD but does not define the canonical string source. Current pending credential APIs use kebab-case (`query-param`, `path-prefix`), while the live WSS `credential_update` apply path uses underscore tokens (`query_param`, `path_prefix`). The remote path can be correct, but only if all encrypt/decrypt sides use the same canonical pending-credential tokens. | `ADR §"HKDF info vs AEAD AAD"` lines 174-177; `backend/src/models/node_pending_credential.rs:16-22`; `cli/src/cli.rs:1555-1561`; `cli/src/node/ws_client.rs:3007-3010`, `cli/src/node/ws_client.rs:3036-3037`, `cli/src/node/ws_client.rs:3073-3074` | Put `RciAadContext` / `RciKdfInfo` builders in the shared crypto crate or a shared protocol module. Use explicit static domain labels such as `nyxid:rci:v1:kdf` and `nyxid:rci:v1:aad`, and define `injection_method` as the pending API's kebab-case token or an enum discriminant. |
| F6 | medium | The 16 KB ciphertext cap is specified for POST, but not explicitly for WSS forward, queued replay, or fan-out aggregate bodies. Current WSS writer paths only serialize and `try_send`; the ADR's tests only name the HTTP over-size rejection. A future bug or queue replay path could forward an oversized stored blob unless the same constant is checked at every boundary. | `ADR §"Data model"` line 194; `ADR §"Flow"` line 319; `ADR §"Error codes"` line 494; `ADR §"Test Strategy"` line 545; `backend/src/services/node_ws_manager.rs:1537-1540`, `backend/src/services/node_ws_manager.rs:1583-1591` | Define one `MAX_CIPHERTEXT_SIZE` and require checks on HTTP decode, MongoDB update, WSS frame construction, queued replay, and each fan-out element. Add tests for queued replay and fan-out total/per-node limits. |
| F7 | medium | Cancel/decline/expire key eviction is required but the control-plane message is not specified. The ADR requires sealed privkey eviction on consume/decline/cancel/expire and mentions missed cancel WSS frames, but the current admin cancel endpoint only deactivates MongoDB state and writes audit metadata; it does not notify the node. | `ADR §"Test Strategy"` lines 514-517; `ADR §"Design Review Feedback & Best Practices"` lines 699-701; `backend/src/handlers/node_admin.rs:637-664`; `backend/src/services/node_pending_credential_service.rs:127-150` | Add explicit node-control frames such as `pending_credential_canceled` / `pending_credential_expired`, or specify a polling diff protocol that causes deletion. Keep local sweep as a fallback, not the only cancel cleanup path. |
| F8 | medium | The operator-confirm path creates a plaintext-on-node memory queue, but the ADR only says "volatile ready-to-accept queue with TTL" and does not define TTL length, cancellation behavior, or zeroization. This is on the trusted node, not on NyxID, so it does not break the core server invariant; it is still key lifecycle surface that should be deterministic. | `ADR §"Flow"` lines 332-335; `ADR §"Accept gate"` lines 418-420; current node secret handling uses explicit secret backends at `cli/src/node/secret_backend.rs:121-152` | Specify the queue TTL, upper-bound it by `NodePendingCredential.expires_at`, use `Zeroizing` storage for plaintext buffers, and state that cancel/decline/expire/drop paths zeroize the queue entry before removal. |
| F9 | low | The SRI section has one self-contained overstatement. It says NyxID cannot silently substitute JS without changing the HTML's SRI attribute loaded over TLS. Under the ADR's own T1 model, a fully compromised server can also change that HTML; detection depends on the separate signed manifest and out-of-band admin verification. Other sections say this correctly, so this is a wording fix, not a protocol flaw. | `ADR §"Code-integrity infrastructure"` line 430; `ADR §"SRI hashes on crypto JS bundles"` line 443; `ADR §"Signed release channel"` lines 447-452; `ADR §"Admin verification UX"` lines 456-461 | Qualify the SRI paragraph: "SRI blocks stale/tampered bundles unless the HTML is also substituted; T1 detection requires comparing the HTML/bundle hash to the signed release manifest at the separate origin." |
| F10 | nit | Error-code documentation drift exists in both `AGENTS.md` and `CLAUDE.md`, but the ADR only calls out `AGENTS.md`. The code already assigns 8000-8005, while both docs still say 8000-8003. | `ADR §"Error codes"` lines 488-499; `backend/src/errors/mod.rs:374-379`; `AGENTS.md:85`; `CLAUDE.md:85` | Update both docs or make one explicitly generated/authoritative. The ADR's 8006-8011 allocation itself does not collide with current code. |
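Finding F1 turns on two independent expiries: the pending metadata's `expires_at` and a separate 15-minute TTL for queued ciphertext. A minimal sketch of why the second one needs its own persisted fields is below. The struct is hypothetical and uses plain Unix-second timestamps for brevity; the field names `ciphertext_queued_at` and `ciphertext_expires_at` are the ones the audit suggests, not confirmed model fields.

```rust
/// Queue TTL the ADR requires for offline ciphertext (15 minutes), in seconds.
const CIPHERTEXT_QUEUE_TTL_SECS: u64 = 15 * 60;

/// Hypothetical shape: only the fields needed to show the two expiries.
struct PendingCredential {
    /// Metadata expiry (`NodePendingCredential.expires_at`), Unix seconds.
    expires_at: u64,
    /// Set when ciphertext is queued for an offline node; None if nothing is queued.
    ciphertext_queued_at: Option<u64>,
    /// Queue expiry, derived once at queue time and persisted.
    ciphertext_expires_at: Option<u64>,
}

impl PendingCredential {
    /// Record that ciphertext was queued at `now` and persist its own expiry.
    fn queue_ciphertext(&mut self, now: u64) {
        self.ciphertext_queued_at = Some(now);
        self.ciphertext_expires_at = Some(now + CIPHERTEXT_QUEUE_TTL_SECS);
    }

    /// What a sweeper would ask: is there queued ciphertext whose TTL has elapsed?
    fn queued_ciphertext_expired(&self, now: u64) -> bool {
        matches!(self.ciphertext_expires_at, Some(t) if now >= t)
    }

    /// Independent check on the pending metadata itself.
    fn metadata_expired(&self, now: u64) -> bool {
        now >= self.expires_at
    }
}
```

With only `expires_at` available, the sweeper could not tell "queued 15 minutes ago" apart from "metadata still valid for an hour", which is the gap F1 describes.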

## Core Invariant Review

Under T2/passive compromise, the ADR's plaintext boundary is credible. The browser/CLI encrypts before POST (`ADR §"Browser-based push"` lines 85-91, `ADR §"CLI path"` lines 107-114), the server stores only the `CryptoBundle` envelope (`ADR §"Data model"` lines 181-203), and the backend is explicitly forbidden from enabling `decrypt` in the shared crate (`ADR §"Shared crypto crate"` line 620). Offline queuing stores ciphertext server-side, not plaintext (`ADR §"Sync vs async response semantics"` line 377).

Under T1/active compromise, the ADR is now honest: the server can replace the node pubkey and MITM the exchange (`ADR §"Threat model"` lines 51-56, `ADR §"T1 data-substitution caveat"` line 97). Phase 4.5 detects JS substitution only when the admin verifies the signed manifest out-of-band (`ADR §"Goals"` line 27, `ADR §"Code-integrity infrastructure"` line 430). That is a correct v1 boundary.

I found no current code path that contradicts the legacy invariant. The existing admin `node-credential push` posts metadata only (`cli/src/commands/node_credential.rs:23-32`) and prints "Do not send the secret value" (`cli/src/commands/node_credential.rs:48-54`). The node-side accept command prompts/stores locally, then marks consumed through the node-agent API (`cli/src/node/agent.rs:466-505`). The existing model stores pending metadata only (`backend/src/models/node_pending_credential.rs:26-47`).

## Crypto Envelope Review

The primitive choices are reasonable: X25519, HKDF-SHA256, and XChaCha20-Poly1305 are appropriate for an ephemeral ECDH envelope (`ADR §"Cryptographic primitives"` lines 162-170). Random 24-byte nonces for XChaCha20-Poly1305 are suitable if generated per encryption and covered by interop/unit tests (`ADR §"Test Strategy"` lines 519-527).

The main crypto-design improvements are not primitive changes; they are canonicalization controls. The KDF/AAD encoding should include fixed protocol labels and be generated by a shared builder, not handwritten in frontend, CLI, and node code (`ADR §"HKDF info vs AEAD AAD"` lines 174-177). Base64url must be pinned to padded or unpadded form (`ADR §"Cryptographic primitives"` line 170). These are medium-severity interop and future-proofing findings, not reasons to abandon the design.
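As an illustration of pinning base64url to the unpadded form and validating decoded lengths (F4), the sketch below hand-rolls a strict decoder using only the standard library. It is a teaching sketch, not the project's code: a real implementation would more likely use the `base64` crate's `URL_SAFE_NO_PAD` engine, as the audit recommends. The helper names are hypothetical; the 32-byte pubkey, 24-byte nonce, and 16 KB ciphertext limits are the ones the audit lists.

```rust
const PUBKEY_LEN: usize = 32;
const NONCE_LEN: usize = 24;
const MAX_CIPHERTEXT_SIZE: usize = 16 * 1024;

/// Strict base64url decoder: URL-safe alphabet only, no '=' padding,
/// and no non-zero leftover bits (rejects non-canonical encodings).
fn decode_b64url_nopad(s: &str) -> Result<Vec<u8>, String> {
    // An unpadded encoding can never have length 1 modulo 4.
    if s.len() % 4 == 1 {
        return Err("invalid length".to_string());
    }
    let mut out = Vec::with_capacity(s.len() * 3 / 4);
    let mut acc: u32 = 0; // bit accumulator
    let mut bits: u32 = 0; // number of valid bits in `acc`
    for b in s.bytes() {
        let v = match b {
            b'A'..=b'Z' => b - b'A',
            b'a'..=b'z' => b - b'a' + 26,
            b'0'..=b'9' => b - b'0' + 52,
            b'-' => 62,
            b'_' => 63,
            _ => return Err(format!("invalid character {:?}", b as char)),
        };
        acc = (acc << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
    }
    if acc != 0 {
        return Err("non-canonical trailing bits".to_string());
    }
    Ok(out)
}

/// Decode a fixed-size field (pubkey or nonce) and enforce its exact length.
fn decode_exact(s: &str, expected: usize) -> Result<Vec<u8>, String> {
    let bytes = decode_b64url_nopad(s)?;
    if bytes.len() != expected {
        return Err(format!("expected {} bytes, got {}", expected, bytes.len()));
    }
    Ok(bytes)
}

/// Decode ciphertext+tag and enforce the 16 KB upper bound.
fn decode_ciphertext(s: &str) -> Result<Vec<u8>, String> {
    let bytes = decode_b64url_nopad(s)?;
    if bytes.len() > MAX_CIPHERTEXT_SIZE {
        return Err("ciphertext exceeds 16 KB cap".to_string());
    }
    Ok(bytes)
}
```

Rejecting padding and stray bits means each byte string has exactly one accepted encoding, which is what makes byte-level interop fixtures meaningful.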

## Race, DoS, And Queue Review

The first-push-wins plan is correct in shape: `find_one_and_update` with a null-ciphertext guard is the right MongoDB primitive (`ADR §"Race protection"` line 363). The strict sync ACK requirement also matches existing code: `send_credential_update_and_wait` allocates a request ID, registers a oneshot waiter, sends over WSS, and maps timeout/drop/negative ACK to node-failure errors (`backend/src/services/node_ws_manager.rs:1551-1633`).

The DoS/resource gaps are in queue representation and cap placement. The 15-minute queue TTL is not representable in the proposed single-node model (F1). The 16 KB cap should be repeated at all ingress and egress boundaries, not only HTTP POST (F6). The per-pending in-memory rate limiter is acceptable as a local throttle, but in multi-instance deployments it is not a global guarantee; the current node agent already acknowledges multi-instance WSS worker split behavior (`cli/src/node/ws_client.rs:99-101`). If NyxID deploys multiple backend workers, Phase 1 should document whether the rate limiter is best-effort per worker or move it to MongoDB/Redis.
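The "one constant, checked at every boundary" recommendation (F6) could look roughly like the sketch below. The `Boundary` enum, error type, and function names are hypothetical; the boundary list mirrors the one in the audit, and 16 KB is taken as `16 * 1024` bytes.

```rust
/// Single source of truth for the ciphertext cap (16 KB, ciphertext + tag).
const MAX_CIPHERTEXT_SIZE: usize = 16 * 1024;

/// The places the audit says the cap must be enforced.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Boundary {
    HttpDecode,
    MongoUpdate,
    WssFrame,
    QueuedReplay,
    FanOutElement,
}

/// Error that records where the oversized blob was caught.
#[derive(Debug, PartialEq)]
struct OversizeError {
    boundary: Boundary,
    len: usize,
}

/// Call this at every ingress and egress point, always with the same constant.
fn check_ciphertext_size(boundary: Boundary, ciphertext: &[u8]) -> Result<(), OversizeError> {
    if ciphertext.len() > MAX_CIPHERTEXT_SIZE {
        return Err(OversizeError { boundary, len: ciphertext.len() });
    }
    Ok(())
}

/// Fan-out: validate each per-node item; report the index of the first offender.
fn check_fan_out(items: &[Vec<u8>]) -> Result<(), (usize, OversizeError)> {
    for (index, item) in items.iter().enumerate() {
        check_ciphertext_size(Boundary::FanOutElement, item).map_err(|e| (index, e))?;
    }
    Ok(())
}
```

Tagging the error with its boundary makes it easy to write one test per enforcement point, including the queued-replay path the ADR's tests do not currently name.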

## Backward Compatibility Review

The ADR preserves legacy `crypto: None` behavior (`ADR §"Backward compatibility & version detection"` lines 383-403, `ADR §"Accept gate"` lines 416-420). That aligns with the current split between laptop-side `nyxid node-credential push` and node-side `nyxid node credentials accept`: user-side push creates metadata only (`cli/src/commands/node_credential.rs:23-32`), node-side accept prompts for the secret and stores locally (`cli/src/node/agent.rs:466-505`, `cli/src/node/agent.rs:1331-1363`). I did not find a forced migration path in the ADR.

## Error-Code Review

The ADR's 8006-8011 range is non-overlapping with current code. Current node errors occupy 8000-8005 (`backend/src/errors/mod.rs:374-379`), and the ADR correctly reserves the next six slots (`ADR §"Error codes"` lines 486-497). The doc cleanup should update both `AGENTS.md:85` and `CLAUDE.md:85`, not only AGENTS.

## Recommended Gate Before Implementation

Fix F1 before treating the ADR as implementation-ready. F2 should be resolved before Phase 3.5 starts, but it does not block Phase 1 single-node stubs if the fan-out endpoint is explicitly deferred. F3-F8 should be folded into Phase 1/2 acceptance criteria because they are cheapest to fix at protocol-boundary time.

⟦AI:AUTO-LOOP⟧

Error codes 8006-8011 + CryptoBundle/RemoteCryptoState + offline queue lifecycle fields (audit F1) + redacting Debug (CLAUDE §5) + two partial indexes/sweep + node-error source-of-truth sync. design-consensus #872 + independent security audit #822 + two rounds of 3-reviewer review. Refs #773

…ases 1–6 rollup) (#895)

* feat(backend): #773 Phase 1 remote credential injection backend protocol skeleton (#873)

  Error codes 8006-8011 + CryptoBundle/RemoteCryptoState + offline queue lifecycle fields (audit F1) + redacting Debug (CLAUDE §5) + two partial indexes/sweep + node-error source-of-truth sync. design-consensus #872 + independent security audit #822 + two rounds of 3-reviewer review. Refs #773

* feat(crypto): #773 Phase 2 node-side RCI crypto + nyxid-crypto crate (#877)

  CLI/node-owned nyxid-crypto (X25519/HKDF/XChaCha20) + node sealed-key lifecycle + CI dependency-graph invariant proving the backend has zero crypto dependencies (backend LOC=0). design-consensus #874 (r4) + independent verify + 2 rounds of 3-reviewer review. For backend WS routing see Phase 2.5 #876. Refs #773

* feat(backend): #773 Phase 2.5 RCI backend WS frame routing (non-crypto) (#879)

  existing-service glue; backend has zero crypto dependencies (CI-enforced); async 202; GET pubkey/POST ciphertext/WS frame/reconnect drain; removes DecryptedPendingConfirmation; ACL negative + behavior tests. design-consensus #876 r6 (supervisor overrode a mistaken drop verdict). Refs #773

* feat(audit): #773 Phase 4 — RCI metadata-only audit + observability (#881)

  Typed RciAudit* interface; metadata-only audit for all 18 node_credential_rci_* emission branches with read-back coverage; single error-code taxonomy source; backend stays zero-crypto (boundary gate passes); legacy crypto:None unchanged. Closes #878. Reviews: architect comment, tests approve, quality comment (advisory). ⟦AI:AUTO-LOOP⟧

* feat(frontend): #773 Phase 3 — browser WebCrypto RCI push + accept (#882)

  Browser @noble E2E-encrypted remote credential injection: lib/crypto (X25519+HKDF+XChaCha20-Poly1305, base64url no-pad), metadata-only RHF+Zod push forms (remote_crypto literal-true), independent credential-accept page (uncontrolled secret, browser-only encrypt, posts only CiphertextEnvelope), interop fixture mutual-decryptability. return_to open-redirect hardened via shared trusted-origin guard. Backend + nyxid-crypto zero diff. Closes #880. Reviews: architect approve (security re-review) / tests approve / quality approve. ⟦AI:AUTO-LOOP⟧

* feat(node): #773 Phase 3.5 — multi-node credential fan-out (#884)

  Embedded fan_out_nodes + revision-guard atomic N-ciphertext submit; separate offline-aware CredentialFanOutTargetResolver with per-node ACL (no NodeRoute coupling); all-N/partial_decrypted/retry-failed/partial->expired; F6 per-element+aggregate+MAX_FAN_OUT_TARGETS caps; metadata-only fan-out audit via typed rci_audit_service; route-level success + boundary coverage. Backend zero-crypto (boundary exit 0); single-node Phase 3 + legacy crypto:None byte-unchanged. Closes #883. Reviews: architect approve / tests approve / quality comment. ⟦AI:AUTO-LOOP⟧

* feat(security): #773 Phase 4.5 — code-integrity (standalone accept page + SRI + admin verify) (#887)

  Standalone hardened credential-accept page via zero-crypto backend handler (strict CSP + SHA-384 SRI, replaces Phase 3 SPA accept); admin fingerprint-verify UX; releases.json manifest + byte-identical browser/Node fingerprint canonicalization; vite plugin preserves data-nyx-integrity-role + backend 503 fail-closed; release workflow disabled=>warn / enabled+complete=>sign / enabled+incomplete=>fail-fast (no fabricated infra); docs/RELEASE_INTEGRITY.md 6-item runbook + reachability; service-layer policy; Zod-validated runtime-config. Backend zero-crypto; single-node+fan-out(retry)+legacy crypto:None byte-unchanged; integrity_verification audit-only. Closes #885. Reviews: architect approve / tests approve / quality comment. ⟦AI:AUTO-LOOP⟧

* feat(cli): #773 Phase 5 — CLI inject (nyxid node-credential inject) (#891)

  Admin-side full-e2e remote credential injection via the local nyxid binary (T1 code-substitution-immune): push remote_crypto -> poll pubkey -> encrypt locally (reusing the Phase 2 shared RciContext, byte-identical to node decrypt) -> post ciphertext. interactive/--secret-env/--browser modes; strict single-format pubkey fingerprint helper shared with the node-agent console for out-of-band verification; Zeroizing secret never logged; --browser reads no secret; inject --pending metadata-only-inits legacy pendings (no backend crypto); push default unchanged; fan-out inject deferred. Backend stays ZERO crypto (boundary exit 0). Closes #889. Reviews: architect approve / tests approve / quality approve. ⟦AI:AUTO-LOOP⟧

* docs(cli): #773 Phase 6 — RCI discovery hint rewrites (#894)

  Text-only CLI hint rewrites in node-credential + service commands so agents/admins discover the full RCI flow (push metadata-only -> inject/--browser/--secret-env/accept page, out-of-band --verify-fingerprint, org opt-out, error codes 8006-8011) with honest detection-only/T1 framing. Shared RciCliHintLines renderer; tests assert exact merged commands + reject phantom flags. No behavior/flag/crypto change; backend/nyxid-crypto/frontend untouched; boundary exit 0. Closes #892. Reviews: architect approve / tests approve / quality comment. FINAL phase of the #773 remote-credential-injection feature. ⟦AI:AUTO-LOOP⟧

* fix(rci): resolve clippy lints under current stable for #773 rollup (#895)

  CI runs clippy against unpinned latest-stable (rustc 1.96); these lints were clean under the older stable the phase PRs ran with. All in RCI code:

  - node_credential.rs: collapse nested if-let via .filter() (production poll loop)
  - credential_accept.rs: let-else -> ? in SRI script-tag parser (production)
  - node_credential.rs tests: #[allow(clippy::await_holding_lock)] on 4 env_lock single-thread tests (env mutation serialized; held across await by design)
  - node_pending_credential_service.rs tests: &[x.clone()] -> std::slice::from_ref(&x)

  Verified: cargo +stable clippy --workspace --all-targets -- -D warnings (1.96) clean; backend zero-crypto boundary unaffected (test/lint-only + 2 trivial prod rewrites).

* style(rci): cargo fmt the slice::from_ref test call wrapping (#895)

  Pure rustfmt formatting of the 3 insert_fan_out_pending test calls touched by the prior clippy fix; no logic change.

* docs(skill): #773 document remote credential injection in nyxid skill (#895)

  The 8 RCI phases shipped `nyxid node-credential inject` (browser/laptop admin-side end-to-end encryption) but never updated the agent-facing skill.

  - references/nodes.md: new 'Remote credential injection' subsection covering inject (interactive/--secret-env/--browser), out-of-band fingerprint verification, secret-never-on-argv, browser accept page (SRI/signed manifest; detection-not-prevention), web-UI multi-node fan-out, and error codes 8006-8011.
  - SKILL.md: add inject / remote-injection triggers to the nodes reference map row; add an end-to-end-encryption bullet to Security and Privacy.

  Docs only; no version bump (release process owns skill versioning).

kaiweijw marked this pull request as draft May 25, 2026 03:57

kaiweijw added 6 commits May 25, 2026 12:09

docs(adr): add review feedback and best practices to remote credential injection ADR

7738bb1

docs(adr): add browser-based push + auto-accept for remote crypto path

8ddea80

kaiweijw force-pushed the docs/issue-773-remote-credential-injection branch from 67fe444 to 8ddea80 Compare May 25, 2026 04:15

kaiweijw added 6 commits May 25, 2026 14:35

This was referenced Jun 4, 2026

[Design] #773 Phase 1: backend protocol skeleton + error codes (remote credential injection, RCI) #872

Closed

[Feature] Remote credential injection path for org admins (no SSH-to-node required) #773

Closed

kaiweijw mentioned this pull request Jun 4, 2026

feat(backend): #773 Phase 1 remote credential injection backend protocol skeleton (error codes 8006-8011 + model + queue lifecycle) #873

Merged

kaiweijw wants to merge 12 commits into main from docs/issue-773-remote-credential-injection


