Skip to content

docs(adr): remote credential injection design for #773 (eng-reviewed)#822

Draft
kaiweijw wants to merge 12 commits into
mainfrom
docs/issue-773-remote-credential-injection
Draft

docs(adr): remote credential injection design for #773 (eng-reviewed)#822
kaiweijw wants to merge 12 commits into
mainfrom
docs/issue-773-remote-credential-injection

Conversation

@kaiweijw

Copy link
Copy Markdown
Collaborator

Summary

Design document for #773 — letting org admins supply secrets to node-managed services from any device with a browser, with no SSH-to-node required, while strictly preserving the existing "secret never on NyxID server" invariant.

Status: proposed, eng-reviewed via /plan-eng-review with 9 issues raised and resolved, 0 unresolved, 0 critical gaps.

What the ADR specifies

Browser-driven end-to-end encrypted relay:

  • X25519 ephemeral keypairs per pending credential (node-side persisted, sealed by long-lived auth key — survives restarts, dropped on consume)
  • HKDF-SHA256 key derivation with (node_id, pending_credential_id, slug, version) binding context
  • XChaCha20-Poly1305 AEAD with AAD binding to prevent cross-credential replay
  • NyxID is strictly a passthrough relay — never sees plaintext, never derives the shared key
  • Frontend code-integrity (Phase 4.5): SRI hashes on crypto bundles, signed release channel at a separate domain, admin verification UX with fingerprint check
  • Multi-node fan-out (Phase 3.5) for services with fallback_node_ids — same protocol per node, all-N semantics, per-node partial-state breakdown
  • Manual-accept gate is the default — preserves today's separation-of-duties; auto-accept is a second opt-in

Eng review summary

5 architecture issues + 2 code quality + 2 performance, all resolved. Highlights of the resolutions baked into the ADR:

  • Persistent sealed privkey on node (survives restart) instead of in-memory-only
  • T1 ("malicious server") kept broad with Phase 4.5 code-integrity controls instead of narrowing the threat model
  • Per-node opt-in for the new flow; manual-accept default
  • 16 KB ciphertext cap rejected at both HTTP layer and WSS forward
  • supported_features advertisement on WSS handshake (definitive feature detection, not timeout polling)
  • CryptoBundle typed sub-document on NodePendingCredential instead of four loose fields
  • Reserved error codes 8004–8009
  • First-push-wins atomicity + per-pending rate limit (1 successful / 3 failed / 60s window / 5min lockout)
  • Explicit Test Strategy section: 25 tests enumerated by layer, including cross-language interop fixtures

See the doc's "GSTACK REVIEW REPORT" section at the end for the full record.

Implementation phases

Phase Scope Effort (human / CC)
1 Backend protocol stubs + error codes 1d / 2h
2 Node crypto + sealed privkey persistence 3d / 4h
3 Frontend UI + browser crypto 3d / 4h
3.5 Multi-node fan-out (per-node ephemeral pubkeys, all-N semantics) 2d / 4h
4 Audit + observability (metadata-only events) 1d / 2h
4.5 Code-integrity infra: SRI + signed release channel + verification UX 1w / 1d
5 CLI parity (accept-remote subcommand for headless rotation) 2d / 3h
6 Hint rewrites in service show / node-credential push <1d / 30m

Worktree parallelization mapped in the doc — Phase 1 alone first, then 2 + 3 + 4 in parallel lanes, then 3.5, then 4.5 + 5 + 6 in parallel.

What's NOT in scope (for the implementation that follows this ADR)

  • HSM-backed node keypairs
  • Mobile native client (mobile browser works)
  • Replacing the legacy CLI two-party flow (kept additive)
  • Verifying decrypted secrets against downstream services pre-accept

Test plan for this PR

This PR is a single Markdown file. There's no code to test — but the ADR itself is the contract for everything that ships under #773. Reviewer's job: read the doc end-to-end, push back on anything that looks wrong, approve the protocol shape so implementation can start.

  • Threat model T1–T8 is honest and matches what the protocol actually defends
  • Protocol primitives (X25519, HKDF-SHA256, XChaCha20-Poly1305) are appropriate
  • Multi-node fan-out semantics make sense for ChronoAI's actual multi-node setup
  • Phase 4.5 (code-integrity) is realistic given the team's release-pipeline state
  • Test Strategy enumeration is sufficient for implementer to start without ambiguity
  • Open Questions in the doc don't block ADR sign-off (they're implementation-phase questions)

Related

@kaiweijw

Copy link
Copy Markdown
Collaborator Author

Follow-up commit 85b4837 docs(adr): apply Grok review notes — applies the 7 non-blocking review items.

What changed

# Note Resolution
1 Long-lived sealing key cross-ref Replaced "crypto/aes.rs style" with concrete reference to cli/src/node/secret_backend.rs::SecretBackend and its sibling methods (store_auth_token, store_signing_secret, store_credential_value). No ambiguity for the Phase 2 implementer.
2 WSS frame classification New §"WSS frame classification" clarifies the new frames are node-control protocol (same class as node_metrics, proxy_request) and bypass ws_frame_injections rules. Implementer points at node_ws.rs not ws_frame_injector.rs.
3 Standalone credential-accept page New Phase 4.5 subsection: strict CSP, no main SPA bundle, ~5 KB form shim, page HTML hash itself in the signed manifest.
4 releases.nyxid.dev runbook items New table enumerating 6 infra questions to answer pre-Phase-4.5 (GPG key holder, rotation, hosting, manifest pipeline, signature scheme, expiry). Implementer documents chosen answers in docs/RELEASE_INTEGRITY.md.
5 AGENTS.md vs CLAUDE.md Factual error caught. Error code listing actually lives at AGENTS.md:85, not CLAUDE.md. Phase 1 update target corrected.
6 Strip GSTACK REVIEW REPORT Removed from committed ADR. Eng-review audit trail preserved here in the PR (commit 9e4414b body + this thread) and in git history. ADR is now pure design content.
7 Browser crypto floor Sharpened Open Question: sample real admin-population analytics before Phase 3, report X% native / Y% noble-polyfill, consider lazy-loading polyfill on miss.

Eng-review audit trail (moved from the ADR per item 6)

The ADR was eng-reviewed via /plan-eng-review on this branch: 9 issues raised, 9 resolved, 0 unresolved, 0 critical gaps. Findings + decisions are captured in the commit history:

  • 9e4414b docs(adr): remote credential injection design for #773 (eng-reviewed) — initial draft with all 9 resolutions baked in
  • 9b132a7 docs(adr): add review feedback and best practices to remote credential injection ADR — reviewer additions (local sweep, fan-out idempotency, queue limits, MFA gate)
  • d10f7d6 docs(adr): reconcile review-feedback section with phase 3.5 and error codes — internal consistency pass
  • 85b4837 docs(adr): apply Grok review notes to remote credential injection ADR — this commit

3 scope expansions were folded during review: Phase 3.5 (multi-node fan-out), Phase 4.5 (code-integrity infrastructure including the new standalone-page subsection), and an explicit Test Strategy section. Eng review verdict: CLEAR — ready to implement.

Final ADR is 472 lines. Reviewers: read the file end-to-end and push back on anything that looks wrong before implementation starts.

kaiweijw added a commit that referenced this pull request May 22, 2026
…s-model audit

Two background agents (codex/gpt-5.5 and opencode/glm-5.1) independently
reviewed PR #822 and surfaced findings the /plan-eng-review missed.
Cross-model agreement on the P0; the other 13 items are confirmed
against actual code.

P0 — Error code collision (both agents):
  The ADR reserved 8004-8009 for remote credential injection but
  backend/src/errors/mod.rs:374-375 already assigns 8004 to
  NodeCredentialMissing and 8005 to WsProxyDownstream. Implementing
  the ADR as written would have either duplicated public error codes
  or required a breaking renumber of live errors. Shift the RCI range
  to 8006-8011 with the following assignments:
    8006 PendingCredentialDecryptFailed
    8007 PendingCredentialVersionUnsupported
    8008 PendingCredentialCiphertextTooLarge
    8009 PendingCredentialPubkeyAwaiting (renamed from PubkeyNotPosted
         to reflect the polling-with-backoff semantics, not a terminal
         "not posted" state)
    8010 PendingCredentialNodeOffline
    8011 PendingCredentialQueueFull
  Test rows and failure mode table updated to reference the new
  numbers.

P1 — T1 threat-model overclaim (codex):
  G6 and the T1 row claim protection against a fully-compromised
  NyxID server via Phase 4.5. But the standalone HTML, SRI hashes,
  fingerprint display, and verify button are all served by NyxID
  itself. A fully-compromised server can substitute them in lockstep.
  Downgrade T1 wording to make the operational dependency explicit:
  Phase 4.5 provides "detection assuming the admin verifies the
  fingerprint out-of-band against the signed manifest at a separate
  origin." Without admin verification, T1 degrades to T2 in practice.

P2 — supported_features vs per-pending pubkey-readiness contradiction
(codex):
  Section said "No timing guesses, no polling fallback" while error
  8007 said "Frontend polls or falls back." Split into two distinct
  signals: feature detection (sync, cached, via supported_features
  set on the Node record) and per-pending pubkey readiness (async,
  polling with exponential backoff up to 30s, error code 8009
  PendingCredentialPubkeyAwaiting). Feature detection is fast;
  polling is bounded to nodes that have already advertised crypto_v1.

P2 — /ciphertext sync semantics underspecified (codex):
  Sequence diagram showed 4xx on node decrypt failure but didn't
  specify how the HTTP handler waits for the node. Reuse the
  existing send_credential_update_and_wait pattern at
  node_ws_manager.rs:1558 (allocate request_id, register
  oneshot waiter, send WSS frame, await ack with timeout). New
  section "Sync vs async response semantics" specifies the
  status-code mapping for consume / decrypt_failed / timeout / offline.

P2 — Frame name correction (codex):
  Sequence diagram used pending_credential_available (singular, with
  metadata). Actual code at ws_client.rs:433 sends
  pending_credentials_available (plural, no metadata) as a nudge;
  node pulls details via existing GET /api/v1/node-agent/pending-credentials.
  Updated the diagram to reuse the existing nudge/pull pattern with
  pending_credential_pubkey added as a new WSS frame the node posts
  per pending credential lacking a sealed privkey.

P2 — Ciphertext encoding for JSON WSS (opencode):
  ciphertext: Vec<u8> with no encoding spec — JSON WSS frames need
  base64url. Added explicit doc-comment in the CryptoBundle struct
  noting the base64url-string-on-the-wire / BSON-binary-in-MongoDB
  serde-adapter pattern.

P2 — supported_features doesn't exist yet (opencode):
  Phase 1 work; ADR was writing about it as if it already existed.
  Added "(new field, added in Phase 1)" annotation.

P2 — AGENTS.md actual-state correction (opencode):
  The AGENTS.md bullet "Error codes 8000-8003 are reserved" is itself
  out of date. ADR now acknowledges this explicitly and tells
  implementer to update AGENTS.md to reflect 8000-8005 (existing)
  plus 8006-8011 (new RCI range).

P3 cleanups:
  - findAndModify -> find_one_and_update (idiomatic for the Rust driver)
  - srl_* -> sri_* typo in test names (3 occurrences)
  - ChronoAIProject/issues/ -> ChronoAIProject/NyxID/issues/ in
    reference links
  - Per-pending rate limit mechanism specified (in-memory
    DashMap<PendingId, RateLimitState>, TTL-ephemeral, distinct from
    global RATE_LIMIT_PER_SECOND middleware)
  - Standalone HTML weight target loosened from ~5 KB to ~30 KB
    gzipped (more realistic given @noble/curves x25519 alone is ~6 KB
    minified) with lazy-load fallback recommendation when
    SubtleCrypto is available natively
  - Multi-node fan-out: explicit behavior when expires_at fires
    during partial_decrypted (logical credential goes to expired,
    previously-accepted node credentials are NOT rolled back per
    idempotency principle)

No protocol changes. No threat-model changes (only one wording
clarification for T1's operational dependency). All edits are
internal-consistency, encoding/transport spec sharpening, and
naming/reference corrections.

Calibration note: /plan-eng-review missed the P0 because it trusted
the AGENTS.md "8000-8003 reserved" line at face value instead of
grepping backend/src/errors/mod.rs directly. Cross-model independent
review with code-level verification caught it. Worth adopting as a
default check before locking error-code ranges in future ADRs.
@kaiweijw

Copy link
Copy Markdown
Collaborator Author

Follow-up commit 67fe444 docs(adr): fix P0 error code collision + 13 review findings from cross-model audit — applies findings from an independent cross-model review.

How this got caught

Two background agents (codex/gpt-5.5 xhigh, opencode/glm-5.1) ran review passes against this PR independently. Both surfaced the same P0 plus a long tail of P2/P3 items that my earlier /plan-eng-review missed. Cross-model agreement on the P0 was strong signal worth acting on immediately.

P0 — Error code collision

The ADR reserved 8004-8009 for remote credential injection errors. Verified against backend/src/errors/mod.rs:374-375:

NodeCredentialMissing => 8004  ← already in use
WsProxyDownstream     => 8005  ← already in use

Both AGENTS.md:85 ("Error codes 8000-8003 are reserved") and this ADR were wrong about the range being free. Shifted the RCI range to 8006-8011 with the same 6 names (also renamed PendingCredentialPubkeyNotPostedPendingCredentialPubkeyAwaiting to reflect the polling-with-backoff semantics from finding 3). ADR now explicitly documents AGENTS.md's stale state so Phase 1 fixes both.

Other findings (all confirmed against code, fixed)

# Finding Severity Source
1 Error code collision (above) P0 both agents
2 T1 threat model overclaims — NyxID serves the SRI/HTML/verify UX so a malicious server can substitute it all in lockstep; Phase 4.5 only helps if admin verifies out-of-band P1 codex
3 supported_features "no polling fallback" contradicts error 8007's "polls or falls back" P2 codex
4 /ciphertext sync semantics underspecified — should reuse send_credential_update_and_wait pattern from node_ws_manager.rs:1558 P2 codex
5 Frame name: ADR used pending_credential_available singular with metadata; actual code at ws_client.rs:433 uses pending_credentials_available plural with no metadata + HTTP pull P2 codex
6 ciphertext: Vec<u8> traversing JSON WSS frames needs explicit base64url encoding spec P2 opencode
7 supported_features field doesn't exist yet — Phase 1 work, but ADR wrote about it as present P2 opencode
8 AGENTS.md line-85 text itself is stale (8004-8005 exist beyond the documented 8000-8003 range); ADR's reference is also fragile P2 opencode
9 findAndModify is the MongoDB concept; the Rust driver method is find_one_and_update P3 opencode
10 srl_* typo for sri_* in test names (3 occurrences) P3 codex
11 Reference links ChronoAIProject/issues/ should be ChronoAIProject/NyxID/issues/ P3 codex
12 Per-pending rate limit mechanism not specified P3 opencode
13 Standalone HTML 5 KB target is unrealistic (@noble/curves x25519 alone is ~6 KB); loosen to ~30 KB gzipped with lazy-load fallback P3 opencode
14 Multi-node fan-out has no expiry story when partial_decrypted hits expires_at P3 opencode

Calibration

/plan-eng-review missed the P0 because it accepted AGENTS.md's "8000-8003 reserved" text without grepping the actual error-code source-of-truth in backend/src/errors/mod.rs. Both background agents caught it by reading the code first. Worth adopting "grep the code, not the docs, before locking error-code ranges" as a default check in future ADR reviews.

Diff stats

  • 1 file changed, +64/-31 lines
  • ADR is now 505 lines (was 472 pre-fix)
  • No protocol changes, no threat-model changes (only T1 wording clarification)

@kaiweijw kaiweijw marked this pull request as draft May 25, 2026 03:57
kaiweijw added 6 commits May 25, 2026 12:09
Design document for #773 (org-admin remote credential injection without
SSH-to-node). Status: proposed, eng-reviewed via /plan-eng-review with
9 issues raised and resolved, 0 unresolved, 0 critical gaps.

Protocol shape: browser-driven end-to-end encrypted relay using X25519
ECDH + HKDF-SHA256 + XChaCha20-Poly1305. NyxID server is strictly a
passthrough — never derives keys, never sees plaintext. Node generates
an ephemeral X25519 keypair per pending credential, sealed at rest by
the node's long-lived auth key (survives restart, dropped on consume).

Eng review folded three scope expansions into the design:
- Multi-node fan-out with all-N semantics (Phase 3.5)
- Frontend code-integrity infrastructure: SRI hashes, signed release
  channel, admin verification UX (Phase 4.5)
- Explicit Test Strategy section enumerating 25 mandatory tests by
  layer (node crypto, backend endpoints, frontend, cross-language
  interop fixtures, e2e, audit, regression, eval)

Hardened design decisions captured:
- Persistent sealed privkey on node (survives restart) instead of
  in-memory-only
- Manual-accept gate is the default; auto-accept requires a second
  opt-in (preserves separation-of-duties for existing orgs)
- First-push-wins atomicity + per-pending rate limit
  (1 successful POST, 3 failed in 60s, 5min lockout)
- 16 KB ciphertext cap, error codes 8004-8009 reserved
- CryptoBundle as a typed sub-document on NodePendingCredential
- `supported_features` set on Node for version detection

Closes the design phase of #773. Implementation tracked in subsequent
PRs against the phases enumerated in the doc.

Tracking: #769.
… codes

The "Design Review Feedback & Best Practices" section added in 9b132a7
introduced refinements that didn't fully align with the existing spec
elsewhere in the document. Reconcile:

- Phase 3.5 (Multi-node fan-out) step 6 previously said "Decrypt failure
  on any one node bricks the logical credential" — which directly
  contradicted Best Practice §2's retry-only-failed-nodes semantics.
  Rewrite steps 5 and 6 to describe the retry path explicitly:
  partial_decrypted is a recoverable state, frontend re-runs Phase 3
  against the failed subset, consumed is reached when all N have
  accepted.

- The test row e2e_multi_node_partial_failure was renamed to
  e2e_multi_node_partial_failure_then_retry and updated to assert the
  recovery path including idempotency of previously-accepted nodes.

- Error code 8009 (previously "(reserved)") is now formally assigned to
  PendingCredentialQueueFull — the natural use case for the 5-per-node
  ciphertext queue cap described in Best Practice §3.

- Best Practice §1: replace the hardcoded "1 hour" TTL with a reference
  to NodePendingCredential.expires_at so the local sweep stays aligned
  with the server-side TTL if the server default ever changes.

- Best Practice §4: pin the mechanism to MFA confirmation explicitly,
  since NyxID already has the /auth/mfa/verify endpoint. Note why
  multi-admin approval is not used (no existing approval-queue
  subsystem; would be a separate project).

All five edits are internal-consistency / clarification only — no
protocol or threat-model changes.
Seven non-blocking review notes from Grok addressed:

1. Privkey sealing — replace the vague "crypto/aes.rs style" reference
   with a concrete cross-reference to the existing
   cli/src/node/secret_backend.rs::SecretBackend trait, naming the
   sibling methods (store_auth_token, store_signing_secret,
   store_credential_value) so the Phase 2 implementer has no ambiguity
   about which mechanism to reuse.

2. WSS frame classification — add a one-paragraph clarification that
   the new pending_credential_pubkey / pending_credential_ciphertext
   frames are node-control protocol traffic (same class as
   node_metrics, proxy_request, etc.) and bypass the
   ws_frame_injections rules on DownstreamService / UserService.
   Implementation note points to node_ws.rs vs ws_frame_injector.rs.

3. Standalone credential-accept page — add a new subsection to
   Phase 4.5 specifying the page is served as minimal standalone HTML
   (strict CSP, no main SPA bundle, ~5 KB form shim) to reduce the
   substitution surface, and that the page's HTML hash is itself part
   of the signed release manifest.

4. Phase 4.5 runbook items — new table enumerating the infra
   questions that must be answered before the phase starts: GPG signing
   key holder, rotation procedure, releases.nyxid.dev hosting,
   manifest publish pipeline, signature scheme, expiry. Implementer
   documents the chosen answers in docs/RELEASE_INTEGRITY.md.

5. AGENTS.md vs CLAUDE.md correction — Grok caught a factual error:
   the error-code listing lives at AGENTS.md:85, not in CLAUDE.md.
   Phase 1 update target corrected.

6. Strip GSTACK REVIEW REPORT table — remove the process artifact from
   the committed ADR (eng-review summary moves to PR description and
   commit history). Keeps the ADR a pure design doc for future
   readers.

7. Browser crypto floor — sharpen the open question to call for real
   admin-population analytics before Phase 3, with a concrete metric
   format and a lazy-load suggestion if SubtleCrypto coverage is
   overwhelmingly high.

No protocol or threat-model changes. All edits are clarifications,
operational hardening, and doc hygiene.
…s-model audit

Two background agents (codex/gpt-5.5 and opencode/glm-5.1) independently
reviewed PR #822 and surfaced findings the /plan-eng-review missed.
Cross-model agreement on the P0; the other 13 items are confirmed
against actual code.

P0 — Error code collision (both agents):
  The ADR reserved 8004-8009 for remote credential injection but
  backend/src/errors/mod.rs:374-375 already assigns 8004 to
  NodeCredentialMissing and 8005 to WsProxyDownstream. Implementing
  the ADR as written would have either duplicated public error codes
  or required a breaking renumber of live errors. Shift the RCI range
  to 8006-8011 with the following assignments:
    8006 PendingCredentialDecryptFailed
    8007 PendingCredentialVersionUnsupported
    8008 PendingCredentialCiphertextTooLarge
    8009 PendingCredentialPubkeyAwaiting (renamed from PubkeyNotPosted
         to reflect the polling-with-backoff semantics, not a terminal
         "not posted" state)
    8010 PendingCredentialNodeOffline
    8011 PendingCredentialQueueFull
  Test rows and failure mode table updated to reference the new
  numbers.

P1 — T1 threat-model overclaim (codex):
  G6 and the T1 row claim protection against a fully-compromised
  NyxID server via Phase 4.5. But the standalone HTML, SRI hashes,
  fingerprint display, and verify button are all served by NyxID
  itself. A fully-compromised server can substitute them in lockstep.
  Downgrade T1 wording to make the operational dependency explicit:
  Phase 4.5 provides "detection assuming the admin verifies the
  fingerprint out-of-band against the signed manifest at a separate
  origin." Without admin verification, T1 degrades to T2 in practice.

P2 — supported_features vs per-pending pubkey-readiness contradiction
(codex):
  Section said "No timing guesses, no polling fallback" while error
  8007 said "Frontend polls or falls back." Split into two distinct
  signals: feature detection (sync, cached, via supported_features
  set on the Node record) and per-pending pubkey readiness (async,
  polling with exponential backoff up to 30s, error code 8009
  PendingCredentialPubkeyAwaiting). Feature detection is fast;
  polling is bounded to nodes that have already advertised crypto_v1.

P2 — /ciphertext sync semantics underspecified (codex):
  Sequence diagram showed 4xx on node decrypt failure but didn't
  specify how the HTTP handler waits for the node. Reuse the
  existing send_credential_update_and_wait pattern at
  node_ws_manager.rs:1558 (allocate request_id, register
  oneshot waiter, send WSS frame, await ack with timeout). New
  section "Sync vs async response semantics" specifies the
  status-code mapping for consume / decrypt_failed / timeout / offline.

P2 — Frame name correction (codex):
  Sequence diagram used pending_credential_available (singular, with
  metadata). Actual code at ws_client.rs:433 sends
  pending_credentials_available (plural, no metadata) as a nudge;
  node pulls details via existing GET /api/v1/node-agent/pending-credentials.
  Updated the diagram to reuse the existing nudge/pull pattern with
  pending_credential_pubkey added as a new WSS frame the node posts
  per pending credential lacking a sealed privkey.

P2 — Ciphertext encoding for JSON WSS (opencode):
  ciphertext: Vec<u8> with no encoding spec — JSON WSS frames need
  base64url. Added explicit doc-comment in the CryptoBundle struct
  noting the base64url-string-on-the-wire / BSON-binary-in-MongoDB
  serde-adapter pattern.

P2 — supported_features doesn't exist yet (opencode):
  Phase 1 work; ADR was writing about it as if it already existed.
  Added "(new field, added in Phase 1)" annotation.

P2 — AGENTS.md actual-state correction (opencode):
  The AGENTS.md bullet "Error codes 8000-8003 are reserved" is itself
  out of date. ADR now acknowledges this explicitly and tells
  implementer to update AGENTS.md to reflect 8000-8005 (existing)
  plus 8006-8011 (new RCI range).

P3 cleanups:
  - findAndModify -> find_one_and_update (idiomatic for the Rust driver)
  - srl_* -> sri_* typo in test names (3 occurrences)
  - ChronoAIProject/issues/ -> ChronoAIProject/NyxID/issues/ in
    reference links
  - Per-pending rate limit mechanism specified (in-memory
    DashMap<PendingId, RateLimitState>, TTL-ephemeral, distinct from
    global RATE_LIMIT_PER_SECOND middleware)
  - Standalone HTML weight target loosened from ~5 KB to ~30 KB
    gzipped (more realistic given @noble/curves x25519 alone is ~6 KB
    minified) with lazy-load fallback recommendation when
    SubtleCrypto is available natively
  - Multi-node fan-out: explicit behavior when expires_at fires
    during partial_decrypted (logical credential goes to expired,
    previously-accepted node credentials are NOT rolled back per
    idempotency principle)

No protocol changes. No threat-model changes (only one wording
clarification for T1's operational dependency). All edits are
internal-consistency, encoding/transport spec sharpening, and
naming/reference corrections.

Calibration note: /plan-eng-review missed the P0 because it trusted
the AGENTS.md "8000-8003 reserved" line at face value instead of
grepping backend/src/errors/mod.rs directly. Cross-model independent
review with code-level verification caught it. Worth adopting as a
default check before locking error-code ranges in future ADRs.
@kaiweijw kaiweijw force-pushed the docs/issue-773-remote-credential-injection branch from 67fe444 to 8ddea80 Compare May 25, 2026 04:15
kaiweijw added 6 commits May 25, 2026 14:35
…+ Grok)

Three agents independently reviewed PR #822. Grok gave LGTM after
exhaustive code cross-referencing; Codex and Opencode surfaced 14
specific items (2 P1, 5 P2, 5 P3, 2 informational). All addressed:

P1 — Status field missing on model (Codex):
  CryptoBundle had no state tracking for the lifecycle the protocol
  describes. Add RemoteCryptoState enum (PubkeyPosted, CiphertextReceived,
  CiphertextQueued, Consumed, DecryptedPendingConfirmation, DecryptFailed,
  Expired) plus remote_state field on NodePendingCredential. For multi-node
  fan-out, add FanOutNodeState subdocument with per-node crypto + state.

P1 — Offline handling contradicts sync POST (Codex):
  Error code 8010 said "queued" but sync section said 503. Resolved by
  supporting both: node online = sync wait (200/4xx/504 timeout); node
  offline at POST time = store ciphertext with 15-min queue TTL, return
  202 with remote_state CiphertextQueued, forward on reconnect. The two
  paths are distinguished because the server knows node connectivity
  before attempting the WSS send.

P1 — T1 + auto-accept tension (Opencode):
  Auto-accept means admin doesn't pause to verify fingerprints. Added
  security tradeoff note and recommendation for strict orgs.

P2 — Feature flag visibility (Codex):
  Frontend checked crypto_v1 but couldn't tell if
  enable_remote_credential_injection was actually true. Now the
  detection section specifies three UI states: crypto-capable + enabled,
  crypto-capable + disabled (with enablement hint), and legacy.

P2 — G6 + Phase 4.5 overclaim (Codex):
  G6 now says "Detection of malicious code-substitution... assuming
  admin independently verifies." Phase 4.5 intro aligned to match.

P2 — 202 response novelty note (Opencode):
  Added RFC 7231 reference + explicit frontend handling note.

P2 — Auto-accept migration warning (Opencode):
  One-time banner UX when feature first enabled.

P3 — Primitives table encoding (Codex):
  "raw bytes" → "base64url" for ciphertext in JSON transports.

P3 — WSS classification frame name (Codex):
  Singular pending_credential_available → plural pending_credentials_available.

P3 — Failure table pubkey-awaiting (Codex):
  "legacy node" → "crypto_v1 node, async pubkey delay" with correct
  error code 8009 and polling semantics.

P3 — Queue TTL vs metadata TTL (Opencode):
  Added one-sentence clarification of independence.

P3 — Phase 3 effort estimate (Opencode):
  Bumped from 3d/4h to 4d/6h given expanded push form scope.

Plus 4 new tests: RemoteCryptoState enum serde roundtrip, offline node
returns 202 queued, queued ciphertext forwarded on reconnect, queued
ciphertext expired after TTL.

ADR is now 601 lines. Grok's LGTM + Codex/Opencode's corrections
together give strong cross-model confidence in the design.
The CLI should support the entire remote credential injection flow
natively — not just "push intent then switch to browser." The admin's
locally-installed binary is immune to T1 code-substitution by design
(no NyxID-served JS to substitute), making it the strongest trust
anchor for security-conscious orgs.

New `nyxid node-credential inject` subcommand:
  - Interactive mode: push + poll pubkey + prompt secret + encrypt +
    post ciphertext + wait for result — one command, full flow
  - Non-interactive mode: --secret-env reads from env var for CI /
    cron / rotation scripts
  - Org support: --org flag follows existing UUID/slug/name convention
  - Fallback: existing `push` command now prints BOTH browser URL and
    `inject --pending <id>` CLI command

Shared crypto crate: Phase 2 (node decrypt) and Phase 5 (CLI encrypt)
use the same x25519-dalek + chacha20poly1305 + hkdf + sha2 stack.
Extract into nyxid-crypto workspace crate (~400 LOC) following the
existing nyxid-cloud-auth precedent. Exposes encrypt() and decrypt()
with AAD context binding. Browser @noble/* and Rust crate must produce
identical output — interop fixtures verify this.

Phase 5 effort bumped from 2d/3h to 3d/4h to reflect the full inject
command + shared crate extraction. Test strategy adds 10 new tests:
5 for the shared crate, 5 for the inject command, plus 4 CLI-specific
e2e tests covering interactive, --secret-env, --org, and the push
fallback that prints both paths.

Security note: for orgs that care about T1, the CLI path is the
recommended approach — no SRI/fingerprint verification needed because
the binary itself is the trust anchor.
When an admin works alongside an AI coding agent (Claude Code, Codex,
OpenClaw, etc.), the agent has full terminal visibility. A masked
"Enter secret value:" prompt in interactive mode would still expose
the secret through the terminal session the agent can read.

New --browser flag on nyxid node-credential inject:
  1. CLI creates the pending credential via the API
  2. Opens the default browser to the standalone accept page
  3. Admin enters the secret in the browser (agent cannot see it)
  4. Browser encrypts + submits (Phase 3 e2e crypto)
  5. CLI polls pending state until consumed/failed/expired
  6. CLI prints result — secret never appears in terminal transcript

Follows the existing CLI wizard pattern from CLI_WIZARD_V3.md, already
used for OAuth device-code flows.

The inject command now has three modes:
  - Interactive (default): terminal prompt, direct human use
  - --secret-env VAR: env var, CI / automation
  - --browser: wizard, AI-agent-safe

The push fallback now prints all three continuation paths (browser URL,
inject --pending, inject --pending --browser) so the admin can pick.

Added 3 CLI tests (wizard opens URL + polls, secret not in transcript,
push prints all paths) and 2 e2e tests (browser wizard flow, push
prints all paths).
Round 3 review by Codex (gpt-5.5), Opencode (glm-5.1), and Grok
(grok-build). Grok gave LGTM again after delta review. Codex found
1 P1 + 2 P2 + 3 P3. Opencode found 2 P2 + 2 P3. All 9 addressed:

P1 — CLI T1 claim overstated (Codex):
  CLI is immune to code-substitution but NOT to data-substitution.
  The node pubkey fetched from NyxID can be replaced by a compromised
  server (classic MITM on the key exchange). AEAD AAD binds metadata
  but not the pubkey's origin. Downgrade the claim to "immune to T1
  code-substitution; still vulnerable to T1 data-substitution like
  the browser path." Add an operational note about out-of-band pubkey
  fingerprint verification and a future-work pointer to node-signed
  pubkeys via TOFU-pinned Ed25519 identity.

P2 — AAD too narrow (Codex):
  AAD now includes injection_method, field_name, and target_url in
  addition to the existing node_id, pending_id, slug, version. T5
  threat row updated to reflect metadata-tampering protection. A
  malicious relay that alters where the secret goes after submission
  now causes an AEAD tag failure on the node.

P2 — PartialDecrypted missing from enum (Codex):
  Added to RemoteCryptoState for the multi-node fan-out aggregate
  state. Per-node states are in FanOutNodeState; the logical pending
  now has an explicit state for "some nodes accepted, some failed."

P2 — 504 CiphertextReceived polling guidance (Opencode):
  After a sync-wait timeout (504), the admin should poll with
  exponential backoff watching for state transitions. Surface a
  "node may be unresponsive" warning after 2 minutes.

P2 — CiphertextQueued in state diagram (Opencode):
  State transition diagram now shows the offline → queued → reconnect
  → consumed/expired branch alongside the online path.

P3 — Consumed state tightened (Codex):
  Comment now says "terminal success: decrypted AND stored." The
  intermediate operator-confirm state is DecryptedPendingConfirmation,
  not Consumed. Prevents frontend/cleanup confusion.

P3 — Interop test wording (Codex):
  "identical ciphertext" → "mutually decryptable." With random nonces
  identical plaintext will not produce identical ciphertext (by
  design). Interop fixtures use fixed test vectors (pinned keys +
  nonce) for byte-level verification.

P3 — node config set command (Codex):
  The ADR referenced a `nyxid node config set` CLI subcommand that
  doesn't exist. Updated to point at the web UI node settings page
  and PATCH /api/v1/nodes/{id} as the enablement paths, with a note
  that Phase 1 should add the CLI subcommand or the API + web UI are
  the only paths.

P3 — Feature-gate decrypt() in nyxid-crypto (Opencode):
  Backend depends on nyxid-crypto for CiphertextEnvelope + validation
  only. decrypt() is behind cfg(feature = "decrypt") so the backend
  binary cannot call it. Only nyxid-cli enables the feature.
…routes

Round 4 cross-model review (Codex + Opencode + Grok). Grok LGTM again.
Codex and Opencode found 6 text-consistency items (no protocol changes):

1. Top-level context + G2 now explicitly scope the confidentiality
   claim to T2 (passive compromise) and reference the T1 pubkey-MITM
   caveat. No more "strictly preserves" language without qualification.

2. T1 data-substitution caveat now lists the full 7-field AAD (was
   listing only 4 — a stale reference from before the AAD expansion).

3. New section documenting why HKDF info (4 fields) and AEAD AAD
   (7 fields) intentionally differ: HKDF for domain separation, AAD
   for integrity. Includes the canonical wire format (length-prefixed
   u16_be fields) and the Option<String> sentinel convention (None →
   empty string → u16_be(0)) to prevent downgrade attacks.

4. require_operator_confirm_for_remote reframed as separation-of-duties
   control, not T1 confidentiality defense. The gate runs after secret
   submission — too late to prevent pubkey-MITM exfiltration.

5. API routes unified: CLI section now uses
   /nodes/{node_id}/credentials/pending/{pending_id} (matching the
   sequence diagram and existing routes.rs at lines 545-546), not the
   shorter /credentials/pending/{id} which doesn't exist.

6. PartialDecrypted documented as fan-out aggregate-only state with
   explicit MUST NOT appear note for FanOutNodeState.remote_state.
   State diagram note added.
…ended in v1'

Codex repeatedly flagged (rounds 3-5) that the threat table presented
T1 and T3 as "defended" while the protocol text admitted the pubkey-
substitution MITM defeats both. Opencode and Grok said ship, but
Codex's structural point is valid: a security reviewer reads the
threat table first, and "defended" there contradicts the caveats
deeper in the doc.

Split the table into two sections:

"Adversaries defended in v1" — T2, T4, T5, T6, T7, T8. These are
fully defended by the e2e crypto, AAD binding, atomicity, forward
secrecy, rate limiting, and sealed privkey persistence. T2 (passive
read) is now explicitly marked as the primary confidentiality
guarantee.

"Acknowledged threats — NOT defended in v1" — T1 and T3. Both share
the same key-exchange MITM weakness. Each row now explains WHY it's
not defended (pubkey substitution), what mitigation exists TODAY
(out-of-band fingerprint verification), and what the future fix is
(node-signed ephemeral pubkeys via TOFU-pinned Ed25519 identity).

G6 narrowed from "detection of malicious code-substitution" to
explicitly "JS code-substitution only, not pubkey substitution."
@kaiweijw

kaiweijw commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

📊 当前状态 — ADR 独立安全审计已派(不需要人介入)

维度
对象 本 PR 的 ADR(远程凭证注入 E2E 加密协议)
动作 一个独立 auditor codex 正在读 ADR + 对照真实 NyxID 代码,审 威胁模型 / 原语 / AAD 绑定 / key 生命周期 / "server 永不见明文" 不变量 / 向后兼容 / 错误码完整性
产出 审计结论将作为评论回贴本 PR
说明 本 PR 为纯文档 ADR;合入 main 仍由 maintainer 决定(本循环不自动合 main)
是否需要人介入 ❌ 否(等审计结论)

🤖 controller status banner

⟦AI:AUTO-LOOP⟧

@kaiweijw

kaiweijw commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 ADR #822 Remote Credential Injection 独立安全/设计审计

TL;DR

  • 是什么:我以只读方式审阅了 PR docs(adr): remote credential injection design for #773 (eng-reviewed) #822 的 ADR、issue [Feature] Remote credential injection path for org admins (no SSH-to-node required) #773 / PR 元数据,以及当前 codebase 中 NodePendingCredential、pending credential handlers、node WSS、SecretBackend、error-code 路径;当前 pending model 仍是 metadata-only,没有 secret 字段(backend/src/models/node_pending_credential.rs:26-47)。
  • 结论:ADR 对 T2/passive compromise 的核心机密性边界基本成立,且已诚实地把 T1/T3 标为 v1 不防御(ADR §"Threat model" lines 51-56)。但我发现 10 个问题,其中 1 个 blocking:offline ciphertext queue 的 15 分钟 TTL 在单节点数据模型里没有持久化字段可执行(ADR §"Data model" lines 181-255;ADR §"Design Review Feedback & Best Practices" lines 707-709)。
  • 下一步:先修 F1;F2 至少要在 Phase 3.5 前修;F3-F8 建议并入 Phase 1/2 的验收条件(证据和建议见下表)。
ID severity 证据 建议
F1 blocking ADR 要求 queued ciphertext 使用独立 15 分钟 TTL,但 CryptoBundle / single-node NodePendingCredential 没有 ciphertext_queued_atciphertext_expires_at 字段;当前 model 也没有可用字段(ADR §"Data model" lines 181-255;ADR §"Sync vs async response semantics" line 377;backend/src/models/node_pending_credential.rs:26-47)。 增加 queue timestamp / expiry 字段、索引和 sweeper 规则;fan-out per-node 状态也要同样可执行。
F2 high ADR 说 multi-node fan-out 面向 fallback_node_ids,但当前 NodeRoute 只返回 viable online / WS-connected nodes(ADR §"Multi-node fan-out" lines 349-359;backend/src/services/node_routing_service.rs:21-24:154-165)。 不要把 proxy failover 的 NodeRoute 当作 fan-out 目标集;定义 all intended active bindings resolver,再对每个 node 做 online delivery 或 offline queue。
F3 medium ADR 写 pending IDs 约 256 bits entropy;当前 pending ID 是 Uuid::new_v4(),约 122 random bits(ADR §"Threat model" line 48;backend/src/services/node_pending_credential_service.rs:60-62)。 改文档为 UUID v4 约 122 bits,或改协议为 32-byte random pending IDs。
F4 medium ADR 只写 base64url,没有定义 padded / no-pad;当前 codebase 同时存在 URL_SAFE_NO_PAD 和 STANDARD base64 用法(ADR §"Cryptographic primitives" line 170;cli/src/auth.rs:7:47backend/src/handlers/node_ws.rs:149)。 明确 RCI JSON 字段使用 URL_SAFE_NO_PAD,或定义接受规则和 canonical encoder;同时验证 decoded lengths。
F5 medium AAD 绑定 injection_method,但 canonical token 未定义;pending API 使用 query-param / path-prefix,live credential_update path 使用 query_param / path_prefix(ADR §"HKDF info vs AEAD AAD" lines 174-177;cli/src/cli.rs:1555-1561cli/src/node/ws_client.rs:3036-3074)。 在 shared protocol / crypto crate 中提供 RciAadContext builder;增加固定 domain labels,并固定 injection method canonical form。
F6 medium 16 KB cap 写在 POST / model / tests 上,但 WSS forward、queued replay、fan-out aggregate body 的边界校验没有明确写入 ADR(ADR §"Data model" line 194;ADR §"Error codes" line 494;backend/src/services/node_ws_manager.rs:1537-1540)。 一个 MAX_CIPHERTEXT_SIZE 用于 HTTP decode、Mongo update、WSS frame construction、queued replay、fan-out per-node item。
F7 medium ADR 要求 consume/decline/cancel/expire 都驱逐 sealed privkey,但 cancel control-plane frame 未明确;当前 admin cancel 只更新 DB 并写 audit(ADR §"Test Strategy" lines 514-517;backend/src/handlers/node_admin.rs:637-664)。 明确 pending_credential_canceled / expired frame,或定义 polling diff cleanup;local sweep 只能作为 fallback。
F8 medium operator-confirm path 会在 node 内存里保存 decrypted secret,但 ADR 只写 “volatile ready-to-accept queue with TTL”,没有 TTL 长度、zeroization、cancel 行为(ADR §"Flow" lines 332-335;ADR §"Accept gate" lines 418-420)。 TTL 不超过 expires_at,plaintext buffer 用 Zeroizing,所有 drop/cancel/expire 路径显式 zeroize。
F9 low SRI 小节单独阅读时有过度表述:fully compromised server 可以同时替换 HTML + SRI;ADR 其他小节已正确要求 separate signed manifest + out-of-band admin verification(ADR §"Code-integrity infrastructure" line 430;ADR §"SRI hashes" line 443;ADR §"Signed release channel" lines 447-452)。 把 SRI 段落改成 “HTML 也被替换时,T1 detection 依赖 signed manifest comparison”。
F10 nit ADR 只点名 AGENTS.md stale,但 CLAUDE.md 同一条也 stale;code 已占用 8000-8005(ADR §"Error codes" lines 488-499;backend/src/errors/mod.rs:374-379AGENTS.md:85CLAUDE.md:85)。 Phase 1 同时更新两个文档,或明确一个为生成/唯一 authoritative source。
原始审计 artifact(English)
# ADR PR #822 Remote Credential Injection - Independent Security/Design Audit

Audit date: 2026-06-04  
Scope: ADR from `git show origin/docs/issue-773-remote-credential-injection:docs/REMOTE_CREDENTIAL_INJECTION.md`, issue #773, PR #822, and current NyxID checkout.  
Mode: read-only source audit; no source code changes.

## Executive Verdict

The ADR is directionally sound for its main confidentiality invariant under the stated passive-compromise model: the remote path places encryption in the admin browser/CLI and decryption on the node, while the server stores/forwards `admin_pubkey`, `nonce`, and ciphertext only (`ADR §"Goals"` lines 23-27, `ADR §"Data model"` lines 181-203, `ADR §"Shared crypto crate"` line 620). The document is also materially more honest than earlier versions: T1 and T3 are explicitly moved to "NOT defended in v1" because the server can substitute the node pubkey (`ADR §"Threat model"` lines 51-56, `ADR §"T1 data-substitution caveat"` line 97).

I found no design path where NyxID intentionally receives plaintext or derives the shared key under T2, assuming the backend keeps the `decrypt` feature disabled as specified (`ADR §"Shared crypto crate"` line 620). The current code also supports the baseline claim that today's pending credential flow stores metadata only: `NodePendingCredential` has no secret field (`backend/src/models/node_pending_credential.rs:26-47`), admin push logs only metadata (`backend/src/handlers/node_admin.rs:589-599`), and node consume/decline audit events are metadata-only (`backend/src/handlers/node_agent.rs:93-107`, `backend/src/handlers/node_agent.rs:127-145`).

However, I do not consider the ADR fully implementation-ready until the offline queue TTL is represented in the data model. The ADR requires a 15-minute ciphertext queue TTL independent from `NodePendingCredential.expires_at`, but the proposed model has no single-node queued-at / queued-expires-at field to enforce that requirement (`ADR §"Data model"` lines 181-255, `ADR §"Sync vs async response semantics"` line 377, `ADR §"Design Review Feedback & Best Practices"` lines 707-709).

## Findings Table

| ID | Severity | Finding | Evidence | Recommendation |
|---|---|---|---|---|
| F1 | blocking | Offline ciphertext queue TTL is required but not representable in the proposed single-node data model. The ADR says queued ciphertext uses a shortened 15-minute TTL independent from metadata `expires_at`, but `CryptoBundle` has only `version`, pubkeys, nonce, and ciphertext, while `NodePendingCredential` only adds `crypto`, `remote_state`, and optional `fan_out_nodes`. The current model likewise has no `updated_at` or queue-expiry field. Without a persisted queue timestamp/expiry, a sweeper cannot distinguish "queued 15 minutes ago" from "pending metadata still valid for 1 hour." | `ADR §"Data model"` lines 181-203 and 237-255; `ADR §"Sync vs async response semantics"` line 377; `ADR §"Design Review Feedback & Best Practices"` lines 707-709; `backend/src/models/node_pending_credential.rs:26-47` | Add explicit queue lifecycle fields, e.g. `ciphertext_queued_at` and `ciphertext_expires_at` on single-node pending credentials and per fan-out node, using the required BSON chrono serde helpers. Define the cleanup query/index and reconnect-forward rule against those fields. |
| F2 | high | Multi-node fan-out target selection is underspecified and conflicts with current `NodeRoute` semantics if implementers reuse `fallback_node_ids`. The ADR says fan-out applies to services with `fallback_node_ids`, but the current routing service only returns viable online/WS-connected nodes as primary/fallbacks. That is correct for proxy failover, but wrong as a credential fan-out source because offline nodes are exactly the nodes that need queued ciphertext. | `ADR §"Multi-node fan-out"` lines 349-359; `backend/src/services/node_routing_service.rs:21-24`, `backend/src/services/node_routing_service.rs:154-165`, `backend/src/services/node_routing_service.rs:129-149` | Specify a separate fan-out target resolver: all intended active node bindings for the service/owner, with per-node ACL checks, not the viability-filtered `NodeRoute`. Then apply online sync delivery or offline queue per target. |
| F3 | medium | T7 overstates pending ID entropy. The ADR says pending IDs have approximately 256 bits of entropy, but current pending IDs are UUID v4 strings generated by `Uuid::new_v4()`, which provide about 122 random bits. That is still practically unguessable, but the ADR's evidence is inaccurate. | `ADR §"Threat model"` line 48; `backend/src/services/node_pending_credential_service.rs:60-62` | Correct the ADR to "~122 bits via UUID v4" or explicitly switch RCI pending IDs to 32-byte random identifiers if 256-bit entropy is a hard requirement. |
| F4 | medium | Base64url encoding is not pinned to padded vs unpadded form. The ADR says "base64url" for keys, nonce, and ciphertext, while the current codebase uses both URL-safe no-padding encodings for JWT-like material and standard base64 for WSS binary payloads. Interop fixtures will catch this late, but implementers need the canonical rule up front. | `ADR §"Cryptographic primitives"` line 170; `ADR §"Data model"` lines 196-200; `cli/src/auth.rs:7`, `cli/src/auth.rs:47`; `backend/src/handlers/node_ws.rs:149`; `cli/src/node/ws_client.rs:1088` | Specify `base64::engine::general_purpose::URL_SAFE_NO_PAD` for all RCI JSON fields, or specify an accepting decoder plus canonical encoder. Include exact decoded lengths: 32-byte pubkeys, 24-byte nonce, `<= 16 * 1024` ciphertext+tag. |
| F5 | medium | Crypto context canonicalization needs to be centralized. The ADR binds `injection_method` in AAD but does not define the canonical string source. Current pending credential APIs use kebab-case (`query-param`, `path-prefix`), while the live WSS `credential_update` apply path uses underscore tokens (`query_param`, `path_prefix`). The remote path can be correct, but only if all encrypt/decrypt sides use the same canonical pending-credential tokens. | `ADR §"HKDF info vs AEAD AAD"` lines 174-177; `backend/src/models/node_pending_credential.rs:16-22`; `cli/src/cli.rs:1555-1561`; `cli/src/node/ws_client.rs:3007-3010`, `cli/src/node/ws_client.rs:3036-3037`, `cli/src/node/ws_client.rs:3073-3074` | Put `RciAadContext` / `RciKdfInfo` builders in the shared crypto crate or a shared protocol module. Use explicit static domain labels such as `nyxid:rci:v1:kdf` and `nyxid:rci:v1:aad`, and define `injection_method` as the pending API's kebab-case token or an enum discriminant. |
| F6 | medium | The 16 KB ciphertext cap is specified for POST, but not explicitly for WSS forward, queued replay, or fan-out aggregate bodies. Current WSS writer paths only serialize and `try_send`; the ADR's tests only name the HTTP over-size rejection. A future bug or queue replay path could forward an oversized stored blob unless the same constant is checked at every boundary. | `ADR §"Data model"` line 194; `ADR §"Flow"` line 319; `ADR §"Error codes"` line 494; `ADR §"Test Strategy"` line 545; `backend/src/services/node_ws_manager.rs:1537-1540`, `backend/src/services/node_ws_manager.rs:1583-1591` | Define one `MAX_CIPHERTEXT_SIZE` and require checks on HTTP decode, MongoDB update, WSS frame construction, queued replay, and each fan-out element. Add tests for queued replay and fan-out total/per-node limits. |
| F7 | medium | Cancel/decline/expire key eviction is required but the control-plane message is not specified. The ADR requires sealed privkey eviction on consume/decline/cancel/expire and mentions missed cancel WSS frames, but the current admin cancel endpoint only deactivates MongoDB state and writes audit metadata; it does not notify the node. | `ADR §"Test Strategy"` lines 514-517; `ADR §"Design Review Feedback & Best Practices"` lines 699-701; `backend/src/handlers/node_admin.rs:637-664`; `backend/src/services/node_pending_credential_service.rs:127-150` | Add explicit node-control frames such as `pending_credential_canceled` / `pending_credential_expired`, or specify a polling diff protocol that causes deletion. Keep local sweep as a fallback, not the only cancel cleanup path. |
| F8 | medium | The operator-confirm path creates a plaintext-on-node memory queue, but the ADR only says "volatile ready-to-accept queue with TTL" and does not define TTL length, cancellation behavior, or zeroization. This is on the trusted node, not on NyxID, so it does not break the core server invariant; it is still key lifecycle surface that should be deterministic. | `ADR §"Flow"` lines 332-335; `ADR §"Accept gate"` lines 418-420; current node secret handling uses explicit secret backends at `cli/src/node/secret_backend.rs:121-152` | Specify the queue TTL, upper-bound it by `NodePendingCredential.expires_at`, use `Zeroizing` storage for plaintext buffers, and state that cancel/decline/expire/drop paths zeroize the queue entry before removal. |
| F9 | low | The SRI section has one self-contained overstatement. It says NyxID cannot silently substitute JS without changing the HTML's SRI attribute loaded over TLS. Under the ADR's own T1 model, a fully compromised server can also change that HTML; detection depends on the separate signed manifest and out-of-band admin verification. Other sections say this correctly, so this is a wording fix, not a protocol flaw. | `ADR §"Code-integrity infrastructure"` line 430; `ADR §"SRI hashes on crypto JS bundles"` line 443; `ADR §"Signed release channel"` lines 447-452; `ADR §"Admin verification UX"` lines 456-461 | Qualify the SRI paragraph: "SRI blocks stale/tampered bundles unless the HTML is also substituted; T1 detection requires comparing the HTML/bundle hash to the signed release manifest at the separate origin." |
| F10 | nit | Error-code documentation drift exists in both `AGENTS.md` and `CLAUDE.md`, but the ADR only calls out `AGENTS.md`. The code already assigns 8000-8005, while both docs still say 8000-8003. | `ADR §"Error codes"` lines 488-499; `backend/src/errors/mod.rs:374-379`; `AGENTS.md:85`; `CLAUDE.md:85` | Update both docs or make one explicitly generated/authoritative. The ADR's 8006-8011 allocation itself does not collide with current code. |

## Core Invariant Review

Under T2/passive compromise, the ADR's plaintext boundary is credible. The browser/CLI encrypts before POST (`ADR §"Browser-based push"` lines 85-91, `ADR §"CLI path"` lines 107-114), the server stores only the `CryptoBundle` envelope (`ADR §"Data model"` lines 181-203), and the backend is explicitly forbidden from enabling `decrypt` in the shared crate (`ADR §"Shared crypto crate"` line 620). Offline queuing stores ciphertext server-side, not plaintext (`ADR §"Sync vs async response semantics"` line 377).

Under T1/active compromise, the ADR is now honest: the server can replace the node pubkey and MITM the exchange (`ADR §"Threat model"` lines 51-56, `ADR §"T1 data-substitution caveat"` line 97). Phase 4.5 detects JS substitution only when the admin verifies the signed manifest out-of-band (`ADR §"Goals"` line 27, `ADR §"Code-integrity infrastructure"` line 430). That is a correct v1 boundary.

I found no current code path that contradicts the legacy invariant. The existing admin `node-credential push` posts metadata only (`cli/src/commands/node_credential.rs:23-32`) and prints "Do not send the secret value" (`cli/src/commands/node_credential.rs:48-54`). The node-side accept command prompts/stores locally, then marks consumed through the node-agent API (`cli/src/node/agent.rs:466-505`). The existing model stores pending metadata only (`backend/src/models/node_pending_credential.rs:26-47`).

## Crypto Envelope Review

The primitive choices are reasonable: X25519, HKDF-SHA256, and XChaCha20-Poly1305 are appropriate for an ephemeral ECDH envelope (`ADR §"Cryptographic primitives"` lines 162-170). Random 24-byte nonces for XChaCha20-Poly1305 are suitable if generated per encryption and covered by interop/unit tests (`ADR §"Test Strategy"` lines 519-527).

The main crypto-design improvements are not primitive changes; they are canonicalization controls. The KDF/AAD encoding should include fixed protocol labels and be generated by a shared builder, not handwritten in frontend, CLI, and node code (`ADR §"HKDF info vs AEAD AAD"` lines 174-177). Base64url must be pinned to padded or unpadded form (`ADR §"Cryptographic primitives"` line 170). These are medium-severity interop and future-proofing findings, not reasons to abandon the design.

## Race, DoS, And Queue Review

The first-push-wins plan is correct in shape: `find_one_and_update` with a null-ciphertext guard is the right MongoDB primitive (`ADR §"Race protection"` line 363). The strict sync ACK requirement also matches existing code: `send_credential_update_and_wait` allocates a request ID, registers a oneshot waiter, sends over WSS, and maps timeout/drop/negative ACK to node-failure errors (`backend/src/services/node_ws_manager.rs:1551-1633`).

The DoS/resource gaps are in queue representation and cap placement. The 15-minute queue TTL is not representable in the proposed single-node model (F1). The 16 KB cap should be repeated at all ingress and egress boundaries, not only HTTP POST (F6). The per-pending in-memory rate limiter is acceptable as a local throttle, but in multi-instance deployments it is not a global guarantee; the current node agent already acknowledges multi-instance WSS worker split behavior (`cli/src/node/ws_client.rs:99-101`). If NyxID deploys multiple backend workers, Phase 1 should document whether the rate limiter is best-effort per worker or move it to MongoDB/Redis.

## Backward Compatibility Review

The ADR preserves legacy `crypto: None` behavior (`ADR §"Backward compatibility & version detection"` lines 383-403, `ADR §"Accept gate"` lines 416-420). That aligns with the current split between laptop-side `nyxid node-credential push` and node-side `nyxid node credentials accept`: user-side push creates metadata only (`cli/src/commands/node_credential.rs:23-32`), node-side accept prompts for the secret and stores locally (`cli/src/node/agent.rs:466-505`, `cli/src/node/agent.rs:1331-1363`). I did not find a forced migration path in the ADR.

## Error-Code Review

The ADR's 8006-8011 range is non-overlapping with current code. Current node errors occupy 8000-8005 (`backend/src/errors/mod.rs:374-379`), and the ADR correctly reserves the next six slots (`ADR §"Error codes"` lines 486-497). The doc cleanup should update both `AGENTS.md:85` and `CLAUDE.md:85`, not only AGENTS.

## Recommended Gate Before Implementation

Fix F1 before treating the ADR as implementation-ready. F2 should be resolved before Phase 3.5 starts, but it does not block Phase 1 single-node stubs if the fan-out endpoint is explicitly deferred. F3-F8 should be folded into Phase 1/2 acceptance criteria because they are cheapest to fix at protocol-boundary time.

⟦AI:AUTO-LOOP⟧

kaiweijw added a commit that referenced this pull request Jun 4, 2026
错误码 8006-8011 + CryptoBundle/RemoteCryptoState + 离线队列生命周期字段(审计 F1)+ redacting Debug(CLAUDE §5)+ 两个 partial 索引/sweep + node-error 真值同步。design-consensus #872 + 独立安全审计 #822 + 3-reviewer 双轮。Refs #773
kaiweijw added a commit that referenced this pull request Jun 8, 2026
…ases 1–6 rollup) (#895)

* feat(backend): #773 Phase 1 远程凭证注入后端协议骨架 (#873)

错误码 8006-8011 + CryptoBundle/RemoteCryptoState + 离线队列生命周期字段(审计 F1)+ redacting Debug(CLAUDE §5)+ 两个 partial 索引/sweep + node-error 真值同步。design-consensus #872 + 独立安全审计 #822 + 3-reviewer 双轮。Refs #773

* feat(crypto): #773 Phase 2 node 端 RCI crypto + nyxid-crypto crate (#877)

CLI/node-owned nyxid-crypto(X25519/HKDF/XChaCha20)+ node sealed-key 生命周期 + CI dependency-graph invariant 证明 backend 零 crypto 依赖(backend LOC=0)。design-consensus #874(r4)+ 独立 verify + 2 轮 3-reviewer。backend WS 路由见 Phase 2.5 #876。Refs #773

* feat(backend): #773 Phase 2.5 RCI backend WS 帧路由(非 crypto) (#879)

existing-service glue;backend 零 crypto 依赖(CI-enforced);async 202;GET pubkey/POST ciphertext/WS frame/reconnect drain;删 DecryptedPendingConfirmation;ACL 负向 + 行为测试。design-consensus #876 r6(supervisor 覆盖误判 drop)。Refs #773

* feat(audit): #773 Phase 4 — RCI metadata-only audit + observability (#881)

Typed RciAudit* interface; metadata-only audit for all 18 node_credential_rci_* emission branches with read-back coverage; single error-code taxonomy source; backend stays zero-crypto (boundary gate passes); legacy crypto:None unchanged. Closes #878. Reviews: architect comment, tests approve, quality comment (advisory).

⟦AI:AUTO-LOOP⟧

* feat(frontend): #773 Phase 3 — browser WebCrypto RCI push + accept (#882)

Browser @noble E2E-encrypted remote credential injection: lib/crypto (X25519+HKDF+XChaCha20-Poly1305, base64url no-pad), metadata-only RHF+Zod push forms (remote_crypto literal-true), independent credential-accept page (uncontrolled secret, browser-only encrypt, posts only CiphertextEnvelope), interop fixture mutual-decryptability. return_to open-redirect hardened via shared trusted-origin guard. Backend + nyxid-crypto zero diff. Closes #880. Reviews: architect approve (security re-review) / tests approve / quality approve.

⟦AI:AUTO-LOOP⟧

* feat(node): #773 Phase 3.5 — multi-node credential fan-out (#884)

Embedded fan_out_nodes + revision-guard atomic N-ciphertext submit; separate offline-aware CredentialFanOutTargetResolver with per-node ACL (no NodeRoute coupling); all-N/partial_decrypted/retry-failed/partial->expired; F6 per-element+aggregate+MAX_FAN_OUT_TARGETS caps; metadata-only fan-out audit via typed rci_audit_service; route-level success + boundary coverage. Backend zero-crypto (boundary exit 0); single-node Phase 3 + legacy crypto:None byte-unchanged. Closes #883. Reviews: architect approve / tests approve / quality comment.

⟦AI:AUTO-LOOP⟧

* feat(security): #773 Phase 4.5 — code-integrity (standalone accept page + SRI + admin verify) (#887)

Standalone hardened credential-accept page via zero-crypto backend handler (strict CSP + SHA-384 SRI, replaces Phase 3 SPA accept); admin fingerprint-verify UX; releases.json manifest + byte-identical browser/Node fingerprint canonicalization; vite plugin preserves data-nyx-integrity-role + backend 503 fail-closed; release workflow disabled=>warn / enabled+complete=>sign / enabled+incomplete=>fail-fast (no fabricated infra); docs/RELEASE_INTEGRITY.md 6-item runbook + reachability; service-layer policy; Zod-validated runtime-config. Backend zero-crypto; single-node+fan-out(retry)+legacy crypto:None byte-unchanged; integrity_verification audit-only. Closes #885. Reviews: architect approve / tests approve / quality comment.

⟦AI:AUTO-LOOP⟧

* feat(cli): #773 Phase 5 — CLI inject (nyxid node-credential inject) (#891)

Admin-side full-e2e remote credential injection via the local nyxid binary (T1 code-substitution-immune): push remote_crypto -> poll pubkey -> encrypt locally (reusing the Phase 2 shared RciContext, byte-identical to node decrypt) -> post ciphertext. interactive/--secret-env/--browser modes; strict single-format pubkey fingerprint helper shared with the node-agent console for out-of-band verification; Zeroizing secret never logged; --browser reads no secret; inject --pending metadata-only-inits legacy pendings (no backend crypto); push default unchanged; fan-out inject deferred. Backend stays ZERO crypto (boundary exit 0). Closes #889. Reviews: architect approve / tests approve / quality approve.

⟦AI:AUTO-LOOP⟧

* docs(cli): #773 Phase 6 — RCI discovery hint rewrites (#894)

Text-only CLI hint rewrites in node-credential + service commands so agents/admins discover the full RCI flow (push metadata-only -> inject/--browser/--secret-env/accept page, out-of-band --verify-fingerprint, org opt-out, error codes 8006-8011) with honest detection-only/T1 framing. Shared RciCliHintLines renderer; tests assert exact merged commands + reject phantom flags. No behavior/flag/crypto change; backend/nyxid-crypto/frontend untouched; boundary exit 0. Closes #892. Reviews: architect approve / tests approve / quality comment. FINAL phase of the #773 remote-credential-injection feature.

⟦AI:AUTO-LOOP⟧

* fix(rci): resolve clippy lints under current stable for #773 rollup (#895)

CI runs clippy against unpinned latest-stable (rustc 1.96); these lints were
clean under the older stable the phase PRs ran with. All in RCI code:

- node_credential.rs: collapse nested if-let via .filter() (production poll loop)
- credential_accept.rs: let-else -> ? in SRI script-tag parser (production)
- node_credential.rs tests: #[allow(clippy::await_holding_lock)] on 4 env_lock
  single-thread tests (env mutation serialized; held across await by design)
- node_pending_credential_service.rs tests: &[x.clone()] -> std::slice::from_ref(&x)

Verified: cargo +stable clippy --workspace --all-targets -- -D warnings (1.96) clean;
backend zero-crypto boundary unaffected (test/lint-only + 2 trivial prod rewrites).

* style(rci): cargo fmt the slice::from_ref test call wrapping (#895)

Pure rustfmt formatting of the 3 insert_fan_out_pending test calls touched by
the prior clippy fix; no logic change.

* docs(skill): #773 document remote credential injection in nyxid skill (#895)

The 8 RCI phases shipped `nyxid node-credential inject` (browser/laptop
admin-side end-to-end encryption) but never updated the agent-facing skill.

- references/nodes.md: new 'Remote credential injection' subsection —
  inject (interactive/--secret-env/--browser), out-of-band fingerprint
  verification, secret-never-on-argv, browser accept page (SRI/signed
  manifest; detection-not-prevention), web-UI multi-node fan-out, and
  error codes 8006-8011.
- SKILL.md: add inject / remote-injection triggers to the nodes reference
  map row; add an end-to-end-encryption bullet to Security and Privacy.

Docs only; no version bump (release process owns skill versioning).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant