Skip to content

test(e2e): migrate agent turn latency to vitest#5554

Open
cv wants to merge 14 commits into
mainfrom
e2e-migrate/test-agent-turn-latency
Open

test(e2e): migrate agent turn latency to vitest#5554
cv wants to merge 14 commits into
mainfrom
e2e-migrate/test-agent-turn-latency

Conversation

@cv

@cv cv commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Summary

Migrates the real agent turn latency E2E into a typed live Vitest scenario. The new test onboards OpenClaw and Hermes against hosted inference, verifies their managed inference.local configuration, and times one real model-backed turn through each runtime.

Related Issue

Refs #5098

Changes

  • Add a typed live Vitest replacement for test/e2e/test-agent-turn-latency-e2e.sh.
  • Wire a free-standing dispatchable Vitest job into .github/workflows/e2e-vitest-scenarios.yaml.
  • Preserve legacy shell deletion and any legacy shell workflow cleanup for Phase 11 per Epic: Migrate legacy bash E2E into the Vitest E2E system #5098 migration governance.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • PR description includes the DCO sign-off declaration and every commit appears as Verified in GitHub
  • Git hooks passed during commit and push, or npx prek run --from-ref main --to-ref HEAD passes
  • Targeted tests pass for changed behavior
  • Full npm test passes (broad runtime changes only)
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Targeted commands run:

  • npx biome check --write test/e2e-scenario/live/agent-turn-latency.test.ts
  • NEMOCLAW_RUN_E2E_SCENARIOS=1 npx vitest run --project e2e-scenarios-live test/e2e-scenario/live/agent-turn-latency.test.ts -t __compile_only_nomatch__ --silent=false --reporter=default --passWithNoTests
  • npx vitest run --project e2e-vitest-support test/e2e-scenario/support-tests/e2e-scenarios-workflow.test.ts
  • npx tsx scripts/check-test-file-size-budget.ts test/e2e-scenario/live/agent-turn-latency.test.ts
  • npx tsc --noEmit --strict --moduleResolution bundler --module preserve --target ES2022 --types node --allowImportingTsExtensions test/e2e-scenario/live/agent-turn-latency.test.ts
  • git diff --check

Signed-off-by: Carlos Villela cvillela@nvidia.com

Summary by CodeRabbit

  • Tests

    • Replaced the prior shell-based latency check with a gated Live E2E Vitest test that runs two sequential hosted-inference turns (OpenClaw, then Hermes).
    • Validates routing/config, checks sandbox readiness, measures wall-clock latency against the configured limit, asserts expected model output, writes run/results artifacts, and performs best-effort sandbox cleanup.
  • CI/CD

    • Added a dedicated free-standing Vitest CI job for the latency scenario (conditional on selected scenarios/jobs), uploads artifacts, and surfaces results in the PR’s Vitest scenario table.

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@cv cv self-assigned this Jun 19, 2026
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a new live Vitest e2e test (agent-turn-latency.test.ts) that measures agent turn latency for OpenClaw and Hermes sandboxes, including retry logic, latency assertions, and JSON artifact output. A corresponding CI workflow job (agent-turn-latency-vitest) is added and wired into the report-to-pr aggregation.

Changes

Agent Turn Latency E2E Scenario

Layer / File(s) Summary
Test configuration and environment setup
test/e2e-scenario/live/agent-turn-latency.test.ts
Defines repo/CLI paths, sandbox/model/provider/timeout constants from environment, and an env() helper that builds per-sandbox process environments with agent selection and inference API key injection.
Test utilities and validation helpers
test/e2e-scenario/live/agent-turn-latency.test.ts
Implements test utilities (bestEffort(), responseBodyAndStatus(), chatContent(), msSince() for latency measurement), configuration validators (assertOpenClawConfig(), assertHermesConfig()), and installSandbox() with bounded retries and transient-failure detection.
Gated live scenario: install, execute, assert, and record latency
test/e2e-scenario/live/agent-turn-latency.test.ts
Implements the full gated live test: orchestrates cleanup, docker check, sequential OpenClaw and Hermes sandbox install/run, per-turn latency measurement via process.hrtime.bigint(), integer-42 and timeout assertions, and writes turn-latency-results.json and legacy-path artifacts.
CI workflow job and PR report wiring
.github/workflows/e2e-vitest-scenarios.yaml
Adds the agent-turn-latency-vitest job (OpenShell install, NVIDIA_INFERENCE_API_KEY injection, Vitest execution, artifact upload) and registers it in the report-to-pr needs array.

Sequence Diagram(s)

sequenceDiagram
  participant CI as GitHub Actions
  participant Test as agent-turn-latency.test.ts
  participant OpenClaw as OpenClaw Sandbox
  participant Hermes as Hermes Sandbox
  participant Artifacts as e2e-artifacts/

  CI->>Test: vitest run (with NVIDIA_INFERENCE_API_KEY)
  Test->>Test: shouldRunLiveE2EScenarios() gate
  Test->>OpenClaw: installSandbox() — install.sh with retries
  OpenClaw-->>Test: install success
  Test->>OpenClaw: run agent via sandbox shell (record bigint start)
  OpenClaw-->>Test: response text
  Test->>Test: assert contains 42 and within latency max
  Test->>OpenClaw: destroy sandbox
  Test->>Hermes: installSandbox() — install.sh with retries
  Hermes-->>Test: install success
  Test->>Hermes: curl /v1/chat/completions (record bigint start)
  Hermes-->>Test: JSON response
  Test->>Test: chatContent() extract, assert 42 and latency
  Test->>Artifacts: write turn-latency-results.json
  CI->>CI: upload artifacts from e2e-artifacts/vitest/agent-turn-latency/
  CI->>CI: report-to-pr includes agent-turn-latency-vitest result
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#5349: Adds another free-standing live Vitest job to the same e2e-vitest-scenarios.yaml workflow and extends report-to-pr.needs with the same pattern.
  • NVIDIA/NemoClaw#5493: Follows the same pattern — new free-standing live Vitest job in e2e-vitest-scenarios.yaml plus report-to-pr needs update — for a different scenario.
  • NVIDIA/NemoClaw#5494: Adds a standalone live Vitest job to the same workflow and wires it into report-to-pr for PR result aggregation.

Suggested labels

area: e2e, chore

Poem

🐇 Hop hop, the latency clock ticks away,
OpenClaw and Hermes both tested today.
Install, retry, assert — does 42 appear?
The bigint timer marks each millisecond clear.
CI uploads the proof, the PR table glows,
A bunny checks the workflow — and off it goes! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'test(e2e): migrate agent turn latency to vitest' directly and concisely describes the main change: migrating an E2E test from shell script to Vitest.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch e2e-migrate/test-agent-turn-latency

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-code-quality

github-code-quality Bot commented Jun 19, 2026

Copy link
Copy Markdown

Code Coverage Overview

Languages: TypeScript

TypeScript / code-coverage/plugin

The overall coverage in the branch is 96%. Coverage data for the branch is not yet available.

Show a code coverage summary of the most covered files.
File 9d3e26e +/-
nemoclaw/src/se...cret-scanner.ts 100%
nemoclaw/src/commands/slash.ts 100%
nemoclaw/src/li...bprocess-env.ts 100%
nemoclaw/src/bl...eprint/state.ts 98%
nemoclaw/src/onboard/config.ts 98%
nemoclaw/src/bl...int/snapshot.ts 97%
nemoclaw/src/bl...print/runner.ts 95%
nemoclaw/src/co...ration-state.ts 94%
nemoclaw/src/bl...ate-networks.ts 94%
nemoclaw/src/index.ts 94%

TypeScript / code-coverage/cli

The overall coverage in the branch is 46%. Coverage data for the branch is not yet available.

Show a code coverage summary of the most covered files.
File 9d3e26e +/-
src/lib/state/o...oard-session.ts 90%
src/lib/inference/local.ts 76%
src/lib/sandbox/config.ts 72%
src/lib/actions...dbox/rebuild.ts 67%
src/lib/onboard/preflight.ts 64%
src/lib/actions...licy-channel.ts 56%
src/lib/state/sandbox.ts 55%
src/lib/policy/index.ts 49%
src/lib/onboard...er-gpu-patch.ts 44%
src/lib/onboard.ts 18%

Updated June 20, 2026 18:51 UTC
Code Coverage is in Public Preview. Learn more and provide us with your feedback.

@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: agent-turn-latency
Optional E2E: inference-routing, hermes-e2e

Dispatch hint: agent-turn-latency-vitest

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • agent-turn-latency (high; live hosted inference with two sandbox installs, timeout 110 minutes): Directly validates the newly added workflow job and live scenario covering two real sandboxes, hosted inference routing, OpenClaw agent turn latency, and Hermes API turn latency.

Optional E2E

  • inference-routing (medium): Useful adjacent confidence for provider route classification and inference.local behavior used by the new latency scenario.
  • hermes-e2e (high): Adjacent confidence for the Hermes onboarding and live inference path also exercised by the new agent-turn-latency scenario.

New E2E recommendations

  • None.

Dispatch hint

  • Workflow: .github/workflows/e2e-vitest-scenarios.yaml
  • jobs input: agent-turn-latency-vitest

@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Recommendation

Required Vitest E2E scenarios: agent-turn-latency-vitest
Optional Vitest E2E scenarios: None

Dispatch required Vitest E2E scenarios:

  • gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field jobs=agent-turn-latency-vitest

Workflow run

Full Vitest E2E advisor summary

Vitest E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required Vitest E2E scenarios

  • agent-turn-latency-vitest: Focused free-standing Vitest job wired for changed live test test/e2e-scenario/live/agent-turn-latency.test.ts.
    • Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field jobs=agent-turn-latency-vitest

Optional Vitest E2E scenarios

  • None.

Relevant changed files

  • .github/workflows/e2e-vitest-scenarios.yaml
  • test/e2e-scenario/live/agent-turn-latency-helpers.ts
  • test/e2e-scenario/live/agent-turn-latency.test.ts

Comment thread test/e2e-scenario/live/agent-turn-latency.test.ts Fixed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e-scenario/live/agent-turn-latency.test.ts`:
- Around line 157-165: The openclawConfig variable captures raw config output
without redacting sensitive credentials, creating a potential API key leakage
risk into artifacts. The Hermes config path already implements redaction of
apiKey values, but openclawConfig does not. Add similar redaction logic to
openclawConfig to remove or mask sensitive fields like apiKey before the output
is captured in the artifact, following the same redaction pattern that is
already applied to the Hermes config to prevent credentials from leaking into
logs and artifacts.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e9e10c9c-b55e-4d56-96da-a59a80764c50

📥 Commits

Reviewing files that changed from the base of the PR and between 9d4d1f5 and 681deec.

📒 Files selected for processing (2)
  • .github/workflows/e2e-vitest-scenarios.yaml
  • test/e2e-scenario/live/agent-turn-latency.test.ts

Comment thread test/e2e-scenario/live/agent-turn-latency.test.ts
@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 13 worth checking, 0 nice ideas
Since last review: 1 prior item resolved, 8 still apply, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Source-of-truth review needed: Hosted inference compatibility setup: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: MODEL, PROVIDER, EXPECTED_ROUTE_PROVIDER, and env() do not propagate ci-compatible-inference helper outputs.
  • Source-of-truth review needed: OpenClaw JSON assistant-text extraction: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: collectAssistantText() omits retained keys such as reasoning_content, payload, response, data, output, outputs, items, and segments.
  • Source-of-truth review needed: Hermes readiness recovery: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: New hermesHealth is a single curl -sf --max-time 10 with only exit-code assertion.
  • Source-of-truth review needed: Transient install retry cleanup: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: installSandbox() retries after isTransientProviderValidationFailure() without invoking cleanupTurnSandboxes() or retained forward/gateway cleanup.
  • Source-of-truth review needed: bestEffort cleanup swallowing: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: bestEffort() catches all errors and cleanupTurnSandboxes() uses it for both destroy and sandbox delete.
  • Port the retained hosted-inference trust-boundary setup (test/e2e-scenario/live/agent-turn-latency-helpers.ts:24): The migrated helper derives model, provider, and expected route directly and only maps NVIDIA key aliases into the install environment. The retained shell source configures compatible hosted inference before deriving these values, including endpoint URL, compatible credential variable, compatible model, preferred API, provider key, and expected route provider. This can make the new secret-bearing live test validate a different hosted-inference boundary than the shell scenario it claims to replace.
    • Recommendation: Reuse or port the retained compatible-inference decision logic so NEMOCLAW_ENDPOINT_URL, COMPATIBLE_API_KEY, NEMOCLAW_COMPAT_MODEL, NEMOCLAW_PREFERRED_API, model, provider, and expected route provider are propagated consistently. Add focused coverage for compatible-hosted mode and default-provider/model selection.
    • Evidence: Retained test/e2e/test-agent-turn-latency-e2e.sh calls nemoclaw_e2e_configure_compatible_inference before deriving TURN_MODEL, TURN_PROVIDER_KEY, and EXPECTED_ROUTE_PROVIDER. New MODEL, PROVIDER, EXPECTED_ROUTE_PROVIDER, and env() do not use ci-compatible-inference helper outputs.
  • Harden Hermes local API token handling in artifacts (test/e2e-scenario/live/agent-turn-latency-helpers.ts:280): The Hermes command sources /sandbox/.hermes/.env and sends API_SERVER_KEY as a bearer token, but the probe explicitly redacts only the hosted inference key. The centralized redaction regex should catch common Bearer-token echoes, but a sandbox-local token echoed in an unexpected or short shape could still be preserved in uploaded artifacts. The migrated command also sends an Authorization header even when API_SERVER_KEY is empty, unlike the retained shell.
    • Recommendation: Preserve the retained conditional Authorization header behavior. If Hermes response bodies remain artifacted, capture the sandbox-local API_SERVER_KEY through a redaction-safe path and add it to explicit redactionValues. Add a regression case where Hermes or middleware echoes the local token.
    • Evidence: hermesTurnCommand() builds -H "Authorization: Bearer ${API_SERVER_KEY:-}" unconditionally. The hermes-api-turn invocation passes redactionValues: [apiKey], not the sandbox-local API_SERVER_KEY.
  • Restore clean sandbox, gateway, and forwarding cleanup before install retries (test/e2e-scenario/live/agent-turn-latency-helpers.ts:187): The retained shell scenario pre-cleans sandbox and gateway state before the first install and before retrying transient provider validation failures. The migrated installSandbox() sleeps and retries without running the retained cleanup sequence, and cleanupTurnSandboxes() does not stop forwarding or destroy the OpenShell gateway. In a credential-bearing live path, this weakens confidence that retries start from a clean sandbox, gateway, forwarding, and credential-artifact state.
    • Recommendation: Run the full retained cleanup sequence before each retry, or document and test that NEMOCLAW_RECREATE_SANDBOX=1 fully clears failed-install sandbox, gateway, forwarding, and credential artifacts. Include forward-stop and gateway-destroy behavior where needed, and avoid swallowing cleanup failures without evidence.
    • Evidence: Retained destroy_sandbox stops openshell forward 8642, deletes the sandbox, destroys the nemoclaw gateway, and invokes both normal and Hermes destroy paths. New installSandbox() retries after isTransientProviderValidationFailure() without cleanup, and cleanupTurnSandboxes() only calls node CLI destroy and openshell sandbox delete inside bestEffort().
  • Restore OpenClaw provider, transport, and policy-error rejection (test/e2e-scenario/live/agent-turn-latency.test.ts:83): The retained OpenClaw turn fails if combined stdout/stderr contains provider, transport, or policy-error signatures. The migrated test accepts a zero exit code and a parsed 42 reply without checking those signatures, so a successful-looking turn accompanied by SSRF/network-policy/provider errors could pass.
    • Recommendation: Carry over the retained provider/transport error signature check against the captured OpenClaw turn output before accepting the 42 reply. Add a focused negative test where output includes 42 plus a retained error signature.
    • Evidence: Retained run_openclaw_turn greps combined stdout/stderr for SsrFBlockedError, transport error, ECONNREFUSED, EAI_AGAIN, gateway unavailable, and network connection error. New assertions check only exitCode, extractOpenClawAgentText(stdout), and elapsed time.
  • Restore Hermes readiness retry and ok-body validation (test/e2e-scenario/live/agent-turn-latency.test.ts:104): The retained shell scenario polls Hermes health up to ten times and requires the response body to contain ok before posting the chat completion request. The migrated test performs a single curl -sf and checks only the exit code, which can make the live scenario flaky after install or proceed on a non-ok 2xx health response.
    • Recommendation: Restore a bounded readiness loop that retries /health, sleeps between attempts, and requires an ok response body before the Hermes chat request. Cover initial failures and non-ok 2xx responses followed by a successful ok response.
    • Evidence: Retained assert_hermes_health loops over attempts 1..10 with sleep 5 and grep -qi '"ok"'. New hermesHealth is one curl -sf --max-time 10 followed by an exit-code assertion.
  • Port the retained OpenClaw JSON assistant-text parser (test/e2e-scenario/live/agent-turn-latency-helpers.ts:119): The migrated OpenClaw parser accepts fewer retained envelope shapes and has no focused tests proving it avoids metadata-only false positives. This can fail valid OpenClaw output accepted by the shell source, or miss visible assistant text when OpenClaw emits a supported compatibility envelope.
    • Recommendation: Port the retained openclaw-json.sh behavior into extractOpenClawAgentText(), or document why omitted envelope shapes are no longer valid. Add focused tests for all accepted envelope shapes and for metadata-only 42 not satisfying the reply assertion.
    • Evidence: Retained openclaw-json.sh accepts text, content, and reasoning_content and recurses through result, payloads, payload, messages, choices, response, data, output, outputs, items, segments, and delta. New collectAssistantText() walks result, payloads, messages, choices, message, delta, content, and text.
  • Add a dedicated workflow-boundary check for the new secret-bearing job (.github/workflows/e2e-vitest-scenarios.yaml:665): The new agent-turn-latency-vitest job is secret-bearing and runs installer, OpenShell, sandbox lifecycle, and hosted-inference code. It follows several good workflow patterns, but the workflow-boundary source only applies generic free-standing checks and does not validate this job's exact secret scope, checkout credential persistence, dependency install command, OpenShell install boundary, artifact settings, timeout, stable artifact path, and report coverage.
    • Recommendation: Add a validateAgentTurnLatencyVitestJob-style boundary check and support-test coverage mirroring comparable high-risk free-standing jobs. Assert the expected env, pinned actions, persist-credentials=false, npm ci --ignore-scripts, no secrets outside the Vitest step, artifact upload settings, and report-to-pr needs entry.
    • Evidence: tools/e2e-scenarios/workflow-boundary.mts has dedicated validators for many comparable jobs, but grep found no agent-turn-latency-specific validator. The workflow diff adds NVIDIA_INFERENCE_API_KEY to the Vitest step.
  • Add focused support coverage for migration fidelity and secret-bearing branches (test/e2e-scenario/live/agent-turn-latency.test.ts:1): This PR adds the live scenario and workflow job, but not focused support tests for the helper branches and retained-shell guardrails that changed during migration. Static review can compare the code, but behavior depends on Docker, OpenShell, installer state, hosted inference, daemon readiness, artifact redaction, and sandbox/gateway cleanup.
    • Recommendation: Add behavior-specific support or integration tests for hosted-compatible inference env mapping, OpenClaw JSON envelope compatibility, Hermes health retry/body validation, transient retry cleanup, Hermes local API token redaction/conditional header behavior, OpenClaw transport-error rejection, and agent-turn-latency workflow-boundary invariants.
    • Evidence: New tolerant/recovery paths include positiveInt(), firstJsonObject(), collectAssistantText(), bestEffort(), transient install retry handling, and hermesTurnCommand(), while the PR only adds the live scenario and workflow job.

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation** — Hosted-compatible inference env mapping uses NEMOCLAW_ENDPOINT_URL, COMPATIBLE_API_KEY, NEMOCLAW_COMPAT_MODEL, NEMOCLAW_PREFERRED_API, model, provider, and expected route provider.. The PR changes a secret-bearing live workflow and two real sandbox hosted-inference paths. Static review can assess structure and migration fidelity, but behavior depends on Docker, OpenShell, installer state, hosted inference, daemon readiness, artifact redaction, sandbox/gateway cleanup, and network-policy error surfaces.
  • **Runtime validation** — OpenClaw agent output containing 42 plus SsrFBlockedError or a retained transport error signature is rejected.. The PR changes a secret-bearing live workflow and two real sandbox hosted-inference paths. Static review can assess structure and migration fidelity, but behavior depends on Docker, OpenShell, installer state, hosted inference, daemon readiness, artifact redaction, sandbox/gateway cleanup, and network-policy error surfaces.
  • **Runtime validation** — Hermes health readiness retries initial failures and rejects non-ok 2xx responses before accepting ok.. The PR changes a secret-bearing live workflow and two real sandbox hosted-inference paths. Static review can assess structure and migration fidelity, but behavior depends on Docker, OpenShell, installer state, hosted inference, daemon readiness, artifact redaction, sandbox/gateway cleanup, and network-policy error surfaces.
  • **Runtime validation** — Transient install retry runs full sandbox, forward, and gateway cleanup before retry.. The PR changes a secret-bearing live workflow and two real sandbox hosted-inference paths. Static review can assess structure and migration fidelity, but behavior depends on Docker, OpenShell, installer state, hosted inference, daemon readiness, artifact redaction, sandbox/gateway cleanup, and network-policy error surfaces.
  • **Runtime validation** — Hermes turn omits Authorization when API_SERVER_KEY is empty and redacts echoed API_SERVER_KEY when present.. The PR changes a secret-bearing live workflow and two real sandbox hosted-inference paths. Static review can assess structure and migration fidelity, but behavior depends on Docker, OpenShell, installer state, hosted inference, daemon readiness, artifact redaction, sandbox/gateway cleanup, and network-policy error surfaces.
  • **Add focused support coverage for migration fidelity and secret-bearing branches** — Add behavior-specific support or integration tests for hosted-compatible inference env mapping, OpenClaw JSON envelope compatibility, Hermes health retry/body validation, transient retry cleanup, Hermes local API token redaction/conditional header behavior, OpenClaw transport-error rejection, and agent-turn-latency workflow-boundary invariants.
  • **Hosted inference compatibility setup** — Missing coverage for compatible-hosted mode env mapping and expected route provider.. MODEL, PROVIDER, EXPECTED_ROUTE_PROVIDER, and env() do not propagate ci-compatible-inference helper outputs.
  • **OpenClaw JSON assistant-text extraction** — Missing focused tests for retained text/content/reasoning_content fields, all retained container keys, and metadata-only 42 rejection.. collectAssistantText() omits retained keys such as reasoning_content, payload, response, data, output, outputs, items, and segments.
Since last review details

Current findings:

  • Source-of-truth review needed: Hosted inference compatibility setup: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: MODEL, PROVIDER, EXPECTED_ROUTE_PROVIDER, and env() do not propagate ci-compatible-inference helper outputs.
  • Source-of-truth review needed: OpenClaw JSON assistant-text extraction: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: collectAssistantText() omits retained keys such as reasoning_content, payload, response, data, output, outputs, items, and segments.
  • Source-of-truth review needed: Hermes readiness recovery: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: New hermesHealth is a single curl -sf --max-time 10 with only exit-code assertion.
  • Source-of-truth review needed: Transient install retry cleanup: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: installSandbox() retries after isTransientProviderValidationFailure() without invoking cleanupTurnSandboxes() or retained forward/gateway cleanup.
  • Source-of-truth review needed: bestEffort cleanup swallowing: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: bestEffort() catches all errors and cleanupTurnSandboxes() uses it for both destroy and sandbox delete.
  • Port the retained hosted-inference trust-boundary setup (test/e2e-scenario/live/agent-turn-latency-helpers.ts:24): The migrated helper derives model, provider, and expected route directly and only maps NVIDIA key aliases into the install environment. The retained shell source configures compatible hosted inference before deriving these values, including endpoint URL, compatible credential variable, compatible model, preferred API, provider key, and expected route provider. This can make the new secret-bearing live test validate a different hosted-inference boundary than the shell scenario it claims to replace.
    • Recommendation: Reuse or port the retained compatible-inference decision logic so NEMOCLAW_ENDPOINT_URL, COMPATIBLE_API_KEY, NEMOCLAW_COMPAT_MODEL, NEMOCLAW_PREFERRED_API, model, provider, and expected route provider are propagated consistently. Add focused coverage for compatible-hosted mode and default-provider/model selection.
    • Evidence: Retained test/e2e/test-agent-turn-latency-e2e.sh calls nemoclaw_e2e_configure_compatible_inference before deriving TURN_MODEL, TURN_PROVIDER_KEY, and EXPECTED_ROUTE_PROVIDER. New MODEL, PROVIDER, EXPECTED_ROUTE_PROVIDER, and env() do not use ci-compatible-inference helper outputs.
  • Harden Hermes local API token handling in artifacts (test/e2e-scenario/live/agent-turn-latency-helpers.ts:280): The Hermes command sources /sandbox/.hermes/.env and sends API_SERVER_KEY as a bearer token, but the probe explicitly redacts only the hosted inference key. The centralized redaction regex should catch common Bearer-token echoes, but a sandbox-local token echoed in an unexpected or short shape could still be preserved in uploaded artifacts. The migrated command also sends an Authorization header even when API_SERVER_KEY is empty, unlike the retained shell.
    • Recommendation: Preserve the retained conditional Authorization header behavior. If Hermes response bodies remain artifacted, capture the sandbox-local API_SERVER_KEY through a redaction-safe path and add it to explicit redactionValues. Add a regression case where Hermes or middleware echoes the local token.
    • Evidence: hermesTurnCommand() builds -H "Authorization: Bearer ${API_SERVER_KEY:-}" unconditionally. The hermes-api-turn invocation passes redactionValues: [apiKey], not the sandbox-local API_SERVER_KEY.
  • Restore clean sandbox, gateway, and forwarding cleanup before install retries (test/e2e-scenario/live/agent-turn-latency-helpers.ts:187): The retained shell scenario pre-cleans sandbox and gateway state before the first install and before retrying transient provider validation failures. The migrated installSandbox() sleeps and retries without running the retained cleanup sequence, and cleanupTurnSandboxes() does not stop forwarding or destroy the OpenShell gateway. In a credential-bearing live path, this weakens confidence that retries start from a clean sandbox, gateway, forwarding, and credential-artifact state.
    • Recommendation: Run the full retained cleanup sequence before each retry, or document and test that NEMOCLAW_RECREATE_SANDBOX=1 fully clears failed-install sandbox, gateway, forwarding, and credential artifacts. Include forward-stop and gateway-destroy behavior where needed, and avoid swallowing cleanup failures without evidence.
    • Evidence: Retained destroy_sandbox stops openshell forward 8642, deletes the sandbox, destroys the nemoclaw gateway, and invokes both normal and Hermes destroy paths. New installSandbox() retries after isTransientProviderValidationFailure() without cleanup, and cleanupTurnSandboxes() only calls node CLI destroy and openshell sandbox delete inside bestEffort().
  • Restore OpenClaw provider, transport, and policy-error rejection (test/e2e-scenario/live/agent-turn-latency.test.ts:83): The retained OpenClaw turn fails if combined stdout/stderr contains provider, transport, or policy-error signatures. The migrated test accepts a zero exit code and a parsed 42 reply without checking those signatures, so a successful-looking turn accompanied by SSRF/network-policy/provider errors could pass.
    • Recommendation: Carry over the retained provider/transport error signature check against the captured OpenClaw turn output before accepting the 42 reply. Add a focused negative test where output includes 42 plus a retained error signature.
    • Evidence: Retained run_openclaw_turn greps combined stdout/stderr for SsrFBlockedError, transport error, ECONNREFUSED, EAI_AGAIN, gateway unavailable, and network connection error. New assertions check only exitCode, extractOpenClawAgentText(stdout), and elapsed time.
  • Restore Hermes readiness retry and ok-body validation (test/e2e-scenario/live/agent-turn-latency.test.ts:104): The retained shell scenario polls Hermes health up to ten times and requires the response body to contain ok before posting the chat completion request. The migrated test performs a single curl -sf and checks only the exit code, which can make the live scenario flaky after install or proceed on a non-ok 2xx health response.
    • Recommendation: Restore a bounded readiness loop that retries /health, sleeps between attempts, and requires an ok response body before the Hermes chat request. Cover initial failures and non-ok 2xx responses followed by a successful ok response.
    • Evidence: Retained assert_hermes_health loops over attempts 1..10 with sleep 5 and grep -qi '"ok"'. New hermesHealth is one curl -sf --max-time 10 followed by an exit-code assertion.
  • Port the retained OpenClaw JSON assistant-text parser (test/e2e-scenario/live/agent-turn-latency-helpers.ts:119): The migrated OpenClaw parser accepts fewer retained envelope shapes and has no focused tests proving it avoids metadata-only false positives. This can fail valid OpenClaw output accepted by the shell source, or miss visible assistant text when OpenClaw emits a supported compatibility envelope.
    • Recommendation: Port the retained openclaw-json.sh behavior into extractOpenClawAgentText(), or document why omitted envelope shapes are no longer valid. Add focused tests for all accepted envelope shapes and for metadata-only 42 not satisfying the reply assertion.
    • Evidence: Retained openclaw-json.sh accepts text, content, and reasoning_content and recurses through result, payloads, payload, messages, choices, response, data, output, outputs, items, segments, and delta. New collectAssistantText() walks result, payloads, messages, choices, message, delta, content, and text.
  • Add a dedicated workflow-boundary check for the new secret-bearing job (.github/workflows/e2e-vitest-scenarios.yaml:665): The new agent-turn-latency-vitest job is secret-bearing and runs installer, OpenShell, sandbox lifecycle, and hosted-inference code. It follows several good workflow patterns, but the workflow-boundary source only applies generic free-standing checks and does not validate this job's exact secret scope, checkout credential persistence, dependency install command, OpenShell install boundary, artifact settings, timeout, stable artifact path, and report coverage.
    • Recommendation: Add a validateAgentTurnLatencyVitestJob-style boundary check and support-test coverage mirroring comparable high-risk free-standing jobs. Assert the expected env, pinned actions, persist-credentials=false, npm ci --ignore-scripts, no secrets outside the Vitest step, artifact upload settings, and report-to-pr needs entry.
    • Evidence: tools/e2e-scenarios/workflow-boundary.mts has dedicated validators for many comparable jobs, but grep found no agent-turn-latency-specific validator. The workflow diff adds NVIDIA_INFERENCE_API_KEY to the Vitest step.
  • Add focused support coverage for migration fidelity and secret-bearing branches (test/e2e-scenario/live/agent-turn-latency.test.ts:1): This PR adds the live scenario and workflow job, but not focused support tests for the helper branches and retained-shell guardrails that changed during migration. Static review can compare the code, but behavior depends on Docker, OpenShell, installer state, hosted inference, daemon readiness, artifact redaction, and sandbox/gateway cleanup.
    • Recommendation: Add behavior-specific support or integration tests for hosted-compatible inference env mapping, OpenClaw JSON envelope compatibility, Hermes health retry/body validation, transient retry cleanup, Hermes local API token redaction/conditional header behavior, OpenClaw transport-error rejection, and agent-turn-latency workflow-boundary invariants.
    • Evidence: New tolerant/recovery paths include positiveInt(), firstJsonObject(), collectAssistantText(), bestEffort(), transient install retry handling, and hermesTurnCommand(), while the PR only adds the live scenario and workflow job.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@cv cv added the v0.0.66 Release target label Jun 19, 2026
@cv cv linked an issue Jun 19, 2026 that may be closed by this pull request
79 tasks
cv added 3 commits June 19, 2026 14:07
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
…t-turn-latency

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
cv added 7 commits June 19, 2026 14:44
…t-turn-latency

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
…ncy' into e2e-migrate/test-agent-turn-latency

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Comment thread test/e2e-scenario/live/agent-turn-latency-helpers.ts Fixed
Comment thread test/e2e-scenario/live/agent-turn-latency-helpers.ts Fixed
cv added 2 commits June 20, 2026 11:34
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
…t-turn-latency

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

v0.0.66 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants