Skip to content

fix(vision): scope the no-vision capability error to the latest user image#8180

Open
Nillth wants to merge 1 commit into
zeroclaw-labs:masterfrom
NNet-Dev:fix/vision-marker-history-poison
Open

fix(vision): scope the no-vision capability error to the latest user image#8180
Nillth wants to merge 1 commit into
zeroclaw-labs:masterfrom
NNet-Dev:fix/vision-marker-history-poison

Conversation

@Nillth

@Nillth Nillth commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Base branch: master
  • What changed and why:
    • Sending an image to a model_provider without vision support (and with no
      vision_model_provider configured) raised provider_capability_error capability=vision
      AND left the [IMAGE:] marker in the long-lived session history. Because the error was
      keyed off a history-wide marker count, every later turn (even plain text) re-counted the
      stale marker and re-failed forever. The RPC/streaming path makes it permanent: it persists
      the inbound user message into the session history before the turn loop runs, so a failed
      image turn leaves its marker behind. A single image to a non-vision provider made the
      session unusable until restart.
    • Scope the capability error to the most recent genuine user message. resolve_vision_provider
      now errors only when the user just sent an image we cannot see; a carried-over marker
      (a prior failed image turn, or a vision -> non-vision model switch mid-session) degrades to
      text-only (markers stripped, surrounding text preserved) so the conversation continues.
    • New count_latest_user_image_markers() in zeroclaw-providers, the turn-scoped counterpart
      to count_user_image_markers(), reusing the same genuine-user-message predicate.
    • The user is still told once, on the turn they send the image; subsequent text turns recover.
  • Scope boundary: No change when a vision_model_provider is configured, no change to
    tool-result image degradation (already degraded), no change to the channel orchestrator path
    (which never persisted failed-turn images). No config / CLI / API / env surface change.
  • Blast radius: resolve_vision_provider is the single shared chokepoint inside
    run_tool_call_loop (one call site, turn/mod.rs), so the behaviour change reaches every
    transport (RPC/streaming, non-streaming, channels). The new function is purely additive.
  • Linked issue(s): None found (no existing issue).
  • Labels: bug applied. risk: and size: are auto-applied by the repo labeler
    (the path scope labels agent/provider/runtime are already on); the auto-risk rule
    classifies runtime-path changes as higher risk, so risk/size are left to the automation.

Validation Evidence (required)

Toolchain pinned to CI's 1.93.0.

cargo fmt --all -- --check                                   # clean
cargo clippy --all-targets -- -D warnings                    # clean (root; validate.sh)
cargo clippy -p zeroclaw-providers -p zeroclaw-runtime -p zeroclaw-channels --all-targets -- -D warnings  # exit 0
cargo test -p zeroclaw-providers -p zeroclaw-runtime
cargo test -p zeroclaw-channels vision

Tail output:

providers: test result: ok. 1022 passed; 0 failed
runtime:   2343 passed; 1 failed; 2 ignored
  count_latest_user_image_markers_scopes_to_newest_user_message ... ok
  run_tool_call_loop_degrades_carried_over_image_on_non_vision_provider ... ok
  run_tool_call_loop_returns_structured_error_for_non_vision_provider ... ok   (preserved)
channels:  test result: ok. 6 passed; 0 failed
  e2e_failed_vision_turn_does_not_poison_follow_up_text_turn ... ok
  e2e_photo_attachment_rejected_by_non_vision_provider ... ok
  • Beyond CI - what I manually verified: Traced the RPC poison path (the streaming path
    persists the user message to the long-lived session history before the loop). Confirmed the
    single call site of resolve_vision_provider. Reproduced the fix end-to-end through the
    shared engine (carried-over image -> stripped to [media attachment], plain-text turn
    succeeds). Confirmed the first-turn capability error is preserved.
  • The one runtime test failure is unrelated:
    cron::store::tests::remove_job_emits_structured_cron_delete_event is a pre-existing
    test-isolation flake (a UUID assert against a process-global log broadcast; a sibling
    remove_job caller races its event in under in-process parallelism). The diff touches no
    cron code, the test passes in isolation, and CI's nextest (process-per-test) isolates it.
  • Intentionally skipped: the exact --features ci-all clippy combo (needs glib-2.0 /
    libudev system libs unavailable on this host). The change touches none of the voice/desktop
    features ci-all adds; deferred to CI. A docs-coverage heuristic WARN fired on the new
    pub fn, but it is an internal cross-crate helper, not a user-facing surface, so no docs are
    needed.

Security & Privacy Impact (required)

  • New permissions, capabilities, or file system access scope? No
  • New external network calls? No
  • Secrets / tokens / credentials handling changed? No
  • PII, real identities, or personal data in diff, tests, fixtures, or docs? No
  • The degrade path only strips media-marker references from text sent to a text-only provider;
    the image bytes never reach the model in either branch, and no trust boundary or policy check
    is affected.

Compatibility (required)

  • Backward compatible? Yes
  • Config / env / CLI surface changed? No
  • Upgrade steps: none. Behaviour change is strictly a bug fix: a previously-poisoned session
    now recovers on the next turn instead of failing permanently.

Rollback (required for risk: medium and risk: high)

  • Fast rollback command/path: git revert <sha> (single, self-contained commit; the new
    function is additive, so reverting cleanly restores the prior history-wide-count behaviour).
  • Feature flags or config toggles: None. (Operators wanting the prior routing can configure
    a vision_model_provider, which is unaffected by this change.)
  • Observable failure symptoms: grep logs for provider_capability_error with
    capability=vision, or the degrade WARN no vision route for carried-over/tool-result image marker(s); degrading to text-only. A regression would show the capability error re-firing on
    plain-text turns after an image, or an image the user just sent being silently dropped.

…image

Sending an image to a model_provider without vision support (and with no
vision_model_provider configured) raised a provider_capability_error AND
left the [IMAGE:] marker in the long-lived session history. The capability
error was triggered by a history-wide marker count, so every later turn,
even plain text, re-counted the stale marker and re-failed forever. The
RPC/streaming path makes this permanent: it persists the user message into
the session history before the loop runs, so a failed image turn leaves its
marker behind. A single image to a non-vision provider made the session
unusable until restart.

Scope the capability error to the most recent genuine user message:

- providers/multimodal: add count_latest_user_image_markers(), the
  turn-scoped counterpart to count_user_image_markers (skips tool-result
  carriers and older user messages).
- runtime/turn/vision_route: error only when the latest user message
  carries an image (the user just sent something we cannot see); a
  carried-over marker, from an earlier failed turn or a vision to
  non-vision model switch, degrades to text-only (markers stripped) so the
  turn continues. This also covers the model-switch case.

The user is still told once, on the turn they send the image; subsequent
text turns recover instead of re-failing.

Tests:
- providers: count_latest_user_image_markers_scopes_to_newest_user_message
- runtime: run_tool_call_loop_degrades_carried_over_image_on_non_vision_provider
  (end-to-end through the shared engine; asserts the carried-over marker is
  stripped and the plain-text turn succeeds)
@github-actions github-actions Bot added agent Auto scope: src/agent/** changed. provider Auto scope: src/providers/** changed. runtime Auto scope: src/runtime/** changed. labels Jun 22, 2026
@Nillth Nillth added the bug Something isn't working label Jun 22, 2026
@Audacity88 Audacity88 added risk: high Auto risk: security/runtime/gateway/tools/workflows. size: M Auto size: 251-500 non-doc changed lines. labels Jun 22, 2026
@Audacity88 Audacity88 added this to the v0.8.3 milestone Jun 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent Auto scope: src/agent/** changed. bug Something isn't working provider Auto scope: src/providers/** changed. risk: high Auto risk: security/runtime/gateway/tools/workflows. runtime Auto scope: src/runtime/** changed. size: M Auto size: 251-500 non-doc changed lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants