fix(memory): resolve dotted embedding provider refs against providers.models#8152
Conversation
….models `[[embedding_routes]]` (and `[memory].embedding_provider`) accept a dotted `<type>.<alias>` provider reference, which passes config validation but was never resolved by the embeddings factory — `create_embedding_provider` only matches `openai` / `openrouter` / `custom:`, so a ref like `openai.default` fell through to `NoopEmbedding`, silently degrading retrieval to keyword-only with no error logged. The embedding-routes feature was effectively dead for any dotted provider ref. Resolve the dotted ref against the canonical `providers.models` catalog when finalizing the embedding profile: an explicit `uri` override becomes `custom:<uri>`, otherwise the bare provider kind is passed through so the factory applies its built-in family default endpoint. Key precedence is per-route / `[memory]` override > the referenced provider's own key > inherited chat key. An unresolvable ref is left intact and logged loudly instead of degrading silently. The catalog is threaded through `create_memory_with_storage_and_routes` from every runtime call site (agent, gateway, daemon RPC, memory CLI, owned-state cascade) and read on demand — no provider state is cached. Non-dotted providers (`openai`, `openrouter`, `custom:<url>`, `none`) are unchanged. Closes zeroclaw-labs#7949
|
Pushed a follow-up ( Resolved families with no embeddings endpoint no longer degrade silently. Heartbeat recall now honors embedding routes. Stable Metadata. Suggesting One thing to flag for reviewers — live provider-state propagation. Validation after the changes:
|
…t routes Follow-up hardening for the zeroclaw-labs#7949 fix: - A dotted ref that resolves in `providers.models` to a family with no usable embeddings endpoint (a non-`openai`/`openrouter` family configured without a `uri`, e.g. `custom.<alias>`) was mapped to its bare kind, which the embeddings factory does not recognize — so it silently fell back to `NoopEmbedding`, bypassing the new unresolved-reference warning. Only pass `openai`/`openrouter` through bare; otherwise leave the reference unresolved and log loudly. A configured route now never degrades in silence. - Route the daemon heartbeat recall through the routes-aware memory factory with the provider catalog, matching the gateway/RPC paths; previously it used the bare `create_memory` entrypoint (empty routes, no catalog) so `[[embedding_routes]]` / dotted refs were ineffective for heartbeat recall. - Add stable `error_key` fields (`memory.embedding_route_unresolved`, `memory.embedding_route_no_endpoint`) to the diagnostic WARN events. Adds regression tests for the resolved-but-no-endpoint case (mutation-checked) and for a `custom` profile with an explicit `uri`.
1021863 to
9c9497f
Compare
…routes The keyword-only fallback assertion alone stays green even if the WARN is removed — so it does not actually guard the "never silent" contract that separates an acceptable loud fallback from the original zeroclaw-labs#7949 bug. Add a test that captures the zeroclaw-log broadcast event and asserts severity = WARN plus the stable `error_key = memory.embedding_route_no_endpoint` (with provider_ref / provider_kind). Mutation-checked: deleting the WARN emission fails this test.
|
Another follow-up ( Diagnostic is now asserted, not just the fallback. Good catch — the On resolving provider endpoint/key from live config vs. construction-time materialization — I want to lay out why I've scoped this as a follow-up rather than fold it into this PR, and I'm happy to be overruled:
The correct long-term fix is either live resolution in the embedder or a memory-refresh hook on Validation after these changes:
|
WareWolf-MoonWall
left a comment
There was a problem hiding this comment.
Review — PR #8152
fix(memory): resolve dotted embedding provider refs against providers.models
Author: OmkumarSolanki (Omkumar Solanki) | Head: 2aa4cd75f68abdd051ecae3e9b5a1cf5139943a5
Reviewer: WareWolf-MoonWall | Verdict: --approve
🟢 Resolution correctness — find_by_name splits on first dot; all edge cases handled
resolve_provider_ref correctly detects dotted refs via:
let is_dotted_ref =
!trimmed.is_empty() && !trimmed.starts_with("custom:") && trimmed.contains('.');custom: URLs are excluded (they contain a colon, not a dot leading to a provider alias). find_by_name in providers.rs splits on the first dot via name.split_once('.'), performs a typed map lookup by kind, and returns None on an unknown kind or missing alias.
Edge cases:
- Missing provider ref:
find_by_namereturnsNone→ logged at WARN witherror_key=memory.embedding_route_unresolved, original dotted string preserved asmodel_provider, factory getsNoopEmbedding. No silent degradation. - Resolved family with no embeddings endpoint: only
openai/openrouterpass through bare; any other family without auri(e.g. acustom.<alias>without auri) hits theNonearm → logged at WARN witherror_key=memory.embedding_route_no_endpoint. No silent degradation. - Resolved family with explicit
uri: becomescustom:<uri>, which the embeddings factory handles as any OpenAI-compatible endpoint. - Circular refs: not possible — provider names are plain strings, not recursive structures.
- Deeply nested refs: not applicable —
find_by_namesplits on first dot only, which is the correct<kind>.<alias>format. A ref likeopenai.default.extrawould be treated as kind=openai, alias=default.extra, which would miss in the alias map and returnNone, logging loudly.
🟡 Warning — construction-time snapshot, not live resolution; acknowledged, follow-up needed
OpenAiEmbedding holds its resolved base_url/api_key at construction time. A long-lived install-wide RpcContext.memory built at daemon startup will not observe a config/set hot-update to the referenced provider profile until the daemon restarts or a memory-refresh hook (analogous to refresh_live_sessions_for_model_provider for chat) is added.
The author has been transparent about this limitation, explained the pre-existing pattern, and documented the startup-ordering constraint that prevents live resolution in a single PR. This is accepted as a scoped limitation for this fix. However, a follow-up issue tracking the memory-refresh hook (or an on-demand resolver pattern) should be filed and linked. The existing pattern of snapshot-at-build is pre-existing behavior, not newly introduced by this PR.
🟢 SSOT — no new persistent config field introduced
ResolvedEmbeddingConfig is a transient local, computed and consumed within create_memory_with_storage_and_routes. The new providers: Option<&ModelProviders> parameter is a borrow resolved at call time — it is NOT stored in any struct field. No struct in zeroclaw-memory or its callers gains a cached copy of provider state.
🟢 Error handling — loud WARN, stable error keys, not silently swallowed
Both failure modes emit structured WARN events through zeroclaw_log::record!:
error_key=memory.embedding_route_unresolved— ref not inproviders.modelserror_key=memory.embedding_route_no_endpoint— resolved family has no usable embeddings endpoint
These are log events (not user-facing CLI output), so English messages with stable error_key fields are correct per AGENTS.md (§ RFC #5653 §4.6). No fl!() required here.
The Rollback section correctly cites both error_key values as observable failure symptoms, matching the code.
🟢 Fluent / i18n — no new user-facing strings
No new CLI output, no new user-facing error messages. All new text is log-level WARN events with English messages and stable error_key attributes. Correct.
🟡 Warning — risk label missing; should be risk: high
The PR labels show daemon, gateway, memory, runtime with no risk label. Per AGENTS.md: "High risk: crates/zeroclaw-runtime/src/** (especially src/security/), crates/zeroclaw-gateway/src/** …" — this PR touches both. The author correctly identified risk: high in the PR body but lacks permission to apply the label. A maintainer should set risk: high and size: M before merge.
🟢 Rollback — complete, with SHAs and observable symptoms
git revert 9c9497fd4 14fe60917
Two self-contained commits, revert restores prior keyword-only behavior. Observable symptoms and feature-flag disable path ([memory].embedding_provider = "none") documented. Correct for a risk: high change.
🟢 Heartbeat worker now honors embedding routes
run_heartbeat_worker was previously using the bare create_memory entrypoint (empty routes, no catalog), causing [[embedding_routes]] to be ignored during heartbeat recall. Routing it through create_memory_with_storage_and_routes with config.embedding_routes and the catalog closes this gap. This was correctly self-caught and addressed in the follow-up commit.
🟢 Tests — thorough, mutation-checked, diagnostic asserted
- 19
resolve_embedding_config*tests covering all code paths. - 7 new regression tests for dotted-ref resolution.
resolve_embedding_config_no_endpoint_emits_loud_warningcaptures the broadcast log event and assertsseverity_text == "WARN"andattributes.error_key == "memory.embedding_route_no_endpoint"— this is the correct way to guard the "never silent" contract; a fallback-only assertion would stay green if the WARN were deleted.- Mutation checks performed: forcing
is_dotted_ref = falsefails 4 tests; restoring the old silentNone => kindmapping fails the no-endpoint test. - 9435 total tests passing, 11 skipped.
🟢 PR template — complete
All required sections present: Summary (detailed scope boundary, blast radius), Validation Evidence (literal command output with full nextest run), Security & Privacy Impact (checklist, api_key handling correctly noted), Compatibility (breaking internal API change scoped and explained), Rollback.
🟢 CI — fully green
All 18 quality-gate checks pass: Format, Lint, Test, Build x86_64, Check matrix (32-bit, all-features, no-default, aarch64-apple-darwin, x86_64-pc-windows-msvc), Docs Style, Installer Drift, Nix Module Eval, Security, Zerocode RPC Boundary, CI Required Gate.
Summary
Verdict: --approve. The fix correctly resolves a dead-feature bug ([[embedding_routes]] dotted refs silently degrading to Noop), handles all relevant edge cases loudly, introduces no duplicate state, is thoroughly tested, and has a complete PR template and full green CI.
Two action items for maintainer before squash-merge:
- Apply
risk: highandsize: Mlabels (author flagged this; self-assessed correctly). - File a follow-up issue tracking live provider-state propagation for the memory embedding backend (construction-time snapshot vs.
config/sethot-reload), linking to the discussion in this PR's comments.
|
Labels note for maintainers: the diff touches memory + runtime + gateway resolution paths — the auto-applied labels show no |
Summary
master[[embedding_routes]](and[memory].embedding_provider) accept a dotted<type>.<alias>provider reference. It passes config validation, but the embeddings factory (create_embedding_provider) only matchesopenai/openrouter/custom:— so a ref likeopenai.defaultfell through toNoopEmbeddingand silently degraded retrieval to keyword-only, with no error logged. The embedding-routes feature was effectively dead for any dotted ref (issue [Bug]: [[embedding_routes]] silently degrades to NoopEmbedding (route feature effectively dead) #7949).providers.modelscatalog when finalizing the embedding profile, reusing the existingModelProviders::find_by_name: an explicituribecomescustom:<uri>; with nouri, anopenai/openrouterfamily passes through to the factory's built-in default endpoint.openai/openrouterfamily configured without auri, e.g.custom.<alias>), is left unresolved and logged loudly with a stableerror_key.[memory]override > the referenced provider's own key > inherited chat-seed key.create_memory_with_storage_and_routesfrom every runtime construction site — agent-scoped factory, gateway, daemon RPC, daemon heartbeat recall, memory CLIreindex, owned-state cascade — and read on demand at construction; no new persistent config field is introduced.EmbeddingProvidertrait /OpenAiEmbeddingHTTP client, non-dotted provider values (openai,openrouter,custom:<url>,nonestay byte-for-byte identical), or any non-embedding provider resolution path. Does not add config keys or feature flags, and adds no struct fields.zeroclaw-memorybackend construction and its callers (gateway, runtime daemon incl. heartbeat, root binary CLI/alias paths). Only profiles whose embeddingmodel_provideris a dotted ref change behavior; everything else is unchanged.create_memory_with_storage_and_routesgained one trailingOption<&ModelProviders>parameter (Beta-tier crate; all in-tree callers updated).risk:/size:):bug,risk: high(per AGENTS.md path rules this touchescrates/zeroclaw-runtime/src/**andcrates/zeroclaw-gateway/src/**, which classify as high — overriding the issue's originalrisk: medium),size: M,memory,config,runtime,gateway,daemon,priority:p2.Validation Evidence (required)
Environment:
rustc 1.96.0 (ac68faa20 2026-05-25). Full battery, literal tails:zeroclaw-memory(19resolve_embedding_config*tests total, all green): dotted route ref →custom:<uri>+ provider key (with an end-to-end assert that the built embedder is"openai", i.e. real, not the"none"Noop); dotted ref withouturi→ bareopenai/openrouter; route key overrides provider key; unresolvable ref left intact (not silent); dotted baseembedding_providerresolves; resolved-but-no-endpoint family (custom.<alias>withouturi) left unresolved (not silently mapped to a working-looking provider);customprofile withuriresolves.is_dotted_ref = falsemakes 4 dotted-resolution tests fail ("openai.default"vscustom:..., embedder name"none"); (2) restoring the oldNone => kindsilent mapping makes the no-endpoint test fail (left: "custom"vsright: "custom.myembed").Nonecatalog → identical behavior). Verified feature-gated call sites compile under--features gateway,agent-runtime.Security & Privacy Impact (required)
providers.models.api_keyis read fromproviders.modelsand used for the embedding request, with precedence per-route/[memory]override > referenced provider key > inherited chat key. Keys are read on demand from canonical config; the WARN logs record only the provider reference string / family name, never the key.api.example.com/embed.local/ placeholdersk-*).Compatibility (required)
zeroclaw_memory::create_memory_with_storage_and_routesgained a trailingproviders: Option<&ModelProviders>parameter — Beta-tier crate, all in-tree callers updated in this PR.[[embedding_routes]]with a dottedmodel_providerthat points at an OpenAI-compatible profile will see embeddings start working; a route pointing at a family with no embeddings endpoint now logs a clear WARN instead of silently doing nothing.Rollback (required for
risk: mediumandrisk: high)git revert 9c9497fd4 14fe60917(two self-contained commits on this branch), or revert the squash-merge commit.[memory].embedding_provider = "none"/ removing the route, exactly as before.error_key=memory.embedding_route_unresolved(ref not in catalog),error_key=memory.embedding_route_no_endpoint(resolved family has no embeddings endpoint), orEmbedding API error <status>(runtime failure from a newly-active operator endpoint). If embeddings misbehave, the revert restores the prior keyword-only fallback behavior.