fix(vault): sync writes to memory_tree, not legacy UnifiedMemory (#2705)#2720
fix(vault): sync writes to memory_tree, not legacy UnifiedMemory (#2705)#2720justinhsu1477 wants to merge 4 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughVault sync has been rerouted to ingest per-file content through the memory-tree pipeline via stable per-file ChangesMemory-tree migration and regression fix
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/openhuman/vault/sync.rs (1)
136-136: 💤 Low valueUnused field
existing_doc_idafter migration.The
existing_doc_idfield is populated at line 452 but never read in the newprocess_fileimplementation. This appears to be a remnant from the legacydoc_ingestpath.🧹 Remove unused field
struct FileToProcess { rel_path: String, title: String, path: PathBuf, mtime_ms: i64, bytes: u64, ext: String, /// Content hash from the previous successful sync, for secondary dedup. prev_hash: Option<String>, - /// Document ID to update on re-ingest (keeps embedding lineage stable). - existing_doc_id: Option<String>, /// Memory namespace (`vault:<id>`). namespace: String, /// Vault id for tags and state updates. vault_id: String, }And at line 452:
- existing_doc_id: prev.map(|p| p.document_id.clone()),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/vault/sync.rs` at line 136, The field existing_doc_id in the struct (declared as existing_doc_id: Option<String>) is no longer used by the new process_file implementation; remove the unused field and any assignments to it (the population at the site where existing_doc_id is set around the former doc_ingest path, e.g., the code at the location that sets existing_doc_id near line 452) and clean up related variables/usages so the struct and process_file logic only include necessary fields; ensure compilation by removing or refactoring any code that referenced existing_doc_id (search for existing_doc_id, its assignment, and any related comments) and run cargo build/tests to confirm no remaining references.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/openhuman/vault/sync.rs`:
- Line 136: The field existing_doc_id in the struct (declared as
existing_doc_id: Option<String>) is no longer used by the new process_file
implementation; remove the unused field and any assignments to it (the
population at the site where existing_doc_id is set around the former doc_ingest
path, e.g., the code at the location that sets existing_doc_id near line 452)
and clean up related variables/usages so the struct and process_file logic only
include necessary fields; ensure compilation by removing or refactoring any code
that referenced existing_doc_id (search for existing_doc_id, its assignment, and
any related comments) and run cargo build/tests to confirm no remaining
references.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9d4a2856-3a34-4876-93db-6363db338593
📒 Files selected for processing (1)
src/openhuman/vault/sync.rs
…tion CodeRabbit nitpick on tinyhumansai#2720: `existing_doc_id` was populated from the prior-sync ledger row for the legacy `doc_ingest` path (used as the update key for `memory_docs`), but the memory_tree pipeline keys on `source_id` directly so the field has no reader after the migration. While here, dropping `namespace` for the same reason — the legacy `IngestDocParams.namespace = "vault:<id>"` no longer flows anywhere because the new path encodes the same scope into the stable `source_id = vault:{vault_id}:{rel_path}` and tags chunks with `vault:{vault_id}` directly. No behavior change — purely struct-side cleanup. 3/3 vault::sync regression tests still pass.
…ansai#2720) Pre-existing integration test `vault_sync_roundtrip_updates_memory_and_ledger` asserted the old UnifiedMemory behavior via `list_documents(namespace)` — which was the silent-failure surface this PR set out to fix. After the memory_tree migration in tinyhumansai#2720 (commit 0181448), vault sync no longer writes to `memory_docs`, so the legacy assertion was guaranteed to fail. Replaced the document-list assertions with direct memory_tree probes: * First sync — assert `count_chunks(&config) > 0` and `is_source_ingested(config, SourceKind::Document, "vault:{id}:{path}")` for both files. Also pin the ledger contract: `document_id` now encodes the memory-tree source_id (prefix `vault:`). * Second sync — pin the full lifecycle: - `notes/one.md` content-updated → still registered (delete + re-ingest via `delete_chunks_by_source` then `ingest_pipeline::ingest_document`). - `docs/two.json` file removed → no longer registered (Phase 4 must delete from `mem_tree_ingested_sources`). - `docs/three.toml` new file → freshly registered. Removed the now-dead `documents_from_payload` helper since no caller remains. `memory_global::init` is still called in setup (initialises the in-process memory client the way a real launch would). Verification: `cargo test --test vault_sync_e2e` passes; full lib suite `cargo test --lib vault` 28/28 pass.
|
justinhsu1477 could you please fix the tests and then push? |
…l_sync pollution `memory_ingestion_status_reflects_initialized_client_snapshot` was asserting `status.queue_depth == 1` against the process-global `IngestionState`, on the assumption that the lock acquired at the top of the test guarantees a clean baseline. The two sibling tests (`memory_sync_channel_publishes_targeted_event` and `memory_sync_all_publishes_broadcast_event`) call `memory_sync_channel` / `memory_sync_all` → `spawn_manual_sync`, which detaches a `tokio::spawn(...)` background task that runs `composio::run_connection_sync`. That detached task can enqueue an ingestion job via `MemoryClient::store_skill_sync` → `IngestionQueue::submit` → `state.enqueue()`. By the time this test acquires `GLOBAL_MEMORY_TEST_LOCK`, that background work may already have bumped the global counter — so the equality assertion fires with `left: 2, right: 1`. CI hit this on tinyhumansai#2720 (\"Rust Core Tests + Quality\") despite the lock because the lock only serialises against tests that *also* acquire it — not the detached worker that was spawned earlier and is still draining. The same flake is reachable on `upstream/main` in CI with a parallel-test ordering that puts the snapshot test directly after one of the sync-channel tests; my PR's added vault sync tests didn't cause it, but they shifted the ordering enough to expose it. The principled fix is to snapshot the baseline `queue_depth` at the start of the test and assert the delta. That keeps the test's contract intact (an `enqueue + mark_running` round-trip must surface in the snapshot) without depending on the rest of the suite leaving the global state pristine. Targeted run on the failure window passes: cargo test --lib -- memory::ops::sync vault::sync → 9/9 pass
|
@YellowSnnowmann thanks — pushed `f8b50543` about 5 min before your ping. The failing test (`memory_ingestion_status_reflects_initialized_client_snapshot`) was asserting `queue_depth == 1` against the process-global `IngestionState`. Sibling tests in the same module call `spawn_manual_sync`, which detaches a `tokio::spawn(...)` that runs `composio::run_connection_sync` → `MemoryClient::store_skill_sync` → `IngestionQueue::submit` → `state.enqueue()`. That background task can still be draining when the snapshot test acquires `GLOBAL_MEMORY_TEST_LOCK`, so the global counter isn't 0. Fix is to snapshot the baseline and assert the delta — pre-existing flake on `upstream/main`, the added vault sync tests just shifted the parallel-test ordering enough to expose it. Full reasoning in the commit body. `cargo test --lib -- memory::ops::sync vault::sync` passes 9/9 locally; CI re-running now. |
| // singleton therefore can't be assumed to start at queue_depth=0. | ||
| // Capture the baseline and assert the delta instead. | ||
| let baseline_depth = state.snapshot().queue_depth; | ||
|
|
There was a problem hiding this comment.
🛑 Blocker — unresolved merge conflict against main. Merging this branch with current main leaves git conflict markers around this line, so the merged tree won't compile. Both sides add the same let baseline_depth = state.snapshot().queue_depth; de-flake (main landed an equivalent fix independently) — just with different comments. Resolve by keeping one baseline_depth capture and dropping the duplicate + markers before this can merge.
| chunks_written, | ||
| already_ingested, | ||
| ); | ||
| IngestFileResult::Ingested { |
There was a problem hiding this comment.
Nitpick — false-success edge case. An ingest_document returning Ok(IngestResult { already_ingested: true, chunks_written: 0, .. }) still lands in this Ingested arm and writes a "success" ledger row. The delete-first guard above prevents this on the normal update path, but on a ledger↔memory-tree desync (ledger wiped while the mem_tree_ingested_sources row survives) it would report success while nothing reaches retrieval — the exact false-success mode this PR exists to kill. Consider a log::warn! (or a distinct result) when chunks_written == 0 && already_ingested.
| // pre-ingest cleanup on content updates. Cloning per-task keeps the | ||
| // borrow life-cycle simple inside `buffer_unordered` (no `&'a Config` | ||
| // bouncing through closures). | ||
| let config_for_workers = config.clone(); |
There was a problem hiding this comment.
Nitpick — per-file Config clone. config.clone() runs once per candidate file as the stream is polled. buffer_unordered keeps only ~4 alive at a time, so memory is bounded, but a large vault still performs N full Config clones. Wrapping in Arc<Config> and cloning the Arc would make this O(1)-per-file. Minor.
| // ledger rows that pre-date the migration and still carry a legacy | ||
| // UnifiedMemory document_id. | ||
| let stored_id = prev.document_id.clone(); | ||
| let source_id = if stored_id.starts_with("vault:") { |
There was a problem hiding this comment.
Question — legacy doc_id fallback is a no-op for genuinely-legacy rows. For ledger rows that pre-date #2705, the stored document_id is a UnifiedMemory id whose data lived in memory_docs, never in memory_tree. Recomputing a vault:{id}:{path} source_id and calling delete_chunks_by_source therefore deletes nothing, and the old UnifiedMemory doc is never cleaned up when such a file is removed. Intentional (deferred to the UnifiedMemory-removal follow-up), or should the fallback branch also doc_delete(prev.document_id) during the migration window so legacy rows don't leak?
sanil-23
left a comment
There was a problem hiding this comment.
PR #2720 — fix(vault): sync writes to memory_tree, not legacy UnifiedMemory (#2705)
Walkthrough
This PR migrates the vault sync path off the legacy UnifiedMemory / memory_docs backend onto the memory-tree pipeline (mem_tree_chunks + mem_tree_ingested_sources), which is what every modern retrieval surface actually reads from. process_file now ingests via ingest_pipeline::ingest_document against a stable source_id = vault:{vault_id}:{rel_path}, deletes prior chunks before re-ingesting on content updates (to defeat the content-blind already_ingested gate), and routes vanished-file deletions through delete_chunks_by_source. The diagnosis and the core sync fix are sound and well-tested.
However, this PR cannot merge as-is. Two issues block it: an unresolved merge conflict against current main, and a stale legacy-backend call in the vault-removal/purge path that this migration leaves orphaning data — the same silent-failure class the PR set out to fix.
Changes
| File | Summary |
|---|---|
src/openhuman/vault/sync.rs |
Core fix: ingest via memory-tree pipeline with stable vault:{id}:{rel} source ids; delete-then-reingest on content change; Phase-4 deletions via delete_chunks_by_source; migration-safe legacy doc_id fallback; 3 new unit tests. |
tests/vault_sync_e2e.rs |
Updated e2e to assert against mem_tree_chunks / mem_tree_ingested_sources instead of list_documents; drops the unused documents_from_payload helper. |
src/openhuman/memory/ops/sync.rs |
Test-only: capture queue_depth baseline and assert delta (de-flake). Currently carries unresolved conflict markers vs main. |
Actionable comments (2)
🛑 Blockers
1. src/openhuman/memory/ops/sync.rs:370-387 — Unresolved merge conflict against main
Merging this branch with the current main leaves git conflict markers in this file, so the merged tree does not compile. Both sides introduce the same fix — let baseline_depth = state.snapshot().queue_depth; — with different explanatory comments; main independently landed an equivalent de-flake. Trivial to resolve (keep one capture, drop the duplicate and the markers), but it must be resolved before this PR can merge. See the inline comment for the suggested resolution.
⚠️ Major
2. src/openhuman/vault/ops.rs:116-136 — vault_remove(purge_memory=true) still purges the legacy backend, orphaning all memory-tree data
Not in the diff, but a direct downstream consequence of this migration. Vault content now lives in mem_tree_chunks / mem_tree_ingested_sources (keyed vault:{id}:{rel_path}), yet the purge path still calls clear_namespace(v.namespace) → MemoryClient::clear_namespace, which only touches the legacy vectors / memory_docs tables (memory_store/vectors/store.rs:386) — tables this PR no longer writes to.
Result: removing a vault with purge now deletes nothing meaningful and orphans every mem_tree_chunks row, so retrieval keeps surfacing content from a deleted vault. That is the exact silent-failure class this PR fixes, re-appearing on the removal side. The in-sync Phase-4 deletion only covers per-file vanishing, not whole-vault removal.
The codebase already has the helper and an established pattern for this (composio/ops.rs:560, channels/controllers/ops.rs:565):
use crate::openhuman::memory_store::chunks::store::delete_chunks_by_source_prefix;
use crate::openhuman::memory_store::chunks::types::SourceKind;
let cfg = config.clone();
let prefix = format!("vault:{id}:");
let purged = tokio::task::spawn_blocking(move || {
delete_chunks_by_source_prefix(&cfg, SourceKind::Document, &prefix)
})
.await;
// (optionally also keep clear_namespace for migration-window legacy rows)At minimum, acknowledge this in the PR body's "Out of scope" section if you intend to defer it.
Nitpicks (3)
src/openhuman/vault/sync.rs:250-270— A successfulingest_documentreturningalready_ingested: truewithchunks_written == 0is still counted asIngestedand writes a "success" ledger row. The delete-first guard prevents this on the normal path, but a ledger↔memory-tree desync would re-introduce false-success reporting. Consider alog::warn!whenchunks_written == 0 && already_ingested. (inline)src/openhuman/vault/sync.rs:474-476—Configis cloned once per candidate file;Arc<Config>would avoid N copies on large vaults. (inline)tests/vault_sync_e2e.rs:117-145vssrc/openhuman/vault/sync.rs:748-766— inconsistentspawn_blockingconvention between the e2e and unit tests for the sameis_source_ingested/count_chunkscalls; pick one.sync_writes_to_memory_treecould also tightenchunks_after > chunks_beforetochunks_before == 0(fresh tempdir).
Questions for the author (2)
src/openhuman/vault/sync.rs:543-552— for genuinely pre-#2705 ledger rows the recomputed memory-tree source_id has no chunks (data was inmemory_docs), sodelete_chunks_by_sourceis a no-op and the legacy doc leaks on deletion. Deferred, or should the fallback alsodoc_deletethe legacy id? (inline)src/openhuman/vault/types.rs:56— isVault.namespacestill meaningful post-migration? Its only remaining consumer is the (now-misdirected) purge call; fixing #2 by prefix-delete may make it vestigial.
Verified / looks good
delete_chunks_by_sourceclears bothmem_tree_chunksandmem_tree_ingested_sources(memory_store/chunks/store.rs:809-832), so the delete-then-reingest content-update path genuinely resets thealready_ingestedgate. The central design assumption holds.- Error handling is fail-closed: a failed pre-reingest delete returns
Failedrather than re-ingesting alongside stale chunks; the next sync self-heals. delete_chunks_by_sourceand the Phase-4 deletion correctly run underspawn_blocking(SQLite is sync I/O).sync_writes_to_memory_treeprecisely pins the #2705 regression (chunk count up + both source ids registered + ledger stores thevault:source id); the e2e exercises the full update / delete / add lifecycle.
…yhumansai#2705) Before this fix, `vault::sync::sync_vault` walked the vault directory, read each file, and called `doc_ingest` → `MemoryClient::ingest_doc` → `UnifiedMemory::ingest_document`. UnifiedMemory wrote to the legacy `memory_docs` table. The user's "synced" message was therefore technically correct (the legacy backend accepted the writes), but every modern retrieval surface — `memory.search`, `tree.read_chunk`, `tree.browse`, the agent's recall path, the summary-tree builder — reads from the memory-tree backend (`mem_tree_chunks` + `mem_tree_ingested_sources`). Vault data was invisible to all of them. That explains the entire bug report (tinyhumansai#2705): * `SELECT COUNT(*) FROM mem_tree_chunks` returns 0 (vault sync never wrote there) * `SELECT COUNT(*) FROM mem_tree_ingested_sources` returns 0 (same) * "Memory wiped: 0 rows removed" (wipe clears memory_tree, which was already empty) * "Build summary trees" produces no jobs (memory_tree has no chunks) * Agent can't recall vault content (agent retrieval reads memory_tree) Root cause: recent PRs (tinyhumansai#2585, tinyhumansai#2556, tinyhumansai#2574) migrated RAG primitives to memory_tree as the canonical layer, but the vault sync path was not migrated alongside it. ## Fix `process_file` now calls `memory::ingest_pipeline::ingest_document` directly with a stable `source_id` of the form `vault:{vault_id}:{rel_path}`. The pipeline writes to `mem_tree_chunks` + `mem_tree_ingested_sources` — exactly the tables the modern retrieval stack reads from. Three additional design choices: 1. **Content-update path** — the pipeline's `already_ingested( SourceKind::Document, source_id)` gate is content-blind and the source_id is stable per file path. For real content updates the vault layer drops prior chunks via `memory_store::chunks::store::delete_chunks_by_source` *before* the re-ingest, otherwise the new content gets short-circuited. The vault ledger's `content_hash` check still gates whether we run the delete+reingest at all, so untouched files cost zero pipeline work. 2. **Deletion path** — Phase 4 (`by_path` entries the walk didn't see) now calls `delete_chunks_by_source` instead of the legacy `doc_delete`. Handles migration too: ledger entries whose stored `document_id` doesn't start with `vault:` (rows persisted before this fix) fall back to a recomputed source_id rather than failing. 3. **Vault ledger** — the `VaultFile.document_id` field is now the memory-tree source_id. Schema column name unchanged for backward compat with persisted rows; only the semantic of what we store changed. Deletion uses the prefix-check above to handle the migration window. ## Tests Three new regression tests in `vault::sync::sync_tests` (28/28 vault suite passes): * `sync_writes_to_memory_tree` — the tinyhumansai#2705 regression. Creates a vault with two .md files, runs `sync_vault`, asserts `count_chunks` goes up and both source_ids appear in `mem_tree_ingested_sources`. Also pins the ledger contract: `document_id` must start with `vault:` so the deletion-path prefix check stays correct. * `second_sync_with_no_changes_is_idempotent` — pins that re-sync with unchanged content does not duplicate chunks (the vault-layer hash dedup guards the pipeline). * `vault_source_id_is_stable_and_namespaced` — unit test on the id format itself; defends against an accidental rename that would break the `already_ingested` gate's cross-file isolation. `cargo test --lib vault` 28/28 pass. `cargo check --lib` + `cargo fmt --check` + `cargo test --tests --no-run` all clean (lesson from tinyhumansai#2603 — must compile the integration-test target before push). ## Out of scope (separate audits / PRs) * Composio providers and agent_experience still call `doc_ingest` → UnifiedMemory. If they have a similar gap, that's a separate audit; this PR is scoped to the vault path the user reported. * Removing `UnifiedMemory` entirely is the larger follow-up senamakel listed on tinyhumansai#2585; out of scope here. * In-flight tracing during phase 2 / 3 is left at the existing `log::debug!` level — the new ingestion log lines along with the pre-existing entry/exit logs cover the silent-failure surface the issue's reporter would have used to triage. Closes tinyhumansai#2705.
…tion CodeRabbit nitpick on tinyhumansai#2720: `existing_doc_id` was populated from the prior-sync ledger row for the legacy `doc_ingest` path (used as the update key for `memory_docs`), but the memory_tree pipeline keys on `source_id` directly so the field has no reader after the migration. While here, dropping `namespace` for the same reason — the legacy `IngestDocParams.namespace = "vault:<id>"` no longer flows anywhere because the new path encodes the same scope into the stable `source_id = vault:{vault_id}:{rel_path}` and tags chunks with `vault:{vault_id}` directly. No behavior change — purely struct-side cleanup. 3/3 vault::sync regression tests still pass.
…ansai#2720) Pre-existing integration test `vault_sync_roundtrip_updates_memory_and_ledger` asserted the old UnifiedMemory behavior via `list_documents(namespace)` — which was the silent-failure surface this PR set out to fix. After the memory_tree migration in tinyhumansai#2720 (commit 0181448), vault sync no longer writes to `memory_docs`, so the legacy assertion was guaranteed to fail. Replaced the document-list assertions with direct memory_tree probes: * First sync — assert `count_chunks(&config) > 0` and `is_source_ingested(config, SourceKind::Document, "vault:{id}:{path}")` for both files. Also pin the ledger contract: `document_id` now encodes the memory-tree source_id (prefix `vault:`). * Second sync — pin the full lifecycle: - `notes/one.md` content-updated → still registered (delete + re-ingest via `delete_chunks_by_source` then `ingest_pipeline::ingest_document`). - `docs/two.json` file removed → no longer registered (Phase 4 must delete from `mem_tree_ingested_sources`). - `docs/three.toml` new file → freshly registered. Removed the now-dead `documents_from_payload` helper since no caller remains. `memory_global::init` is still called in setup (initialises the in-process memory client the way a real launch would). Verification: `cargo test --test vault_sync_e2e` passes; full lib suite `cargo test --lib vault` 28/28 pass.
graycyrus
left a comment
There was a problem hiding this comment.
[APPROVED]
This is a solid fix for #2705. Vault sync now correctly routes through the memory-tree pipeline (mem_tree_chunks + mem_tree_ingested_sources) instead of the legacy UnifiedMemory backend. The migration is clean: stable per-file source_ids ensure idempotency, content updates properly delete stale chunks before re-ingesting, and pre-fix ledger rows are handled gracefully.
Tests are comprehensive — three unit tests pin the regression and the invariants, plus e2e validation of the full lifecycle. Code patterns are solid (async/blocking correctly separated, error context preserved, logging appropriate).
No issues. Ship it.
…i#2720 Three improvements from @sanil-23's review pass: 1. **Surface ledger↔memory_tree desync as a `warn!` log instead of a silent success** (L263). When `ingest_document` returns `Ok { already_ingested: true, chunks_written: 0 }` we still land in the `Ingested` arm and write a fresh ledger row — but no chunks actually reach retrieval. The delete-first guard above prevents this on the normal update path, so seeing it means ledger and memory_tree are out of sync. That's the exact silent-failure mode this PR set out to kill, so it now logs at `warn!` with the suggestion for a manual `delete_chunks_by_source` resync. 2. **Share a single `Config` allocation across the buffer_unordered workers via `Arc<Config>`** (L474). The previous loop did one `config.clone()` per candidate file. With `Arc` a 5k-file vault pays one clone + N atomic ref-count bumps instead of N full `Config` deep-clones — measurably cheaper on cold backfills. Signature change: `process_file(config: Arc<Config>, ...)`. 3. **Run a parallel UnifiedMemory `doc_delete` on the legacy-ledger fallback path** (L544). Pre-tinyhumansai#2705 ledger rows store a UnifiedMemory `{ts}_{hex}` id whose data lives in `memory_docs`, not `mem_tree_*`. Recomputing the memory_tree source_id and running `delete_chunks_by_source` deletes nothing on those rows — so without a parallel `doc_delete` the legacy data leaked until UnifiedMemory removal lands (tinyhumansai#2585 follow-up). The deletion path now does both during the migration window so vanished files actually go away. `doc_delete` failures on the legacy path are best-effort: a 404 / already-absent there shouldn't block the canonical `delete_chunks_by_source` cleanup below. Tests: 9/9 in `memory::ops::sync` + `vault::sync` pass. `cargo check --lib`, `cargo fmt --check`, and `cargo test --tests --no-run` all clean.
e7c61b1
f8b5054 to
e7c61b1
Compare
|
@sanil-23 thanks for the thorough pass — all four addressed in `e7c61b14` (rebased on `upstream/main` so the #2737 conflict is gone). 🛑 L378 conflict — Resolved by rebase: dropped my standalone `baseline_depth` commit since #2737 already landed the same de-flake with a cleaner `state.reset_for_test()` helper. Kept that, dropped mine. L263 false-success — Added a `log::warn!` for the `already_ingested == true && chunks_written == 0` branch. The delete-first guard prevents this on the normal update path, so seeing it means ledger ↔ memory_tree desync — exactly the silent mode this PR exists to kill. Log line names the source_id + suggests a manual `delete_chunks_by_source` resync. L474 per-file Config clone — Switched to `Arc`. `process_file` now takes `Arc`, the call site does one outer `Arc::new(config.clone())` then `Arc::clone(&...)` per candidate. A 5k-file backfill pays one deep clone + N ref-count bumps instead of N full clones. L544 legacy fallback no-op — You're right, it was silently leaving legacy `memory_docs` rows behind. Migration-window fix: when `document_id` doesn't start with `vault:` (= pre-#2705 ledger row), the deletion path now runs `doc_delete(prev.document_id)` against UnifiedMemory in parallel with `delete_chunks_by_source` against the recomputed memory_tree source_id. `doc_delete` failures are best-effort (a 404 on the legacy path shouldn't block the canonical memory_tree cleanup). `cargo test --lib -- memory::ops::sync vault::sync` → 9/9 pass. Pushed as force-with-lease since the rebase was unavoidable. |
Closes #2705.
Root cause
vault::sync::sync_vaultwalked the directory, calleddoc_ingest→ `MemoryClient::ingest_doc` → `UnifiedMemory::ingest_document` → `memory_docs` table. The UI's "synced" message was technically correct (UnifiedMemory accepted the writes), but every modern retrieval surface — `memory.search`, `tree.read_chunk`, `tree.browse`, the agent's recall path, the summary-tree builder — reads from the memory-tree backend (`mem_tree_chunks` + `mem_tree_ingested_sources`).Vault data was invisible to all of them.
That explains the entire bug report:
Recent PRs (#2585, #2556, #2574) migrated RAG primitives to memory_tree as the canonical layer; the vault sync path was not migrated alongside.
Fix
`process_file` now calls `memory::ingest_pipeline::ingest_document` directly with a stable `source_id = vault:{vault_id}:{rel_path}`. The pipeline writes to `mem_tree_chunks` + `mem_tree_ingested_sources` — the tables the modern retrieval stack reads from.
Three design choices worth calling out:
Content-update path — the pipeline's `already_ingested` gate is content-blind and the source_id is stable per file path. For real content updates the vault layer drops prior chunks via `memory_store::chunks::store::delete_chunks_by_source` before the re-ingest, otherwise new content gets short-circuited. The vault ledger's `content_hash` check still gates whether we run delete+reingest at all, so untouched files cost zero pipeline work.
Deletion path (Phase 4) — `by_path` entries the walk didn't see now call `delete_chunks_by_source` instead of `doc_delete`. Migration-safe: ledger entries whose stored `document_id` doesn't start with `vault:` (rows persisted before this fix) fall back to a recomputed source_id.
Vault ledger semantic — the `VaultFile.document_id` field now holds the memory-tree source_id. Schema column name unchanged for backward compatibility with persisted rows; only the semantic of what we store changed. Deletion uses the prefix-check above to handle the migration window cleanly.
Tests
Three new regression tests in `vault::sync::sync_tests`:
Out of scope (separate audits / PRs)
Refs #2705.
Summary by CodeRabbit
Improvements
Tests