Skip to content

fix(vault): sync writes to memory_tree, not legacy UnifiedMemory (#2705)#2720

Open
justinhsu1477 wants to merge 4 commits into
tinyhumansai:mainfrom
justinhsu1477:fix/vault-sync-silent-failure
Open

fix(vault): sync writes to memory_tree, not legacy UnifiedMemory (#2705)#2720
justinhsu1477 wants to merge 4 commits into
tinyhumansai:mainfrom
justinhsu1477:fix/vault-sync-silent-failure

Conversation

@justinhsu1477
Copy link
Copy Markdown
Contributor

@justinhsu1477 justinhsu1477 commented May 27, 2026

Closes #2705.

Root cause

vault::sync::sync_vault walked the directory, called doc_ingest → `MemoryClient::ingest_doc` → `UnifiedMemory::ingest_document` → `memory_docs` table. The UI's "synced" message was technically correct (UnifiedMemory accepted the writes), but every modern retrieval surface — `memory.search`, `tree.read_chunk`, `tree.browse`, the agent's recall path, the summary-tree builder — reads from the memory-tree backend (`mem_tree_chunks` + `mem_tree_ingested_sources`).

Vault data was invisible to all of them.

That explains the entire bug report:

Symptom Explanation
`mem_tree_chunks` = 0 vault sync never wrote there
`mem_tree_ingested_sources` = 0 same
"Memory wiped: 0 rows removed" wipe clears memory_tree, which was already empty
"Build summary trees" produces no jobs memory_tree has no chunks to summarise
UI "N file(s) · synced" vault state correctly reported the UnifiedMemory ingest succeeded — to the wrong table
Agent can't recall vault content agent retrieval reads memory_tree

Recent PRs (#2585, #2556, #2574) migrated RAG primitives to memory_tree as the canonical layer; the vault sync path was not migrated alongside.

Fix

`process_file` now calls `memory::ingest_pipeline::ingest_document` directly with a stable `source_id = vault:{vault_id}:{rel_path}`. The pipeline writes to `mem_tree_chunks` + `mem_tree_ingested_sources` — the tables the modern retrieval stack reads from.

Three design choices worth calling out:

  1. Content-update path — the pipeline's `already_ingested` gate is content-blind and the source_id is stable per file path. For real content updates the vault layer drops prior chunks via `memory_store::chunks::store::delete_chunks_by_source` before the re-ingest, otherwise new content gets short-circuited. The vault ledger's `content_hash` check still gates whether we run delete+reingest at all, so untouched files cost zero pipeline work.

  2. Deletion path (Phase 4) — `by_path` entries the walk didn't see now call `delete_chunks_by_source` instead of `doc_delete`. Migration-safe: ledger entries whose stored `document_id` doesn't start with `vault:` (rows persisted before this fix) fall back to a recomputed source_id.

  3. Vault ledger semantic — the `VaultFile.document_id` field now holds the memory-tree source_id. Schema column name unchanged for backward compatibility with persisted rows; only the semantic of what we store changed. Deletion uses the prefix-check above to handle the migration window cleanly.

Tests

Three new regression tests in `vault::sync::sync_tests`:

Test What it pins
`sync_writes_to_memory_tree` The #2705 regression. Creates a vault with two .md files, runs `sync_vault`, asserts `count_chunks` goes up + both source_ids appear in `mem_tree_ingested_sources` + ledger `document_id` starts with `vault:`
`second_sync_with_no_changes_is_idempotent` Re-sync with unchanged content does not duplicate chunks (vault-layer hash dedup guards the pipeline)
`vault_source_id_is_stable_and_namespaced` Unit test on the id format — defends against an accidental rename breaking cross-file isolation

Out of scope (separate audits / PRs)

Refs #2705.

Summary by CodeRabbit

  • Improvements

    • Vault file synchronization now efficiently detects and skips redundant processing of unchanged content.
    • Improved cleanup of deleted vault files from the system.
  • Tests

    • Added regression tests for vault sync reliability and consistency.
    • Enhanced end-to-end test coverage for sync update, deletion, and addition workflows.

Review Change Stack

@justinhsu1477 justinhsu1477 requested a review from a team May 27, 2026 00:57
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 66601b04-60c0-4920-aa97-c86d2cd6567b

📥 Commits

Reviewing files that changed from the base of the PR and between f8b5054 and e7c61b1.

📒 Files selected for processing (2)
  • src/openhuman/vault/sync.rs
  • tests/vault_sync_e2e.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/vault_sync_e2e.rs
  • src/openhuman/vault/sync.rs

📝 Walkthrough

Walkthrough

Vault sync has been rerouted to ingest per-file content through the memory-tree pipeline via stable per-file vault:{vault_id}:{rel_path} source IDs, with SHA-256-based change detection, pre-deletion of stale chunks, and migration support for legacy ledger entries. Regression tests and e2e assertions now validate memory-tree table population and idempotency.

Changes

Memory-tree migration and regression fix

Layer / File(s) Summary
Module setup and stable source ID helper
src/openhuman/vault/sync.rs (1–59)
Module documentation describes the memory-tree ingestion path and stable vault:{vault_id}:{rel_path} source-id semantics; vault_source_id(vault_id, rel_path) helper introduced.
Per-file data structures and process_file signature
src/openhuman/vault/sync.rs (132–139, 160–165)
FileToProcess struct reshaped to retain only memory-tree-specific fields (vault_id, prev_hash); process_file signature updated to accept Arc<Config> for routing through the memory-tree pipeline.
Per-file ingestion and change detection
src/openhuman/vault/sync.rs (179–296)
Compute SHA-256 content hash and return Unchanged when content matches prior hash; derive stable source_id and preemptively delete prior memory-tree chunks when content differs; build DocumentInput with vault/ext tags and call ingest_pipeline::ingest_document; map results to IngestFileResult with document_id set to the stable source_id.
Discovery wiring and concurrent ingestion
src/openhuman/vault/sync.rs (468–473, 491–498)
Discovery emits FileToProcess records with prev_hash and vault_id; concurrent ingestion shares a cloned Config via Arc so each worker calls process_file(Arc::clone(&config), file) for each candidate.
Memory-tree file deletion with legacy fallback
src/openhuman/vault/sync.rs (550–637)
Deletion driven by memory-tree delete_chunks_by_source; for pre-#2705 ledger rows with non-vault: prefixed document_id, recomputes stable vault_source_id for chunk deletion and performs best-effort doc_delete for UnifiedMemory migration cleanup.
Regression and idempotency tests
src/openhuman/vault/sync.rs (694–894)
New sync_tests module verifies mem_tree_chunks and mem_tree_ingested_sources population, validates ledger document_id encodes stable memory-tree source_ids, confirms unchanged re-sync is idempotent, and includes a unit test for vault_source_id determinism and namespacing.
E2E test alignment with memory-tree behavior
tests/vault_sync_e2e.rs (5–12, 22–23, 56, 117–146, 185–207)
Updated vault_sync_roundtrip_updates_memory_and_ledger e2e test to assert memory-tree ingestion: post-first-sync validates chunk population and mem_tree_ingested_sources registration for vault:<vault_id>:<path> source IDs; post-second-sync validates update/delete/add lifecycle under memory-tree semantics.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • tinyhumansai/openhuman#1994: Both PRs modify the per-file ingestion and deduplication logic in vault sync, including use of stable per-file vault-scoped identifiers for tracking content and avoiding duplicates.

Suggested labels

rust-core

Suggested reviewers

  • graycyrus
  • M3gA-Mind

Poem

🐰 From memory's old attic to the tree so bright,
Files synced with hashes, stable paths in sight,
No duplication blooms where vault: IDs flow,
The ledger learns to skip what hasn't changed below! 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: routing vault sync from UnifiedMemory (legacy) to memory_tree (modern retrieval backend), directly addressing issue #2705.
Linked Issues check ✅ Passed The PR comprehensively addresses #2705: vault sync now writes to memory_tree tables (mem_tree_chunks, mem_tree_ingested_sources) using stable source_ids, making synced content visible to retrieval, tree browsing, summaries, and agent recall.
Out of Scope Changes check ✅ Passed All changes are scoped to vault sync routing and memory-tree integration; deletions cleanup, content-hash skipping, and test regression suite are all directly tied to issue #2705 objectives.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added memory Memory store, memory tree, recall, summarization, and embeddings in src/openhuman/memory/. bug labels May 27, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/openhuman/vault/sync.rs (1)

136-136: 💤 Low value

Unused field existing_doc_id after migration.

The existing_doc_id field is populated at line 452 but never read in the new process_file implementation. This appears to be a remnant from the legacy doc_ingest path.

🧹 Remove unused field
 struct FileToProcess {
     rel_path: String,
     title: String,
     path: PathBuf,
     mtime_ms: i64,
     bytes: u64,
     ext: String,
     /// Content hash from the previous successful sync, for secondary dedup.
     prev_hash: Option<String>,
-    /// Document ID to update on re-ingest (keeps embedding lineage stable).
-    existing_doc_id: Option<String>,
     /// Memory namespace (`vault:<id>`).
     namespace: String,
     /// Vault id for tags and state updates.
     vault_id: String,
 }

And at line 452:

-            existing_doc_id: prev.map(|p| p.document_id.clone()),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/vault/sync.rs` at line 136, The field existing_doc_id in the
struct (declared as existing_doc_id: Option<String>) is no longer used by the
new process_file implementation; remove the unused field and any assignments to
it (the population at the site where existing_doc_id is set around the former
doc_ingest path, e.g., the code at the location that sets existing_doc_id near
line 452) and clean up related variables/usages so the struct and process_file
logic only include necessary fields; ensure compilation by removing or
refactoring any code that referenced existing_doc_id (search for
existing_doc_id, its assignment, and any related comments) and run cargo
build/tests to confirm no remaining references.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/openhuman/vault/sync.rs`:
- Line 136: The field existing_doc_id in the struct (declared as
existing_doc_id: Option<String>) is no longer used by the new process_file
implementation; remove the unused field and any assignments to it (the
population at the site where existing_doc_id is set around the former doc_ingest
path, e.g., the code at the location that sets existing_doc_id near line 452)
and clean up related variables/usages so the struct and process_file logic only
include necessary fields; ensure compilation by removing or refactoring any code
that referenced existing_doc_id (search for existing_doc_id, its assignment, and
any related comments) and run cargo build/tests to confirm no remaining
references.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9d4a2856-3a34-4876-93db-6363db338593

📥 Commits

Reviewing files that changed from the base of the PR and between 0fddf11 and 0181448.

📒 Files selected for processing (1)
  • src/openhuman/vault/sync.rs

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 27, 2026
justinhsu1477 added a commit to justinhsu1477/openhuman that referenced this pull request May 27, 2026
…tion

CodeRabbit nitpick on tinyhumansai#2720: `existing_doc_id` was populated from the
prior-sync ledger row for the legacy `doc_ingest` path (used as the
update key for `memory_docs`), but the memory_tree pipeline keys on
`source_id` directly so the field has no reader after the migration.

While here, dropping `namespace` for the same reason — the legacy
`IngestDocParams.namespace = "vault:<id>"` no longer flows anywhere
because the new path encodes the same scope into the stable
`source_id = vault:{vault_id}:{rel_path}` and tags chunks with
`vault:{vault_id}` directly.

No behavior change — purely struct-side cleanup. 3/3 vault::sync
regression tests still pass.
@coderabbitai coderabbitai Bot added the working A PR that is being worked on by the team. label May 27, 2026
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 27, 2026
justinhsu1477 added a commit to justinhsu1477/openhuman that referenced this pull request May 27, 2026
…ansai#2720)

Pre-existing integration test `vault_sync_roundtrip_updates_memory_and_ledger`
asserted the old UnifiedMemory behavior via `list_documents(namespace)`
— which was the silent-failure surface this PR set out to fix. After
the memory_tree migration in tinyhumansai#2720 (commit 0181448), vault sync no
longer writes to `memory_docs`, so the legacy assertion was guaranteed
to fail.

Replaced the document-list assertions with direct memory_tree probes:

* First sync — assert `count_chunks(&config) > 0` and
  `is_source_ingested(config, SourceKind::Document, "vault:{id}:{path}")`
  for both files. Also pin the ledger contract: `document_id` now
  encodes the memory-tree source_id (prefix `vault:`).

* Second sync — pin the full lifecycle:
  - `notes/one.md` content-updated → still registered (delete + re-ingest
    via `delete_chunks_by_source` then `ingest_pipeline::ingest_document`).
  - `docs/two.json` file removed → no longer registered (Phase 4 must
    delete from `mem_tree_ingested_sources`).
  - `docs/three.toml` new file → freshly registered.

Removed the now-dead `documents_from_payload` helper since no caller
remains. `memory_global::init` is still called in setup (initialises
the in-process memory client the way a real launch would).

Verification: `cargo test --test vault_sync_e2e` passes; full lib
suite `cargo test --lib vault` 28/28 pass.
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 27, 2026
@YellowSnnowmann
Copy link
Copy Markdown
Contributor

justinhsu1477 could you please fix the tests and then push?

justinhsu1477 added a commit to justinhsu1477/openhuman that referenced this pull request May 27, 2026
…l_sync pollution

`memory_ingestion_status_reflects_initialized_client_snapshot` was
asserting `status.queue_depth == 1` against the process-global
`IngestionState`, on the assumption that the lock acquired at the top
of the test guarantees a clean baseline.

The two sibling tests (`memory_sync_channel_publishes_targeted_event`
and `memory_sync_all_publishes_broadcast_event`) call
`memory_sync_channel` / `memory_sync_all` → `spawn_manual_sync`, which
detaches a `tokio::spawn(...)` background task that runs
`composio::run_connection_sync`. That detached task can enqueue an
ingestion job via `MemoryClient::store_skill_sync` →
`IngestionQueue::submit` → `state.enqueue()`. By the time this test
acquires `GLOBAL_MEMORY_TEST_LOCK`, that background work may already
have bumped the global counter — so the equality assertion fires
with `left: 2, right: 1`.

CI hit this on tinyhumansai#2720 (\"Rust Core Tests + Quality\") despite the lock
because the lock only serialises against tests that *also* acquire it
— not the detached worker that was spawned earlier and is still
draining. The same flake is reachable on `upstream/main` in CI with
a parallel-test ordering that puts the snapshot test directly after
one of the sync-channel tests; my PR's added vault sync tests didn't
cause it, but they shifted the ordering enough to expose it.

The principled fix is to snapshot the baseline `queue_depth` at the
start of the test and assert the delta. That keeps the test's
contract intact (an `enqueue + mark_running` round-trip must surface
in the snapshot) without depending on the rest of the suite leaving
the global state pristine.

Targeted run on the failure window passes:

  cargo test --lib -- memory::ops::sync vault::sync
  → 9/9 pass
@justinhsu1477
Copy link
Copy Markdown
Contributor Author

@YellowSnnowmann thanks — pushed `f8b50543` about 5 min before your ping.

The failing test (`memory_ingestion_status_reflects_initialized_client_snapshot`) was asserting `queue_depth == 1` against the process-global `IngestionState`. Sibling tests in the same module call `spawn_manual_sync`, which detaches a `tokio::spawn(...)` that runs `composio::run_connection_sync` → `MemoryClient::store_skill_sync` → `IngestionQueue::submit` → `state.enqueue()`. That background task can still be draining when the snapshot test acquires `GLOBAL_MEMORY_TEST_LOCK`, so the global counter isn't 0.

Fix is to snapshot the baseline and assert the delta — pre-existing flake on `upstream/main`, the added vault sync tests just shifted the parallel-test ordering enough to expose it. Full reasoning in the commit body.

`cargo test --lib -- memory::ops::sync vault::sync` passes 9/9 locally; CI re-running now.

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 27, 2026
// singleton therefore can't be assumed to start at queue_depth=0.
// Capture the baseline and assert the delta instead.
let baseline_depth = state.snapshot().queue_depth;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛑 Blocker — unresolved merge conflict against main. Merging this branch with current main leaves git conflict markers around this line, so the merged tree won't compile. Both sides add the same let baseline_depth = state.snapshot().queue_depth; de-flake (main landed an equivalent fix independently) — just with different comments. Resolve by keeping one baseline_depth capture and dropping the duplicate + markers before this can merge.

chunks_written,
already_ingested,
);
IngestFileResult::Ingested {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick — false-success edge case. An ingest_document returning Ok(IngestResult { already_ingested: true, chunks_written: 0, .. }) still lands in this Ingested arm and writes a "success" ledger row. The delete-first guard above prevents this on the normal update path, but on a ledger↔memory-tree desync (ledger wiped while the mem_tree_ingested_sources row survives) it would report success while nothing reaches retrieval — the exact false-success mode this PR exists to kill. Consider a log::warn! (or a distinct result) when chunks_written == 0 && already_ingested.

Comment thread src/openhuman/vault/sync.rs Outdated
// pre-ingest cleanup on content updates. Cloning per-task keeps the
// borrow life-cycle simple inside `buffer_unordered` (no `&'a Config`
// bouncing through closures).
let config_for_workers = config.clone();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick — per-file Config clone. config.clone() runs once per candidate file as the stream is polled. buffer_unordered keeps only ~4 alive at a time, so memory is bounded, but a large vault still performs N full Config clones. Wrapping in Arc<Config> and cloning the Arc would make this O(1)-per-file. Minor.

Comment thread src/openhuman/vault/sync.rs Outdated
// ledger rows that pre-date the migration and still carry a legacy
// UnifiedMemory document_id.
let stored_id = prev.document_id.clone();
let source_id = if stored_id.starts_with("vault:") {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question — legacy doc_id fallback is a no-op for genuinely-legacy rows. For ledger rows that pre-date #2705, the stored document_id is a UnifiedMemory id whose data lived in memory_docs, never in memory_tree. Recomputing a vault:{id}:{path} source_id and calling delete_chunks_by_source therefore deletes nothing, and the old UnifiedMemory doc is never cleaned up when such a file is removed. Intentional (deferred to the UnifiedMemory-removal follow-up), or should the fallback branch also doc_delete(prev.document_id) during the migration window so legacy rows don't leak?

Copy link
Copy Markdown
Contributor

@sanil-23 sanil-23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #2720 — fix(vault): sync writes to memory_tree, not legacy UnifiedMemory (#2705)

Walkthrough

This PR migrates the vault sync path off the legacy UnifiedMemory / memory_docs backend onto the memory-tree pipeline (mem_tree_chunks + mem_tree_ingested_sources), which is what every modern retrieval surface actually reads from. process_file now ingests via ingest_pipeline::ingest_document against a stable source_id = vault:{vault_id}:{rel_path}, deletes prior chunks before re-ingesting on content updates (to defeat the content-blind already_ingested gate), and routes vanished-file deletions through delete_chunks_by_source. The diagnosis and the core sync fix are sound and well-tested.

However, this PR cannot merge as-is. Two issues block it: an unresolved merge conflict against current main, and a stale legacy-backend call in the vault-removal/purge path that this migration leaves orphaning data — the same silent-failure class the PR set out to fix.

Changes

File Summary
src/openhuman/vault/sync.rs Core fix: ingest via memory-tree pipeline with stable vault:{id}:{rel} source ids; delete-then-reingest on content change; Phase-4 deletions via delete_chunks_by_source; migration-safe legacy doc_id fallback; 3 new unit tests.
tests/vault_sync_e2e.rs Updated e2e to assert against mem_tree_chunks / mem_tree_ingested_sources instead of list_documents; drops the unused documents_from_payload helper.
src/openhuman/memory/ops/sync.rs Test-only: capture queue_depth baseline and assert delta (de-flake). Currently carries unresolved conflict markers vs main.

Actionable comments (2)

🛑 Blockers

1. src/openhuman/memory/ops/sync.rs:370-387 — Unresolved merge conflict against main

Merging this branch with the current main leaves git conflict markers in this file, so the merged tree does not compile. Both sides introduce the same fix — let baseline_depth = state.snapshot().queue_depth; — with different explanatory comments; main independently landed an equivalent de-flake. Trivial to resolve (keep one capture, drop the duplicate and the markers), but it must be resolved before this PR can merge. See the inline comment for the suggested resolution.

⚠️ Major

2. src/openhuman/vault/ops.rs:116-136vault_remove(purge_memory=true) still purges the legacy backend, orphaning all memory-tree data

Not in the diff, but a direct downstream consequence of this migration. Vault content now lives in mem_tree_chunks / mem_tree_ingested_sources (keyed vault:{id}:{rel_path}), yet the purge path still calls clear_namespace(v.namespace)MemoryClient::clear_namespace, which only touches the legacy vectors / memory_docs tables (memory_store/vectors/store.rs:386) — tables this PR no longer writes to.

Result: removing a vault with purge now deletes nothing meaningful and orphans every mem_tree_chunks row, so retrieval keeps surfacing content from a deleted vault. That is the exact silent-failure class this PR fixes, re-appearing on the removal side. The in-sync Phase-4 deletion only covers per-file vanishing, not whole-vault removal.

The codebase already has the helper and an established pattern for this (composio/ops.rs:560, channels/controllers/ops.rs:565):

use crate::openhuman::memory_store::chunks::store::delete_chunks_by_source_prefix;
use crate::openhuman::memory_store::chunks::types::SourceKind;

let cfg = config.clone();
let prefix = format!("vault:{id}:");
let purged = tokio::task::spawn_blocking(move || {
    delete_chunks_by_source_prefix(&cfg, SourceKind::Document, &prefix)
})
.await;
// (optionally also keep clear_namespace for migration-window legacy rows)

At minimum, acknowledge this in the PR body's "Out of scope" section if you intend to defer it.

Nitpicks (3)

  • src/openhuman/vault/sync.rs:250-270 — A successful ingest_document returning already_ingested: true with chunks_written == 0 is still counted as Ingested and writes a "success" ledger row. The delete-first guard prevents this on the normal path, but a ledger↔memory-tree desync would re-introduce false-success reporting. Consider a log::warn! when chunks_written == 0 && already_ingested. (inline)
  • src/openhuman/vault/sync.rs:474-476Config is cloned once per candidate file; Arc<Config> would avoid N copies on large vaults. (inline)
  • tests/vault_sync_e2e.rs:117-145 vs src/openhuman/vault/sync.rs:748-766 — inconsistent spawn_blocking convention between the e2e and unit tests for the same is_source_ingested / count_chunks calls; pick one. sync_writes_to_memory_tree could also tighten chunks_after > chunks_before to chunks_before == 0 (fresh tempdir).

Questions for the author (2)

  • src/openhuman/vault/sync.rs:543-552 — for genuinely pre-#2705 ledger rows the recomputed memory-tree source_id has no chunks (data was in memory_docs), so delete_chunks_by_source is a no-op and the legacy doc leaks on deletion. Deferred, or should the fallback also doc_delete the legacy id? (inline)
  • src/openhuman/vault/types.rs:56 — is Vault.namespace still meaningful post-migration? Its only remaining consumer is the (now-misdirected) purge call; fixing #2 by prefix-delete may make it vestigial.

Verified / looks good

  • delete_chunks_by_source clears both mem_tree_chunks and mem_tree_ingested_sources (memory_store/chunks/store.rs:809-832), so the delete-then-reingest content-update path genuinely resets the already_ingested gate. The central design assumption holds.
  • Error handling is fail-closed: a failed pre-reingest delete returns Failed rather than re-ingesting alongside stale chunks; the next sync self-heals.
  • delete_chunks_by_source and the Phase-4 deletion correctly run under spawn_blocking (SQLite is sync I/O).
  • sync_writes_to_memory_tree precisely pins the #2705 regression (chunk count up + both source ids registered + ledger stores the vault: source id); the e2e exercises the full update / delete / add lifecycle.

…yhumansai#2705)

Before this fix, `vault::sync::sync_vault` walked the vault directory,
read each file, and called `doc_ingest` → `MemoryClient::ingest_doc`
→ `UnifiedMemory::ingest_document`. UnifiedMemory wrote to the legacy
`memory_docs` table. The user's "synced" message was therefore
technically correct (the legacy backend accepted the writes), but every
modern retrieval surface — `memory.search`, `tree.read_chunk`,
`tree.browse`, the agent's recall path, the summary-tree builder —
reads from the memory-tree backend (`mem_tree_chunks` +
`mem_tree_ingested_sources`). Vault data was invisible to all of them.

That explains the entire bug report (tinyhumansai#2705):

* `SELECT COUNT(*) FROM mem_tree_chunks` returns 0
  (vault sync never wrote there)
* `SELECT COUNT(*) FROM mem_tree_ingested_sources` returns 0 (same)
* "Memory wiped: 0 rows removed" (wipe clears memory_tree, which was
  already empty)
* "Build summary trees" produces no jobs (memory_tree has no chunks)
* Agent can't recall vault content (agent retrieval reads memory_tree)

Root cause: recent PRs (tinyhumansai#2585, tinyhumansai#2556, tinyhumansai#2574) migrated RAG primitives
to memory_tree as the canonical layer, but the vault sync path was
not migrated alongside it.

## Fix

`process_file` now calls `memory::ingest_pipeline::ingest_document`
directly with a stable `source_id` of the form
`vault:{vault_id}:{rel_path}`. The pipeline writes to
`mem_tree_chunks` + `mem_tree_ingested_sources` — exactly the tables
the modern retrieval stack reads from.

Three additional design choices:

1. **Content-update path** — the pipeline's `already_ingested(
   SourceKind::Document, source_id)` gate is content-blind and the
   source_id is stable per file path. For real content updates the
   vault layer drops prior chunks via
   `memory_store::chunks::store::delete_chunks_by_source` *before*
   the re-ingest, otherwise the new content gets short-circuited. The
   vault ledger's `content_hash` check still gates whether we run the
   delete+reingest at all, so untouched files cost zero pipeline work.

2. **Deletion path** — Phase 4 (`by_path` entries the walk didn't see)
   now calls `delete_chunks_by_source` instead of the legacy
   `doc_delete`. Handles migration too: ledger entries whose stored
   `document_id` doesn't start with `vault:` (rows persisted before
   this fix) fall back to a recomputed source_id rather than failing.

3. **Vault ledger** — the `VaultFile.document_id` field is now the
   memory-tree source_id. Schema column name unchanged for backward
   compat with persisted rows; only the semantic of what we store
   changed. Deletion uses the prefix-check above to handle the
   migration window.

## Tests

Three new regression tests in `vault::sync::sync_tests` (28/28 vault
suite passes):

* `sync_writes_to_memory_tree` — the tinyhumansai#2705 regression. Creates a
  vault with two .md files, runs `sync_vault`, asserts `count_chunks`
  goes up and both source_ids appear in `mem_tree_ingested_sources`.
  Also pins the ledger contract: `document_id` must start with
  `vault:` so the deletion-path prefix check stays correct.

* `second_sync_with_no_changes_is_idempotent` — pins that re-sync
  with unchanged content does not duplicate chunks (the vault-layer
  hash dedup guards the pipeline).

* `vault_source_id_is_stable_and_namespaced` — unit test on the id
  format itself; defends against an accidental rename that would
  break the `already_ingested` gate's cross-file isolation.

`cargo test --lib vault` 28/28 pass.
`cargo check --lib` + `cargo fmt --check` +
`cargo test --tests --no-run` all clean (lesson from tinyhumansai#2603 — must
compile the integration-test target before push).

## Out of scope (separate audits / PRs)

* Composio providers and agent_experience still call `doc_ingest`
  → UnifiedMemory. If they have a similar gap, that's a separate
  audit; this PR is scoped to the vault path the user reported.
* Removing `UnifiedMemory` entirely is the larger follow-up senamakel
  listed on tinyhumansai#2585; out of scope here.
* In-flight tracing during phase 2 / 3 is left at the existing
  `log::debug!` level — the new ingestion log lines along with the
  pre-existing entry/exit logs cover the silent-failure surface the
  issue's reporter would have used to triage.

Closes tinyhumansai#2705.
…tion

CodeRabbit nitpick on tinyhumansai#2720: `existing_doc_id` was populated from the
prior-sync ledger row for the legacy `doc_ingest` path (used as the
update key for `memory_docs`), but the memory_tree pipeline keys on
`source_id` directly so the field has no reader after the migration.

While here, dropping `namespace` for the same reason — the legacy
`IngestDocParams.namespace = "vault:<id>"` no longer flows anywhere
because the new path encodes the same scope into the stable
`source_id = vault:{vault_id}:{rel_path}` and tags chunks with
`vault:{vault_id}` directly.

No behavior change — purely struct-side cleanup. 3/3 vault::sync
regression tests still pass.
…ansai#2720)

Pre-existing integration test `vault_sync_roundtrip_updates_memory_and_ledger`
asserted the old UnifiedMemory behavior via `list_documents(namespace)`
— which was the silent-failure surface this PR set out to fix. After
the memory_tree migration in tinyhumansai#2720 (commit 0181448), vault sync no
longer writes to `memory_docs`, so the legacy assertion was guaranteed
to fail.

Replaced the document-list assertions with direct memory_tree probes:

* First sync — assert `count_chunks(&config) > 0` and
  `is_source_ingested(config, SourceKind::Document, "vault:{id}:{path}")`
  for both files. Also pin the ledger contract: `document_id` now
  encodes the memory-tree source_id (prefix `vault:`).

* Second sync — pin the full lifecycle:
  - `notes/one.md` content-updated → still registered (delete + re-ingest
    via `delete_chunks_by_source` then `ingest_pipeline::ingest_document`).
  - `docs/two.json` file removed → no longer registered (Phase 4 must
    delete from `mem_tree_ingested_sources`).
  - `docs/three.toml` new file → freshly registered.

Removed the now-dead `documents_from_payload` helper since no caller
remains. `memory_global::init` is still called in setup (initialises
the in-process memory client the way a real launch would).

Verification: `cargo test --test vault_sync_e2e` passes; full lib
suite `cargo test --lib vault` 28/28 pass.
graycyrus
graycyrus previously approved these changes May 27, 2026
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[APPROVED]

This is a solid fix for #2705. Vault sync now correctly routes through the memory-tree pipeline (mem_tree_chunks + mem_tree_ingested_sources) instead of the legacy UnifiedMemory backend. The migration is clean: stable per-file source_ids ensure idempotency, content updates properly delete stale chunks before re-ingesting, and pre-fix ledger rows are handled gracefully.

Tests are comprehensive — three unit tests pin the regression and the invariants, plus e2e validation of the full lifecycle. Code patterns are solid (async/blocking correctly separated, error context preserved, logging appropriate).

No issues. Ship it.

…i#2720

Three improvements from @sanil-23's review pass:

1. **Surface ledger↔memory_tree desync as a `warn!` log instead of a
   silent success** (L263). When `ingest_document` returns
   `Ok { already_ingested: true, chunks_written: 0 }` we still land in
   the `Ingested` arm and write a fresh ledger row — but no chunks
   actually reach retrieval. The delete-first guard above prevents
   this on the normal update path, so seeing it means ledger and
   memory_tree are out of sync. That's the exact silent-failure mode
   this PR set out to kill, so it now logs at `warn!` with the
   suggestion for a manual `delete_chunks_by_source` resync.

2. **Share a single `Config` allocation across the buffer_unordered
   workers via `Arc<Config>`** (L474). The previous loop did one
   `config.clone()` per candidate file. With `Arc` a 5k-file vault
   pays one clone + N atomic ref-count bumps instead of N full
   `Config` deep-clones — measurably cheaper on cold backfills.
   Signature change: `process_file(config: Arc<Config>, ...)`.

3. **Run a parallel UnifiedMemory `doc_delete` on the legacy-ledger
   fallback path** (L544). Pre-tinyhumansai#2705 ledger rows store a
   UnifiedMemory `{ts}_{hex}` id whose data lives in `memory_docs`,
   not `mem_tree_*`. Recomputing the memory_tree source_id and
   running `delete_chunks_by_source` deletes nothing on those rows
   — so without a parallel `doc_delete` the legacy data leaked
   until UnifiedMemory removal lands (tinyhumansai#2585 follow-up). The deletion
   path now does both during the migration window so vanished
   files actually go away.

`doc_delete` failures on the legacy path are best-effort: a
404 / already-absent there shouldn't block the canonical
`delete_chunks_by_source` cleanup below.

Tests: 9/9 in `memory::ops::sync` + `vault::sync` pass.
`cargo check --lib`, `cargo fmt --check`, and
`cargo test --tests --no-run` all clean.
@justinhsu1477 justinhsu1477 dismissed stale reviews from graycyrus and coderabbitai[bot] via e7c61b1 May 27, 2026 11:10
@justinhsu1477 justinhsu1477 force-pushed the fix/vault-sync-silent-failure branch from f8b5054 to e7c61b1 Compare May 27, 2026 11:10
@justinhsu1477
Copy link
Copy Markdown
Contributor Author

@sanil-23 thanks for the thorough pass — all four addressed in `e7c61b14` (rebased on `upstream/main` so the #2737 conflict is gone).

🛑 L378 conflict — Resolved by rebase: dropped my standalone `baseline_depth` commit since #2737 already landed the same de-flake with a cleaner `state.reset_for_test()` helper. Kept that, dropped mine.

L263 false-success — Added a `log::warn!` for the `already_ingested == true && chunks_written == 0` branch. The delete-first guard prevents this on the normal update path, so seeing it means ledger ↔ memory_tree desync — exactly the silent mode this PR exists to kill. Log line names the source_id + suggests a manual `delete_chunks_by_source` resync.

L474 per-file Config clone — Switched to `Arc`. `process_file` now takes `Arc`, the call site does one outer `Arc::new(config.clone())` then `Arc::clone(&...)` per candidate. A 5k-file backfill pays one deep clone + N ref-count bumps instead of N full clones.

L544 legacy fallback no-op — You're right, it was silently leaving legacy `memory_docs` rows behind. Migration-window fix: when `document_id` doesn't start with `vault:` (= pre-#2705 ledger row), the deletion path now runs `doc_delete(prev.document_id)` against UnifiedMemory in parallel with `delete_chunks_by_source` against the recomputed memory_tree source_id. `doc_delete` failures are best-effort (a 404 on the legacy path shouldn't block the canonical memory_tree cleanup).

`cargo test --lib -- memory::ops::sync vault::sync` → 9/9 pass.
`cargo check --lib` + `cargo fmt --check` + `cargo test --tests --no-run` all clean.

Pushed as force-with-lease since the rebase was unavoidable.

@coderabbitai coderabbitai Bot added the rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. label May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug memory Memory store, memory tree, recall, summarization, and embeddings in src/openhuman/memory/. rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. working A PR that is being worked on by the team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

v0.54.0 — Vault sync UI shows "synced" but mem_tree_chunks and mem_tree_ingested_sources stay at 0

4 participants