Summary
memory::ops unit tests intermittently fail under Rust Core Coverage (cargo-llvm-cov) with two signatures: an inflated IngestionState.queue_depth, and a tool_rule_put DB write erroring mid-test. Both stem from memory::ops tests sharing one process-global memory client whose async lifecycle isn't fully gated by the test lock.
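Problem
Observed (CI, PR #2717 / #2721 runs):
memory::ops::sync::tests::memory_ingestion_status_reflects_initialized_client_snapshot: assert_eq!(queue_depth, 1) got 2. (addressed by "fix(test): make memory ingestion-status test residue-robust (queue_depth delta)" #2721)
memory::ops::tool_memory::tests::tool_rules_for_prompt_sorts_by_priority_and_tool_name: the second tool_rule_put(...).expect("put normal") returned Err (the first put succeeded). (still open: this issue)
Both tests pass deterministically in isolation; they fail only under cargo-llvm-cov's slower parallel timing, and only intermittently (~1/9500).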
Root cause: All memory::ops tests share one process-global memory client (one SQLite DB and one IngestionState) pinned by ensure_shared_memory_client (memory/ops/test_support.rs). GLOBAL_MEMORY_TEST_LOCK (memory/ops/mod.rs) serializes the test bodies, but it does not cover:
IngestionState.queue_depth (an AtomicUsize): never reset between tests, so residue accumulates. (Mitigated for the sync test in #2721 via a delta assertion; the underlying residue remains.)
The shared SQLite connection / client's async lifecycle / background ingestion: a write can hit transient contention (SQLITE_BUSY/IOERR) when work from a prior test outlives the body-lock. This is the tool_memory.expect("put normal") failure.
Steps to reproduce: non-deterministic; run cargo llvm-cov (full core test suite, parallel) repeatedly.
Platform: CI Linux runner (cargo-llvm-cov), toolchain 1.93.0. Not seen on the plain cargo test runner.
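To make the inflated queue_depth signature concrete, here is a minimal std-only sketch of a counter that lives on a process-global and is never reset. It is not the project's code: the IngestionState shape, the SHARED static, and the queue_depth_delta helper are assumptions made up for illustration, loosely in the spirit of the delta assertion mentioned for #2721.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for the shared IngestionState; only the counter is modeled.
pub struct IngestionState {
    queue_depth: AtomicUsize,
}

impl IngestionState {
    pub const fn new() -> Self {
        Self { queue_depth: AtomicUsize::new(0) }
    }

    /// Record one enqueued item. Nothing in this model ever resets the counter.
    pub fn enqueue(&self) {
        self.queue_depth.fetch_add(1, Ordering::SeqCst);
    }

    pub fn queue_depth(&self) -> usize {
        self.queue_depth.load(Ordering::SeqCst)
    }
}

/// Process-global state shared by every test in the binary (assumed shape).
pub static SHARED: IngestionState = IngestionState::new();

/// Returns how much `queue_depth` grew while `body` ran.
/// Asserting on this delta ignores residue left behind by earlier tests.
pub fn queue_depth_delta(state: &IngestionState, body: impl FnOnce()) -> usize {
    let before = state.queue_depth();
    body();
    state.queue_depth().saturating_sub(before)
}
```

With one item of residue already on SHARED, an absolute assert_eq!(SHARED.queue_depth(), 1) after a single enqueue would see 2, while the delta is still 1. The delta form only hides the residue from one assertion; the counter itself stays dirty for whichever test runs next.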
Solution (optional)
Candidate fixes (each has trade-offs — pick per maintainer judgement):
Reset the shared IngestionState / drain in-flight ingestion at the start of each locked test, so each test begins from a clean global.
Quiesce/await the shared client's background ingestion under GLOBAL_MEMORY_TEST_LOCK so async work cannot bleed past the body-lock.
Per-test isolated client (fights the deliberate shared-client design in ensure_shared_memory_client).
SQLite busy-timeout/retry on memory-store writes (a production change; there is prior art in whatsapp_data/sqlite_retry.rs).
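The first two candidates (reset the shared state, and quiesce background work under the lock) could be combined in a guard like the sketch below. This is a synchronous, std-only model under stated assumptions: the real client is async, and QUEUE_DEPTH, IN_FLIGHT, QuiescedGuard, and spawn_background_ingest are hypothetical names that do not exist in the codebase. Only the lock name is borrowed from the issue.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, MutexGuard};
use std::thread;
use std::time::{Duration, Instant};

/// Name mirrors the issue; the `Mutex<()>` shape is an assumption.
pub static GLOBAL_MEMORY_TEST_LOCK: Mutex<()> = Mutex::new(());
/// Hypothetical globals modeling the shared client's counters.
pub static QUEUE_DEPTH: AtomicUsize = AtomicUsize::new(0);
pub static IN_FLIGHT: AtomicUsize = AtomicUsize::new(0);

/// Poll until no background job is in flight, or give up after `timeout`.
fn wait_for_idle(timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while IN_FLIGHT.load(Ordering::SeqCst) != 0 {
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(Duration::from_millis(1));
    }
    true
}

/// Holds the test lock; the shared state is idle and reset while it is alive.
pub struct QuiescedGuard {
    _lock: MutexGuard<'static, ()>,
}

impl QuiescedGuard {
    pub fn acquire(timeout: Duration) -> Self {
        // Recover from poisoning so one panicking test does not fail the rest.
        let lock = GLOBAL_MEMORY_TEST_LOCK
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner());
        assert!(wait_for_idle(timeout), "background ingestion did not drain");
        QUEUE_DEPTH.store(0, Ordering::SeqCst); // start from a clean global
        Self { _lock: lock }
    }
}

impl Drop for QuiescedGuard {
    fn drop(&mut self) {
        // Drain before the lock field is released, so this test's work cannot
        // bleed into the next one. Best effort: never panic inside Drop.
        let _ = wait_for_idle(Duration::from_secs(5));
    }
}

/// Simulates background ingestion that outlives its caller (illustration only).
pub fn spawn_background_ingest(work: Duration) -> thread::JoinHandle<()> {
    IN_FLIGHT.fetch_add(1, Ordering::SeqCst);
    QUEUE_DEPTH.fetch_add(1, Ordering::SeqCst);
    thread::spawn(move || {
        thread::sleep(work);
        IN_FLIGHT.fetch_sub(1, Ordering::SeqCst);
    })
}
```

A test would take the guard as its first statement and hold it for the whole body. Because Drop::drop runs before the struct's fields are dropped, the exit-side drain happens while the lock is still held.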
Note: #2649 added GLOBAL_MEMORY_TEST_LOCK for memory::global, and #2712 pinned OPENHUMAN_WORKSPACE under TEST_ENV_LOCK for documents tests — this is a remaining gap on the memory::ops client lifecycle/queue axis.
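For the busy-timeout/retry candidate, a generic retry wrapper might look like the following sketch. It does not reproduce whatsapp_data/sqlite_retry.rs, whose contents are not shown here. StoreError and retry_on_busy are hypothetical, and a real store would have to map its driver's error codes onto the transient case. SQLite also has a built-in busy timeout (PRAGMA busy_timeout), which may be the simpler lever.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical error type; a real store would map its driver's error codes.
#[derive(Debug, PartialEq)]
pub enum StoreError {
    /// Transient contention, e.g. SQLITE_BUSY.
    Busy,
    /// Anything that should not be retried.
    Other(String),
}

/// Calls `op` at least once, retrying on `StoreError::Busy` with capped
/// exponential backoff, for at most `max_attempts` calls in total.
pub fn retry_on_busy<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> Result<T, StoreError>,
) -> Result<T, StoreError> {
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Err(StoreError::Busy) if attempt < max_attempts => {
                thread::sleep(delay);
                // Double the wait each time, but never beyond 200 ms.
                delay = (delay * 2).min(Duration::from_millis(200));
                attempt += 1;
            }
            // Success, a non-retryable error, or the final Busy.
            other => return other,
        }
    }
}
```

The trade-off noted in the list still applies: this changes production write behavior, and it masks the contention rather than removing the cross-test bleed that causes it.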
Acceptance criteria
Repro gone: memory::ops tests no longer flake under cargo-llvm-cov (validate by repeated CI runs / a stress loop).
The tool_memory.expect("put normal") contention failure specifically no longer occurs.
Regression safety: the chosen isolation/reset mechanism is covered so the bleed can't return.
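One way to approximate the stress loop in-process is a small harness that runs a test body many times across threads and counts panics. This is a sketch under assumptions: stress is a hypothetical helper, and it does not reproduce cargo-llvm-cov's instrumentation overhead, so repeated coverage runs in CI remain the real validation.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Runs `body` `iterations` times on each of `threads` threads and returns
/// how many runs panicked. `body` receives a unique run index.
pub fn stress<F>(threads: usize, iterations: usize, body: F) -> usize
where
    F: Fn(usize) + Send + Sync + 'static,
{
    let body = Arc::new(body);
    let failures = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let body = Arc::clone(&body);
            let failures = Arc::clone(&failures);
            thread::spawn(move || {
                for i in 0..iterations {
                    let run = t * iterations + i;
                    // Catch the panic so one flaky run does not stop the loop.
                    let outcome = panic::catch_unwind(AssertUnwindSafe(|| (*body)(run)));
                    if outcome.is_err() {
                        failures.fetch_add(1, Ordering::SeqCst);
                    }
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("worker thread should not panic outside catch_unwind");
    }
    failures.load(Ordering::SeqCst)
}
```

Given the reported rate of roughly 1 in 9500, a loop needs tens of thousands of runs before a zero-failure result says much.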
Related
[…] sync ingestion-status assertion (delta vs absolute); partial mitigation.
[…] memory::ops).
[…] memory test-isolation work this builds on.