Skip to content

memory::ops tests flake under cargo-llvm-cov from shared global client (queue residue + SQLite write contention) #2722

@sanil-23

Description

@sanil-23

Summary

memory::ops unit tests intermittently fail under Rust Core Coverage (cargo-llvm-cov) with two signatures — an inflated IngestionState.queue_depth, and a tool_rule_put DB write erroring mid-test. Both stem from memory::ops tests sharing one process-global memory client whose async lifecycle isn't fully gated by the test lock.

Problem

Observed (CI, PR #2717 / #2721 runs):

  • memory::ops::sync::tests::memory_ingestion_status_reflects_initialized_client_snapshotassert_eq!(queue_depth, 1) got 2. (addressed by fix(test): make memory ingestion-status test residue-robust (queue_depth delta) #2721)
  • memory::ops::tool_memory::tests::tool_rules_for_prompt_sorts_by_priority_and_tool_name — second tool_rule_put(...).expect("put normal") returned Err (the first put succeeded). (still open — this issue)

Both tests pass deterministically in isolation; they only fail under cargo-llvm-cov's slower parallel timing, and intermittently (~1/9500).

Root cause: All memory::ops tests share one process-global memory client — one SQLite DB and one IngestionState — pinned by ensure_shared_memory_client (memory/ops/test_support.rs). GLOBAL_MEMORY_TEST_LOCK (memory/ops/mod.rs) serializes the test bodies, but it does not cover:

  1. IngestionState.queue_depth (an AtomicUsize) — never reset between tests, so residue accumulates. (mitigated for the sync test in fix(test): make memory ingestion-status test residue-robust (queue_depth delta) #2721 via a delta assertion; the underlying residue remains.)
  2. The shared SQLite connection / client's async lifecycle / background ingestion — a write can hit transient contention (SQLITE_BUSY/IOERR) when work from a prior test outlives the body-lock. This is the tool_memory .expect("put normal") failure.

Steps to reproduce: non-deterministic; runs cargo llvm-cov (full core test suite, parallel) repeatedly. Platform: CI Linux runner (cargo-llvm-cov), toolchain 1.93.0. Not seen on the plain cargo test runner.

Solution (optional)

Candidate fixes (each has trade-offs — pick per maintainer judgement):

  • Reset the shared IngestionState / drain in-flight ingestion at the start of each locked test, so each test begins from a clean global.
  • Quiesce/await the shared client's background ingestion under GLOBAL_MEMORY_TEST_LOCK so async work cannot bleed past the body-lock.
  • Per-test isolated client (fights the deliberate shared-client design in ensure_shared_memory_client).
  • SQLite busy-timeout/retry on memory-store writes (a production change; there is prior art in whatsapp_data/sqlite_retry.rs).

Note: #2649 added GLOBAL_MEMORY_TEST_LOCK for memory::global, and #2712 pinned OPENHUMAN_WORKSPACE under TEST_ENV_LOCK for documents tests — this is a remaining gap on the memory::ops client lifecycle/queue axis.

Acceptance criteria

  • Repro gonememory::ops tests no longer flake under cargo-llvm-cov (validate by repeated CI runs / a stress loop).
  • Regression safety — the chosen isolation/reset mechanism is covered so the bleed can't return.
  • Diff coverage ≥ 80% — fix PR meets the changed-lines gate.
  • The tool_memory .expect("put normal") contention failure specifically no longer occurs.

Related

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions