feat(ingest): cursor_sqlite format — ingest Cursor IDE chat history from state.vscdb (RFC #361)#377

Merged: eric-tramel merged 4 commits into main from feat/cursor-sqlite-ingest on Jun 12, 2026
Conversation

@eric-tramel eric-tramel commented Jun 12, 2026

Owner

Implements RFC #361: ingest Cursor IDE chat history directly from state.vscdb (SQLite) as a new fully supported cursor_sqlite source format. Cursor agent-transcript JSONL support (#363) covers cursor-agent CLI sessions; this adds the IDE-side composer/bubble history that only exists in SQLite, completing Cursor harness coverage.

What changed

New SQLite polling engine (crates/moraine-ingest-core/src/sqlite_poll.rs)

  • Hash-cursor change detection per RFC #361: cursorDiskKV has no watermark column, so the checkpoint carries a per-key content-hash map (cursor_json) plus a db/wal/shm stat fingerprint (nanosecond mtimes) for cheap no-change short-circuits.
  • Read-only opens, never checkpoints another app's WAL (query_only=ON, 500ms busy timeout). Cleanly-closed WAL databases on read-only media (e.g. the dev sandbox's :ro bind mount) are retried with immutable=1 — safe exactly because the WAL sidecars are absent. The refusal surfaces as SQLITE_CANTOPEN on Linux and SQLITE_READONLY_DIRECTORY on macOS; both retry.
  • Values are read by storage class (TEXT/BLOB/NULL all accepted): real Cursor writes JSON as TEXT despite the BLOB-declared column. Caught by integration testing against real host databases — synthetic fixtures originally stored BLOBs and masked it; fixtures now match reality.
  • Paged prefix scans (512 rows / 32 MiB per page, whichever first) with per-page record synthesis, so raw value bytes (Cursor stores ~1.2 MB base64 screenshots inside tool results, doubled by toolCallBinary) never accumulate beyond one page.
  • 10k relevant-key ceiling (RFC #361) enforced both as a pre-count and against the actual scan, with a clear "use JSONL transcripts instead" error.
  • Failure modes (sqlite_open_error, sqlite_schema_mismatch, sqlite_cursor_too_large, sqlite_scan_error) emit one ingest_errors row per state change — repeated identical failures send nothing (no checkpoint churn), and error checkpoints never advance last_offset, so a failed poll can never out-rank a pending success checkpoint in the sink's flush window.
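
The hash-cursor idea above can be sketched in a few lines. This is an illustrative Python model, not the code in sqlite_poll.rs: the function names are hypothetical, and SHA-256 is an assumption (the PR does not say which hash the engine uses).

```python
import hashlib
import os


def stat_fingerprint(db_path):
    """Cheap (size, mtime_ns) tuple over the db file and its WAL sidecars."""
    parts = []
    for suffix in ("", "-wal", "-shm"):
        try:
            st = os.stat(db_path + suffix)
            parts.append((st.st_size, st.st_mtime_ns))
        except FileNotFoundError:
            parts.append(None)  # an absent sidecar is part of the fingerprint
    return tuple(parts)


def changed_keys(rows, previous_hashes):
    """Return (changed_keys, new_hash_map) for an iterable of (key, value) rows.

    SHA-256 is an assumption made for this sketch.
    """
    new_hashes = {}
    changed = []
    for key, value in rows:
        data = value.encode("utf-8") if isinstance(value, str) else (value or b"")
        digest = hashlib.sha256(data).hexdigest()
        new_hashes[key] = digest
        if previous_hashes.get(key) != digest:
            changed.append(key)
    return changed, new_hashes
```

A poll would first compare `stat_fingerprint` against the stored one and skip the scan entirely when it matches; only otherwise does it hash rows and emit the keys returned by `changed_keys`.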

Stable logical event identity (sources/cursor.rs)

  • UIDs for mutable DB rows can't be derived by hashing raw_json (RFC #361 §7). Bubble/composer events mint UIDs from logical identity (cursor_sqlite:bubble:{composer}:{bubble} etc.) and always stamp creation time, so a mutated row re-emits byte-identical sort keys and ReplacingMergeTree collapses versions. Verified live in e2e: rewriting a composer + appending a bubble leaves exactly one session_meta row.
  • Rows without a parseable createdAt are deferred (no event, no error spam) until Cursor writes one — placeholder timestamps would strand permanent epoch-dated duplicates in the sort key.
  • Tri-modal assistant bubbles (text / capabilityType 30 thinking / 15 toolFormerData) → message / reasoning / tool_call+tool_result events; toolCallBinary dropped; embedded-JSON params/results parsed; >64 KiB strings elided; tool_call_id takes the first line of Cursor's newline-joined id pair; degenerate mcp-- tool names repaired from rawArgs.
  • Ghost composers (Cursor pre-creates a composerData shell per window: no headers, no name) are skipped — 12 of 15 composers in the real host DB are shells.
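
A minimal sketch of the logical-identity rule, assuming createdAt is either epoch milliseconds or an ISO-8601 string (the PR only says "parseable"). `parse_created_at` and `bubble_event` are hypothetical names, and the real normalizer in sources/cursor.rs may derive the final UID differently from the plain string used here.

```python
from datetime import datetime, timezone


def parse_created_at(value):
    """Return an aware datetime, or None when the value is unparseable."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        try:
            return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
        except (OverflowError, OSError, ValueError):
            return None
    if isinstance(value, str):
        try:
            parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
        except ValueError:
            return None
        return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)
    return None


def bubble_event(composer_id, bubble_id, row):
    """Build an event keyed by logical identity, or None to defer the row."""
    created = parse_created_at(row.get("createdAt"))
    if created is None:
        return None  # deferred: no event and no error until createdAt appears
    return {
        # The UID depends only on identity, never on the mutable payload.
        "event_uid": f"cursor_sqlite:bubble:{composer_id}:{bubble_id}",
        "event_ts": created,
        "text": row.get("text", ""),
    }
```

Because the UID and timestamp ignore the payload, a bubble whose text grows while streaming yields the same sort key on every re-emission.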

Watcher/dispatch integration

  • state.vscdb-wal/-shm sidecar events map to the canonical DB path and coalesce (WAL-mode writes often touch only sidecars).
  • Symlink canonicalization for cursor_sqlite paths only (macOS FSEvents reports /private/var/... for /var/... watch roots, which split one DB into two checkpoint identities). Deliberately not applied to existing file-backed formats: their checkpoint keys and event UIDs embed source_file, and re-keying long-standing symlinked sources (dotfiles-managed ~/.codex etc.) would orphan checkpoints and duplicate history. Regression-tested both ways.
  • cursor_sqlite requires harness = "cursor" (validated at config load; .vscdb format inference is harness-gated the same way the hermes session_json inference is).
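
The sidecar mapping can be illustrated with a small Python model. The helper names are hypothetical and this is not the watcher code itself; it only shows how -wal/-shm events fold onto one database path.

```python
SIDECAR_SUFFIXES = ("-wal", "-shm")


def canonical_db_path(event_path):
    """Map a state.vscdb-wal / state.vscdb-shm event to the database path."""
    for suffix in SIDECAR_SUFFIXES:
        if event_path.endswith(".vscdb" + suffix):
            return event_path[: -len(suffix)]
    return event_path


def coalesce(event_paths):
    """Collapse a burst of watcher events to unique database paths, in order."""
    return list(dict.fromkeys(canonical_db_path(p) for p in event_paths))
```

A WAL-mode write that touches only the sidecars therefore still schedules exactly one poll of the database it belongs to.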

Schema (sql/015_sqlite_checkpoint_cursor.sql)

  • ingest_checkpoints gains cursor_json String, source_fingerprint UInt64, schema_fingerprint UInt64 (additive, IF NOT EXISTS, default-valued). Pre-migration ingestors degrade gracefully: a startup warning explains cursors won't persist until migrate and restart.
  • Transcript fetch dedups by event_uid (LIMIT 1 BY, newest event_version wins). events is partitioned by toYYYYMM(ingested_at) and ReplacingMergeTree never collapses across partitions; cursor_sqlite makes mutation-re-emit a steady-state path (a session resumed next month re-emits its composer row), so the read side now renders each event exactly once. See "Known limitations".
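
To make the read-side dedup concrete, here is a Python model of "one row per event_uid, newest event_version wins". It mirrors the described effect of LIMIT 1 BY; the actual query is not shown in this PR description, and `dedup_newest` is a hypothetical name.

```python
def dedup_newest(rows):
    """Keep one row per event_uid, preferring the highest event_version."""
    best = {}
    for row in rows:
        current = best.get(row["event_uid"])
        if current is None or row["event_version"] > current["event_version"]:
            best[row["event_uid"]] = row
    return list(best.values())
```

Two copies of the same composer row ingested in different months land in different partitions and never merge in storage, but a fetch filtered this way still renders the event once.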

Operational impact

  • Migration 015 (additive ALTERs) applies on moraine up / moraine db migrate.
  • Default on: the cursor-sqlite source ships enabled in both default_sources() (platform-aware: ~/Library/Application Support/Cursor/User on macOS, ~/.config/Cursor/User elsewhere) and the config/moraine.toml template. Machines without Cursor simply match nothing. Configs with their own explicit [[ingest.sources]] are unaffected. Cursor has no stable local-database contract — when the schema drifts, moraine reports rate-limited sqlite_* errors and skips the database rather than ingesting bad rows, and the normalizer gets updated to follow the format; tracking that drift is moraine's job. Two template-guard tests pin the default-on state. Docs: format table/matrix + a new "SQLite-Polled Sources" section in docs/development/ingest-sources.md; README/docs harness tables now list Cursor (pre-existing gap).
  • Dev sandbox mounts the host's Cursor state dir read-only via --mount-host-sessions (SANDBOX_CURSOR_STATE_DIR, default ~/Library/Application Support/Cursor/User).
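
For reference, the platform-aware default amounts to the following. This is a Python sketch of the rule stated above, with a hypothetical function name; it uses the value of `sys.platform` on macOS ("darwin") as the platform tag, which is a choice made for this example.

```python
from pathlib import PurePosixPath


def default_cursor_state_dir(platform, home):
    """Return Cursor's User state dir for the given platform tag and home dir."""
    home = PurePosixPath(home)
    if platform == "darwin":
        return home / "Library" / "Application Support" / "Cursor" / "User"
    return home / ".config" / "Cursor" / "User"
```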

Scope deviations from RFC #361 (flagged for owner veto)

  • agentKv:blob:* (provider-wire duplicates, 66 MB observed), checkpointId:*, and ItemTable (holds live auth tokens — must never be ingested) are excluded from v1. Composer + bubble keys carry the full conversation content.

Known limitations

  • Cross-month duplicate rows in events/search_documents storage remain possible for sessions mutated across a calendar-month boundary (partition key is toYYYYMM(ingested_at)). The primary transcript fetch dedups at read time (above); v_conversation_trace and search-side dedup need a follow-up (view rebuild migration). This is pre-existing behavior for hermes session_json rewrites; cursor_sqlite makes it more likely, hence the read-side mitigation now.
  • Cursor histories beyond 10k composer/bubble keys are rejected by design (RFC #361 ceiling) with a pointer to JSONL transcripts.

Validation

  • cargo test --workspace --locked — all green (incl. 13 new sqlite_poll unit tests: first-poll emission, no-op short-circuit, stable-UID re-emit on mutation, NULL/BLOB/TEXT storage classes, ghost gating, read-only-directory immutable fallback, schema-mismatch once-only + checkpoint-quiescence, 10k ceiling, createdAt deferral, URI escaping; golden fixture coverage for all bubble modes).
  • make clippy (strict baseline), cargo fmt --all -- --check — clean.
  • bash scripts/ci/e2e-stack.sh — full stack pass, including new cursor-sqlite assertions: 4 raw rows / 5 events / 5 distinct kinds / title extraction / 2 tool_io on first ingest, then a live-update phase (INSERT bubble + rewrite composer against the running stack) asserting 6 events and session_meta still collapsed to 1; MCP smoke passes for all 8 sources.
  • Real-data integration (dev sandbox, host Cursor state mounted read-only): both real composer sessions ingested — 204 and 19 events, titles ("Program functionality explanation", "Cooking ideas inspiration"), 83 tool_call/tool_result pairs with real tool names (read_file_v2, edit_file_v2, MCP browser tools), 16 reasoning events, max payload 10.6 KB (screenshot elision working), zero ingest_errors, 12 ghost composers correctly absent, workspace-storage DBs (no relevant keys) polled cleanly via the immutable fallback.
  • Adversarial multi-agent review over the full diff (5 dimensions, per-finding adversarial verification): 18 confirmed findings — all fixed in this branch except the cross-month partition item, which is mitigated at read time and documented above.

Closes #361.

🤖 Generated with Claude Code

eric-tramel and others added 4 commits June 11, 2026 22:26
…e.vscdb

Adds the shared SQLite polling engine from RFC #361 with Cursor's
state.vscdb KV stores as its first consumer:

- new ingest format `cursor_sqlite` (config constant, validation,
  vscdb extension tracking) with -wal/-shm sidecar events mapped to the
  canonical database path for watching/debounce coalescing
- sqlite_poll engine: read-only opens (busy_timeout + query_only),
  schema validation, bounded prefix-range scans over composerData:/
  bubbleId: keys, kv-hash change cursor persisted in the new
  ingest_checkpoints cursor_json column (migration 015), 10k-key
  ceiling, rate-limited sqlite_* ingest_errors
- synthetic cursor_composer/cursor_bubble records normalized by the
  existing cursor harness with stable logical event UIDs so in-place
  bubble mutations (streaming text, tool pending->completed) re-emit
  the same UID and collapse via ReplacingMergeTree(event_version)
- payload hygiene: toolCallBinary dropped, >64KB strings elided
  (screenshots), ghost composer shells skipped, newline-joined
  toolCallIds split, MCP tool-name repair
- graceful degradation when migration 015 has not run yet (column
  detection at startup; cursors simply do not persist)
- unit tests over real temp databases + a cursor_sqlite golden case
  driven by the production synthesis path

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
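
The payload-hygiene rules in the commit above can be sketched as follows. This is an illustrative Python model: the placeholder text for elided strings and the function names are assumptions, and the real normalizer may differ in detail.

```python
MAX_STRING_BYTES = 64 * 1024


def sanitize(value):
    """Drop toolCallBinary and elide oversized strings, recursively."""
    if isinstance(value, dict):
        return {k: sanitize(v) for k, v in value.items() if k != "toolCallBinary"}
    if isinstance(value, list):
        return [sanitize(v) for v in value]
    if isinstance(value, str):
        size = len(value.encode("utf-8"))
        if size > MAX_STRING_BYTES:
            return f"[elided {size} bytes]"  # placeholder format is hypothetical
    return value


def first_tool_call_id(raw_id):
    """Cursor joins two ids with a newline; keep only the first line."""
    return raw_id.split("\n", 1)[0]
```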
…ble work-item paths

- e2e-stack.sh: plant a fixture state.vscdb via python stdlib sqlite3,
  assert normalization (session_meta title, tri-modal bubbles, tool_io
  pairing), then mutate the database live and assert the watcher path
  re-polls it and the re-emitted composer collapses onto its stable UID
- canonicalize work-item paths at every ingestion entry point (watcher
  events, backfill, reconcile): macOS FSEvents reports symlink-resolved
  paths (/private/var vs /var), which previously gave one file two
  checkpoint keys and two sets of event UIDs — a latent bug for any
  symlinked watch root, exposed by the new live-update e2e phase; the
  e2e tmp root is also resolved with pwd -P so --expect-source-file
  assertions match
- sandbox: --mount-host-sessions now also mounts Cursor's User state dir
  read-only (SANDBOX_CURSOR_STATE_DIR) and generates an enabled
  host-cursor-sqlite source for QA against real databases
- docs: configuration format table + examples, ingest-sources SQLite
  engine section, harness-author workflow, supported-harness tables in
  README/docs index (adding the previously missing Cursor row), and a
  commented opt-in template block in config/moraine.toml

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rsarial review

Real-host findings (sandbox, host Cursor state mounted read-only):
- Read kv values by storage class via get_ref: Cursor writes JSON as
  TEXT despite the BLOB-declared column, and rusqlite's Vec<u8> only
  accepts BLOB — every real database failed to scan while BLOB-seeded
  fixtures passed. Fixtures and the e2e planter now store TEXT.
- Retry refused opens with immutable=1: cleanly closed WAL databases on
  read-only media (sandbox :ro mount) cannot recreate the -shm index.
  Refusal is SQLITE_CANTOPEN on Linux, SQLITE_READONLY_DIRECTORY on
  macOS; both retry. Probe after open so open-class failures stop
  surfacing as bogus schema mismatches.
- Log full anyhow chains ({exc:#}) in scan failure rows.
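
A rough Python equivalent of the two real-host fixes, for readers who want to reproduce them with the stdlib sqlite3 module. The column names key and value are assumptions about cursorDiskKV, and the fallback here retries on any OperationalError when no sidecars exist, which is a simplification of the specific SQLITE_CANTOPEN / SQLITE_READONLY_DIRECTORY checks described above.

```python
import os
import sqlite3
from urllib.parse import quote


def open_read_only(db_path, immutable=False):
    """Open via a SQLite URI in read-only mode with a 500 ms busy timeout."""
    uri = "file:" + quote(db_path) + "?mode=ro"
    if immutable:
        uri += "&immutable=1"
    conn = sqlite3.connect(uri, uri=True, timeout=0.5)
    conn.execute("PRAGMA query_only = ON")
    return conn


def open_with_fallback(db_path):
    """Probe after open; retry with immutable=1 only when sidecars are absent."""
    conn = None
    try:
        conn = open_read_only(db_path)
        conn.execute("SELECT count(*) FROM sqlite_master").fetchone()
        return conn
    except sqlite3.OperationalError:
        if conn is not None:
            conn.close()
        if os.path.exists(db_path + "-wal") or os.path.exists(db_path + "-shm"):
            raise  # immutable=1 is not safe while a WAL may hold live data
        return open_read_only(db_path, immutable=True)


def read_value(conn, key):
    """Read a value as bytes regardless of its storage class."""
    row = conn.execute(
        "SELECT typeof(value), value FROM cursorDiskKV WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        return None
    storage_class, value = row
    if storage_class == "text":
        return value.encode("utf-8")
    if storage_class == "blob":
        return bytes(value)
    if storage_class == "null":
        return None
    raise TypeError(f"unexpected storage class {storage_class!r} for {key!r}")
```

Branching on `typeof(value)` is what lets TEXT rows written into a BLOB-declared column pass through instead of failing the scan.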

Adversarial review findings:
- Scope symlink canonicalization to cursor_sqlite paths only: blanket
  canonicalization re-keyed checkpoints and event UIDs for existing
  file-backed sources behind symlinks (one-time full re-ingest plus
  permanent duplicate history). Regression-tested both ways.
- Error checkpoints no longer advance last_offset (they could out-rank
  a same-flush-window success checkpoint and discard its hash cursor),
  and a repeated identical failure sends no checkpoint at all (a
  permanently failing DB inserted a row per reconcile tick forever).
- Synthesize records page-by-page with a 32 MiB page byte cap instead
  of buffering every changed value's raw bytes (screenshot-heavy DBs
  put multiple GB in RSS under the old shape).
- Defer composer/bubble rows without a parseable createdAt: placeholder
  timestamps spam timestamp_parse_error per re-emission and strand a
  permanent epoch-dated duplicate once the real value appears.
- Dedup the transcript fetch by event_uid (newest event_version wins):
  events partitions by toYYYYMM(ingested_at) and ReplacingMergeTree
  never collapses across partitions, so cross-month re-emission (the
  cursor_sqlite steady state) rendered duplicate entries forever.
- Enforce the 10k key ceiling against the actual scan too, and align
  the count query's range with the scan (> vs >=).
- Gate .vscdb format inference on harness=cursor and reject
  cursor_sqlite sources with any other harness at config load.
- Warn that migration 015 needs an ingest restart to take effect, and
  log dropped non-canonical work items.
- Drop the never-evaluated --expect-source-file flag from mcp_smoke.py
  and stop citing it as the reason for e2e tmp_root symlink resolution.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
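
The paged prefix scan and the aligned count query from the commit above can be modeled like this. It is a Python sketch under assumptions: the key and value column names are guesses, the prefix is assumed to be ASCII, and the function names are hypothetical.

```python
import sqlite3

PAGE_ROWS = 512
PAGE_BYTES = 32 * 1024 * 1024
SELECT = "SELECT key, value FROM cursorDiskKV WHERE "


def prefix_upper_bound(prefix):
    """Smallest string above every key starting with an ASCII prefix."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def byte_len(value):
    if value is None:
        return 0
    return len(value.encode("utf-8")) if isinstance(value, str) else len(value)


def count_prefix(conn, prefix):
    """Pre-count over the same half-open range [prefix, upper) as the scan."""
    sql = "SELECT count(*) FROM cursorDiskKV WHERE key >= ? AND key < ?"
    return conn.execute(sql, (prefix, prefix_upper_bound(prefix))).fetchone()[0]


def scan_prefix(conn, prefix, page_rows=PAGE_ROWS, page_bytes=PAGE_BYTES):
    """Yield lists of (key, value), capped by rows or bytes, whichever first."""
    upper = prefix_upper_bound(prefix)
    last_key = None
    while True:
        if last_key is None:
            cur = conn.execute(SELECT + "key >= ? AND key < ? ORDER BY key", (prefix, upper))
        else:
            # Keyset pagination: resume strictly after the last key seen.
            cur = conn.execute(SELECT + "key > ? AND key < ? ORDER BY key", (last_key, upper))
        page, size = [], 0
        for key, value in cur:
            page.append((key, value))
            size += byte_len(value)
            last_key = key
            if len(page) >= page_rows or size >= page_bytes:
                break
        if not page:
            return
        yield page
```

Each page holds at least one row, so a single oversized value cannot stall the scan, and a caller that synthesizes records per page never keeps more than one page of raw bytes alive.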
Ship the cursor-sqlite source enabled in both default_sources()
(platform-aware state dir: ~/Library/Application Support/Cursor/User on
macOS, ~/.config/Cursor/User elsewhere) and the config/moraine.toml
template. Cursor has no stable local-database contract, but tracking
format drift is moraine's job: on schema change the poller reports
rate-limited sqlite_* errors and skips the database until the
normalizer follows the new format, rather than ingesting bad rows.

Machines without Cursor match nothing; configs with explicit sources
are unaffected. Template-guard tests pin the default-on state and that
the shipped template parses.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@eric-tramel

Owner Author

Flipped cursor_sqlite to default on per review (6d06d64): the source now ships enabled in default_sources() (platform-aware Cursor state dir) and in the config/moraine.toml template, with guard tests pinning both. Schema-drift posture stays the same — rate-limited sqlite_* errors + skip, then patch the normalizer to follow the format. PR description updated to match.

🤖 Generated with Claude Code

@eric-tramel eric-tramel merged commit 410073d into main Jun 12, 2026
2 checks passed
Development

Successfully merging this pull request may close these issues.

RFC: add efficient SQLite polling support for database-backed harnesses
