feat(ingest): cursor_sqlite format: ingest Cursor IDE chat history from state.vscdb (RFC #361) #377
Merged
…e.vscdb
Adds the shared SQLite polling engine from RFC #361 with Cursor's state.vscdb KV stores as its first consumer:
- new ingest format `cursor_sqlite` (config constant, validation, vscdb extension tracking) with -wal/-shm sidecar events mapped to the canonical database path for watching/debounce coalescing
- sqlite_poll engine: read-only opens (busy_timeout + query_only), schema validation, bounded prefix-range scans over composerData: / bubbleId: keys, kv-hash change cursor persisted in the new ingest_checkpoints cursor_json column (migration 015), 10k-key ceiling, rate-limited sqlite_* ingest_errors
- synthetic cursor_composer/cursor_bubble records normalized by the existing cursor harness with stable logical event UIDs so in-place bubble mutations (streaming text, tool pending->completed) re-emit the same UID and collapse via ReplacingMergeTree(event_version)
- payload hygiene: toolCallBinary dropped, >64KB strings elided (screenshots), ghost composer shells skipped, newline-joined toolCallIds split, MCP tool-name repair
- graceful degradation when migration 015 has not run yet (column detection at startup; cursors simply do not persist)
- unit tests over real temp databases + a cursor_sqlite golden case driven by the production synthesis path
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
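The kv-hash change cursor and the bounded prefix-range scans can be sketched roughly as follows. This is an illustration in Python, not the crate's Rust code: the `cursorDiskKV` column names (`key`, `value`), the SHA-256 hash, and the helper names `prefix_range` and `poll` are assumptions made for the example, and the 10k-key ceiling is left out.

```python
import hashlib
import json
import sqlite3

# Key prefixes named in the commit message.
KEY_PREFIXES = ("composerData:", "bubbleId:")


def prefix_range(prefix):
    """Half-open [lo, hi) range covering every key that starts with prefix."""
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)


def poll(conn, cursor_json):
    """Return (changed_keys, new_cursor_json) by diffing per-key content hashes.

    cursor_json is the previously persisted cursor ("" on the first poll).
    """
    previous = json.loads(cursor_json) if cursor_json else {}
    current = {}
    for prefix in KEY_PREFIXES:
        lo, hi = prefix_range(prefix)
        rows = conn.execute(
            "SELECT key, value FROM cursorDiskKV "
            "WHERE key >= ? AND key < ? ORDER BY key",
            (lo, hi),
        )
        for key, value in rows:
            # Values may come back as TEXT (str), BLOB (bytes), or NULL (None).
            raw = value.encode("utf-8") if isinstance(value, str) else (value or b"")
            current[key] = hashlib.sha256(raw).hexdigest()  # hash choice is an assumption
    changed = sorted(k for k, h in current.items() if previous.get(k) != h)
    return changed, json.dumps(current, sort_keys=True)
```

A first call with an empty cursor reports every matching key as changed; a second call with the returned cursor reports nothing until a value is rewritten.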
…ble work-item paths
- e2e-stack.sh: plant a fixture state.vscdb via python stdlib sqlite3, assert normalization (session_meta title, tri-modal bubbles, tool_io pairing), then mutate the database live and assert the watcher path re-polls it and the re-emitted composer collapses onto its stable UID
- canonicalize work-item paths at every ingestion entry point (watcher events, backfill, reconcile): macOS FSEvents reports symlink-resolved paths (/private/var vs /var), which previously gave one file two checkpoint keys and two sets of event UIDs, a latent bug for any symlinked watch root, exposed by the new live-update e2e phase; the e2e tmp root is also resolved with pwd -P so --expect-source-file assertions match
- sandbox: --mount-host-sessions now also mounts Cursor's User state dir read-only (SANDBOX_CURSOR_STATE_DIR) and generates an enabled host-cursor-sqlite source for QA against real databases
- docs: configuration format table + examples, ingest-sources SQLite engine section, harness-author workflow, supported-harness tables in README/docs index (adding the previously missing Cursor row), and a commented opt-in template block in config/moraine.toml
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rsarial review
Real-host findings (sandbox, host Cursor state mounted read-only):
- Read kv values by storage class via get_ref: Cursor writes JSON as
TEXT despite the BLOB-declared column, and rusqlite's Vec<u8> only
accepts BLOB — every real database failed to scan while BLOB-seeded
fixtures passed. Fixtures and the e2e planter now store TEXT.
- Retry refused opens with immutable=1: cleanly closed WAL databases on
read-only media (sandbox :ro mount) cannot recreate the -shm index.
Refusal is SQLITE_CANTOPEN on Linux, SQLITE_READONLY_DIRECTORY on
macOS; both retry. Probe after open so open-class failures stop
surfacing as bogus schema mismatches.
- Log full anyhow chains ({exc:#}) in scan failure rows.
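The two open/read findings above can be illustrated with Python's stdlib `sqlite3`. This is a simplified sketch rather than the rusqlite implementation: the helper names `open_read_only` and `value_bytes` are hypothetical, and the retry here fires on any `OperationalError`, whereas the commit retries only on the specific refusal codes it names.

```python
import sqlite3
from typing import Optional
from urllib.parse import quote


def open_read_only(path):
    """Open a database read-only; retry with immutable=1 if the open is refused."""
    uri = "file:" + quote(path)
    last_error = None
    for params in ("mode=ro", "mode=ro&immutable=1"):
        conn = None
        try:
            # timeout=0.5 gives a 500 ms busy timeout.
            conn = sqlite3.connect(uri + "?" + params, uri=True, timeout=0.5)
            conn.execute("PRAGMA query_only = ON")
            # Probe right after opening so open-class failures surface here,
            # not later as an apparent schema mismatch.
            conn.execute("SELECT count(*) FROM sqlite_master").fetchone()
            return conn
        except sqlite3.OperationalError as exc:
            last_error = exc
            if conn is not None:
                conn.close()
    raise last_error


def value_bytes(value) -> Optional[bytes]:
    """Normalize a KV value by storage class: TEXT, BLOB, or NULL."""
    if value is None:
        return None
    if isinstance(value, str):
        return value.encode("utf-8")  # TEXT stored in a BLOB-declared column
    return bytes(value)
```

Reading by storage class is what lets TEXT-seeded and BLOB-seeded rows pass through the same scan path.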
Adversarial review findings:
- Scope symlink canonicalization to cursor_sqlite paths only: blanket
canonicalization re-keyed checkpoints and event UIDs for existing
file-backed sources behind symlinks (one-time full re-ingest plus
permanent duplicate history). Regression-tested both ways.
- Error checkpoints no longer advance last_offset (they could out-rank
a same-flush-window success checkpoint and discard its hash cursor),
and a repeated identical failure sends no checkpoint at all (a
permanently failing DB inserted a row per reconcile tick forever).
- Synthesize records page-by-page with a 32 MiB page byte cap instead
of buffering every changed value's raw bytes (screenshot-heavy DBs
put multiple GB in RSS under the old shape).
- Defer composer/bubble rows without a parseable createdAt: placeholder
timestamps spam timestamp_parse_error per re-emission and strand a
permanent epoch-dated duplicate once the real value appears.
- Dedup the transcript fetch by event_uid (newest event_version wins):
events partitions by toYYYYMM(ingested_at) and ReplacingMergeTree
never collapses across partitions, so cross-month re-emission (the
cursor_sqlite steady state) rendered duplicate entries forever.
- Enforce the 10k key ceiling against the actual scan too, and align
the count query's range with the scan (> vs >=).
- Gate .vscdb format inference on harness=cursor and reject
cursor_sqlite sources with any other harness at config load.
- Warn that migration 015 needs an ingest restart to take effect, and
log dropped non-canonical work items.
- Drop the never-evaluated --expect-source-file flag from mcp_smoke.py
and stop citing it as the reason for e2e tmp_root symlink resolution.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
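The scoped canonicalization described in this commit, combined with the sidecar mapping from the first commit, might look like the sketch below. It is a Python illustration with a hypothetical `work_item_path` helper, not the watcher's actual code; `os.path.realpath` stands in for whatever symlink resolution the Rust side uses.

```python
import os

SIDECAR_SUFFIXES = ("-wal", "-shm")


def work_item_path(event_path, source_format):
    """Map a watcher event path to the path used as the work-item key."""
    if source_format != "cursor_sqlite":
        # Existing file-backed formats keep the path exactly as reported, so
        # long-standing checkpoint keys and event UIDs are not re-keyed.
        return event_path
    for suffix in SIDECAR_SUFFIXES:
        if event_path.endswith(suffix):
            # A sidecar event counts as an event on the database itself.
            event_path = event_path[: -len(suffix)]
            break
    # Resolve symlinks so /var/... and /private/var/... share one identity.
    return os.path.realpath(event_path)
```

With this shape, a `state.vscdb-wal` event under a symlinked watch root and a direct `state.vscdb` event resolve to the same key, while a symlinked JSONL source keeps its original key.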
Ship the cursor-sqlite source enabled in both default_sources() (platform-aware state dir: ~/Library/Application Support/Cursor/User on macOS, ~/.config/Cursor/User elsewhere) and the config/moraine.toml template. Cursor has no stable local-database contract, but tracking format drift is moraine's job: on schema change the poller reports rate-limited sqlite_* errors and skips the database until the normalizer follows the new format, rather than ingesting bad rows. Machines without Cursor match nothing; configs with explicit sources are unaffected. Template-guard tests pin the default-on state and that the shipped template parses. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
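As a small illustration of the platform-aware default, the state-dir choice could be expressed as below. The function name and parameters are hypothetical (the real logic lives in `default_sources()`); only the two directory paths come from the commit message.

```python
import sys
from pathlib import Path
from typing import Optional


def default_cursor_state_dir(platform: str = sys.platform,
                             home: Optional[Path] = None) -> Path:
    """Return the Cursor User state dir that the default source would watch."""
    base = home if home is not None else Path.home()
    if platform == "darwin":
        return base / "Library" / "Application Support" / "Cursor" / "User"
    return base / ".config" / "Cursor" / "User"
```

On a machine without Cursor the returned directory simply does not exist, which matches the "match nothing" behavior described above.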
Flipped 🤖 Generated with Claude Code
Implements RFC #361: ingest Cursor IDE chat history directly from `state.vscdb` (SQLite) as a new fully supported `cursor_sqlite` source format. Cursor agent-transcript JSONL support (#363) covers `cursor-agent` CLI sessions; this adds the IDE-side composer/bubble history that only exists in SQLite, completing Cursor harness coverage.

What changed

New SQLite polling engine (`crates/moraine-ingest-core/src/sqlite_poll.rs`)
- `cursorDiskKV` has no watermark column, so the checkpoint carries a per-key content-hash map (`cursor_json`) plus a db/wal/shm stat fingerprint (nanosecond mtimes) for cheap no-change short-circuits.
- Read-only opens (`query_only=ON`, 500 ms busy timeout). Cleanly closed WAL databases on read-only media (e.g. the dev sandbox's `:ro` bind mount) are retried with `immutable=1`, which is safe exactly because the WAL sidecars are absent. The refusal surfaces as `SQLITE_CANTOPEN` on Linux and `SQLITE_READONLY_DIRECTORY` on macOS; both retry.
- Values are read by storage class (`TEXT`/`BLOB`/`NULL` all accepted): real Cursor writes JSON as TEXT despite the BLOB-declared column. Caught by integration testing against real host databases; synthetic fixtures originally stored BLOBs and masked it, and fixtures now match reality.
- Records are synthesized page-by-page with a 32 MiB page byte cap, so large values (… `toolCallBinary`) never accumulate beyond one page.
- Rate-limited errors (`sqlite_open_error`, `sqlite_schema_mismatch`, `sqlite_cursor_too_large`, `sqlite_scan_error`) emit one `ingest_errors` row per state change. Repeated identical failures send nothing (no checkpoint churn), and error checkpoints never advance `last_offset`, so a failed poll can never out-rank a pending success checkpoint in the sink's flush window.

Stable logical event identity (`sources/cursor.rs`)
- … `raw_json` into UIDs (RFC #361 §7). Bubble/composer events mint UIDs from logical identity (`cursor_sqlite:bubble:{composer}:{bubble}` etc.) and always stamp creation time, so a mutated row re-emits byte-identical sort keys and `ReplacingMergeTree` collapses versions. Verified live in e2e: rewriting a composer + appending a bubble leaves exactly one `session_meta` row.
- Composer/bubble rows without a parseable `createdAt` are deferred (no event, no error spam) until Cursor writes one: placeholder timestamps would strand permanent epoch-dated duplicates in the sort key.
- Tri-modal bubbles (… `capabilityType` 30 thinking / 15 `toolFormerData`) → `message` / `reasoning` / `tool_call` + `tool_result` events; `toolCallBinary` dropped; embedded-JSON params/results parsed; >64 KiB strings elided; `tool_call_id` takes the first line of Cursor's newline-joined id pair; degenerate `mcp--` tool names repaired from `rawArgs`.
- Ghost composer shells (… `composerData` shell per window: no headers, no name) are skipped: 12 of 15 composers in the real host DB are shells.

Watcher/dispatch integration
- `state.vscdb-wal` / `-shm` sidecar events map to the canonical DB path and coalesce (WAL-mode writes often touch only sidecars).
- Symlink canonicalization is scoped to `cursor_sqlite` paths only (macOS FSEvents reports `/private/var/...` for `/var/...` watch roots, which split one DB into two checkpoint identities). Deliberately not applied to existing file-backed formats: their checkpoint keys and event UIDs embed `source_file`, and re-keying long-standing symlinked sources (dotfiles-managed `~/.codex` etc.) would orphan checkpoints and duplicate history. Regression-tested both ways.
- `cursor_sqlite` requires `harness = "cursor"` (validated at config load; `.vscdb` format inference is harness-gated the same way the hermes `session_json` inference is).

Schema (`sql/015_sqlite_checkpoint_cursor.sql`)
- `ingest_checkpoints` gains `cursor_json String`, `source_fingerprint UInt64`, `schema_fingerprint UInt64` (additive, `IF NOT EXISTS`, default-valued). Pre-migration ingestors degrade gracefully: a startup warning explains cursors won't persist until migrate and restart.
- The transcript fetch dedups by `event_uid` (`LIMIT 1 BY`, newest `event_version` wins). `events` is partitioned by `toYYYYMM(ingested_at)` and `ReplacingMergeTree` never collapses across partitions; `cursor_sqlite` makes mutation-re-emit a steady-state path (a session resumed next month re-emits its composer row), so the read side now renders each event exactly once. See "Known limitations".

Operational impact
- … `moraine up` / `moraine db migrate`.
- The `cursor-sqlite` source ships enabled in both `default_sources()` (platform-aware: `~/Library/Application Support/Cursor/User` on macOS, `~/.config/Cursor/User` elsewhere) and the `config/moraine.toml` template. Machines without Cursor simply match nothing. Configs with their own explicit `[[ingest.sources]]` are unaffected. Cursor has no stable local-database contract: when the schema drifts, moraine reports rate-limited `sqlite_*` errors and skips the database rather than ingesting bad rows, and the normalizer gets updated to follow the format; tracking that drift is moraine's job. Two template-guard tests pin the default-on state.
- Docs: format table/matrix + a new "SQLite-Polled Sources" section in `docs/development/ingest-sources.md`; README/docs harness tables now list Cursor (pre-existing gap).
- Sandbox: `--mount-host-sessions` also mounts Cursor's User state dir read-only (`SANDBOX_CURSOR_STATE_DIR`, default `~/Library/Application Support/Cursor/User`).

Scope deviations from RFC #361 (flagged for owner veto)
- `agentKv:blob:*` (provider-wire duplicates, 66 MB observed), `checkpointId:*`, and `ItemTable` (holds live auth tokens, must never be ingested) are excluded from v1. Composer + bubble keys carry the full conversation content.

Known limitations
- Duplicate rows in `events` / `search_documents` storage remain possible for sessions mutated across a calendar-month boundary (partition key is `toYYYYMM(ingested_at)`). The primary transcript fetch dedups at read time (above); `v_conversation_trace` and search-side dedup need a follow-up (view rebuild migration). This is pre-existing behavior for hermes `session_json` rewrites; `cursor_sqlite` makes it more likely, hence the read-side mitigation now.

Validation
- `cargo test --workspace --locked`: all green (incl. 13 new `sqlite_poll` unit tests: first-poll emission, no-op short-circuit, stable-UID re-emit on mutation, NULL/BLOB/TEXT storage classes, ghost gating, read-only-directory immutable fallback, schema-mismatch once-only + checkpoint-quiescence, 10k ceiling, createdAt deferral, URI escaping; golden fixture coverage for all bubble modes).
- `make clippy` (strict baseline), `cargo fmt --all -- --check`: clean.
- `bash scripts/ci/e2e-stack.sh`: full stack pass, including new cursor-sqlite assertions: 4 raw rows / 5 events / 5 distinct kinds / title extraction / 2 tool_io on first ingest, then a live-update phase (INSERT bubble + rewrite composer against the running stack) asserting 6 events and `session_meta` still collapsed to 1; MCP smoke passes for all 8 sources.
- … (`read_file_v2`, `edit_file_v2`, MCP browser tools), 16 reasoning events, max payload 10.6 KB (screenshot elision working), zero ingest_errors, 12 ghost composers correctly absent, workspace-storage DBs (no relevant keys) polled cleanly via the immutable fallback.

Closes #361.
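To make the stable-UID and read-side dedup behavior described in this PR concrete, here is a minimal Python sketch. The UID format is the one quoted in the description; the `Event` record, `bubble_uid`, and `dedup_newest` are hypothetical names, and the in-memory dedup only mimics the effect of the `LIMIT 1 BY` query rather than reproducing it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_uid: str
    event_version: int
    payload: str


def bubble_uid(composer_id, bubble_id):
    """Mint a UID from logical identity, independent of the row's current content."""
    return "cursor_sqlite:bubble:{}:{}".format(composer_id, bubble_id)


def dedup_newest(events):
    """Keep one event per event_uid, preferring the highest event_version."""
    newest = {}
    for event in events:
        kept = newest.get(event.event_uid)
        if kept is None or event.event_version > kept.event_version:
            newest[event.event_uid] = event
    return list(newest.values())


# A streaming bubble re-emitted after mutation keeps its UID; only the version moves.
uid = bubble_uid("c1", "b1")
history = [
    Event(uid, 1, "partial text"),
    Event(bubble_uid("c1", "b2"), 1, "tool pending"),
    Event(uid, 2, "final text"),
]
rendered = dedup_newest(history)
```

Because the UID never depends on the payload, the second emission of bubble `b1` replaces the first instead of appearing as a new transcript entry.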
🤖 Generated with Claude Code