Merged
36 commits
e6d447b
feat(#876): existing-only re-diarization migration (#946) + audio cac…
chipi Jun 10, 2026
65a9f22
infra(dgx): #948 — speaches-gb10 custom build + CDI spec + verify
chipi Jun 10, 2026
5091147
feat(dgx): shared resilience layer for whisper + diarize clients (#954)
chipi Jun 10, 2026
75453df
chore(dgx): point Whisper client at openai-whisper :8002 (temporary, …
chipi Jun 10, 2026
8d51781
test(dgx): real-socket integration tests for provider resilience (#954)
chipi Jun 10, 2026
2e45d0b
test(dgx): e2e coverage for the DGX whisper + diarize clients (#954)
chipi Jun 10, 2026
feae5f8
feat(config): allow pyannote diarization on the OpenAI Whisper path (…
chipi Jun 10, 2026
611ecfd
test(viewer): strongly test graph-UI logic layer (stores + utils)
chipi Jun 10, 2026
b85502e
test(viewer): graph component layer — @vue/test-utils mount harness +…
chipi Jun 10, 2026
132cbbb
test(viewer): app-wide push — API layer to 100% coverage
chipi Jun 10, 2026
2d0e9cb
test(viewer): app-wide push — stores layer coverage
chipi Jun 10, 2026
af2a1cc
test(viewer): app-wide push — utils layer to ~100%
chipi Jun 10, 2026
c8039e7
chore(viewer): ratchet #914 coverage gate up after the test push
chipi Jun 10, 2026
e52f5fa
test(viewer): app-wide push — non-graph component mount tests
chipi Jun 10, 2026
11ef98d
test(viewer): app-wide push — shell containers + shared dialogs
chipi Jun 11, 2026
525923d
chore(viewer): correct #914 gate floor for the broadened coverage scope
chipi Jun 11, 2026
963c39e
docs(viewer): consolidated test tier map + wire into testing strategy
chipi Jun 11, 2026
55dec17
docs(viewer): flag synthetic-corpus Tier-3 gap in TESTING.md
chipi Jun 11, 2026
bffcab8
fix(test): point Tier-3 validation walk at the corpus VERSION dir
chipi Jun 11, 2026
5b72b3f
fix(search): lance hits carry publish_date (date-filter parity) + sch…
chipi Jun 11, 2026
91db5e4
build(test): build-validation-index also builds the LanceDB two-tier …
chipi Jun 11, 2026
941dbb6
fix(search): lance hits carry source_id so "Show on graph" works (Tie…
chipi Jun 11, 2026
4ef766d
test(e2e): Tier-3 V6 — assert search is served by LanceDB hybrid, not…
chipi Jun 11, 2026
57a4f40
docs(search): lance⇄FAISS metadata parity + schema self-heal + hybrid…
chipi Jun 11, 2026
157f0fc
chore(viewer): raise graph episode cap 15→25 (interim, pending #967)
chipi Jun 11, 2026
3447d0c
feat(dgx): shared hardened HTTP client — TCP keepalive + Connection: …
chipi Jun 11, 2026
5de176b
test(viewer): update episode-cap assertion to match the 15→25 bump (#…
chipi Jun 11, 2026
ab7a0a9
fix(ci): lint-markdown ignores web/gi-kg-viewer/validation-results/
chipi Jun 11, 2026
1994ece
docs(dgx): add missing docstrings on CircuitBreaker.record_success/fa…
chipi Jun 11, 2026
409fc72
fix(dgx): dgx_http_client raises a friendly error when httpx is missing
chipi Jun 11, 2026
5e917cf
docs(wip): next-session handoff for landing feat/946
chipi Jun 11, 2026
8c06161
test: isolate the #947 audio cache per test (fixes chaos 404 e2e + gl…
chipi Jun 11, 2026
d3ae2f8
fix(viewer): revert gesture-overlay Escape-to-dismiss (broke Escape-c…
chipi Jun 11, 2026
62d04e7
fix(search): sanitise stored_schema_version path (CodeQL py/path-inje…
chipi Jun 11, 2026
7a8013b
docs(ci): log dismissal of 3 py/path-injection FPs in stored_schema_v…
chipi Jun 11, 2026
21b7b3b
docs(ci): log dismissal of CodeQL #363 (stored_schema_version line-sh…
chipi Jun 11, 2026
100 changes: 86 additions & 14 deletions Makefile
@@ -336,6 +336,7 @@ MARKDOWNLINT_CLI_ARGS = "**/*.md" \
--ignore "data/eval/runs/**" \
--ignore "$(WEB_VIEWER_DIR)/playwright-report/**" \
--ignore "$(WEB_VIEWER_DIR)/test-results/**" \
--ignore "$(WEB_VIEWER_DIR)/validation-results/**" \
--ignore "tests/stack-test/playwright-report/**" \
--ignore "tests/stack-test/test-results/**" \
--config .markdownlint.json
@@ -1067,6 +1068,54 @@ redo-diarization:
echo "Re-deriving relational edges (corpus-wide SPOKEN_BY)..."; \
$(PYTHON) -m podcast_scraper.cli enrich-edges --output-dir "$${CORPUS_DIR}"

# Phase 1 of staged re-diarization (#947): download + cache RAW audio for the existing
# corpus episodes only, WITHOUT transcribing/diarizing. --reprocess-existing-only scopes
# to on-disk GUIDs; --reprocess-source forces the existing whisper episodes back through
# download; --pipeline-stage download_only stops before transcription. The audio lands in
# the durable cache (default .cache/audio, external to the corpus) so phase 2
# (migrate-diarization) reprocesses off the cache with no feed re-fetch. Set
# AUDIO_CACHE_IN_CORPUS=1 to store the cache inside the corpus for a portable snapshot.
# make download-audio CORPUS_DIR=<corpus> PROFILE=config/profiles/cloud_with_dgx_primary.yaml
download-audio:
@test -n "$${CORPUS_DIR:-}" || { echo "CORPUS_DIR required (corpus parent path)"; exit 1; }; \
test -n "$${PROFILE:-}" || { echo "PROFILE required (e.g. config/profiles/cloud_with_dgx_primary.yaml)"; exit 1; }; \
test -f "$${CORPUS_DIR}/feeds.spec.yaml" || { echo "Missing $${CORPUS_DIR}/feeds.spec.yaml"; exit 1; }; \
echo "Downloading + caching audio (existing-only, no transcription) for $${CORPUS_DIR}..."; \
$(HF_NET_ENV) $(PYTHON) -m podcast_scraper.cli \
--config "$${PROFILE}" \
--feeds-spec "$${CORPUS_DIR}/feeds.spec.yaml" \
--output-dir "$${CORPUS_DIR}" \
--skip-existing \
--reprocess-source whisper_transcription \
--reprocess-existing-only \
--pipeline-stage download_only \
$${AUDIO_CACHE_IN_CORPUS:+--audio-cache-in-corpus}

# Strict existing-only re-diarization MIGRATION (#876/#946). Like redo-diarization but
# adds --reprocess-existing-only: the episode set is restricted to GUIDs already on disk
# under each feed's run_*/metadata/. New feed items are dropped and max/offset/date caps
# are ignored, so the corpus NEVER grows -- a migration of existing data while ingestion
# is paused, not an ingest. Audio comes from the #947 raw-audio cache when present (run
# ``make download-audio`` first to prime it -> cache HIT, no feed fetch, and rolled-off
# episodes stay re-diarizable); otherwise it is re-fetched from the live feed enclosure
# and cached for next time. Aborts loudly if no on-disk GUIDs are found (mis-pointed
# CORPUS_DIR). Set AUDIO_CACHE_IN_CORPUS=1 to keep the cache inside the corpus.
# make migrate-diarization CORPUS_DIR=<corpus> PROFILE=config/profiles/cloud_with_dgx_primary.yaml
migrate-diarization:
@test -n "$${CORPUS_DIR:-}" || { echo "CORPUS_DIR required (corpus parent path)"; exit 1; }; \
test -n "$${PROFILE:-}" || { echo "PROFILE required (a diarization-enabled profile, e.g. config/profiles/cloud_with_dgx_primary.yaml)"; exit 1; }; \
test -f "$${CORPUS_DIR}/feeds.spec.yaml" || { echo "Missing $${CORPUS_DIR}/feeds.spec.yaml"; exit 1; }; \
echo "Migrating (existing-only) whisper_transcription episodes in $${CORPUS_DIR} under $${PROFILE}..."; \
$(HF_NET_ENV) $(PYTHON) -m podcast_scraper.cli \
--config "$${PROFILE}" \
--feeds-spec "$${CORPUS_DIR}/feeds.spec.yaml" \
--output-dir "$${CORPUS_DIR}" \
--skip-existing \
--reprocess-source whisper_transcription \
--reprocess-existing-only; \
echo "Re-deriving relational edges (corpus-wide SPOKEN_BY)..."; \
$(PYTHON) -m podcast_scraper.cli enrich-edges --output-dir "$${CORPUS_DIR}"

# Corpus upgrade-path framework (#862). Managed, idempotent migrations for moving a
# deployed corpus across releases (2.6 → 2.7 FAISS→LanceDB is step 0001). CORPUS_DIR
# selects the corpus parent. `upgrade-check` exits 2 when migrations are pending —
@@ -2081,6 +2130,13 @@ ci-ui-full:
echo ""; echo "=== ci-ui-full [$$(date '+%Y-%m-%d %H:%M:%S')] stack-test-ml-ci ==="; $(MAKE) stack-test-ml-ci; \
echo ""; echo "=== ci-ui-full DONE $$(date '+%Y-%m-%d %H:%M:%S') ==="; echo ""

# Synthetic validation corpus root — MUST include the FIXTURES_VERSION subdir
# (e.g. ``.../viewer-validation-corpus/v2``). The raw ``feeds/<feed>/metadata/``
# artifacts that serve-api computes episodes from live under the version dir, NOT
# the version-less parent — pointing the walk at the parent discovers 0 episodes
# (empty Library → every handoff spec fails on the first row-click).
VIEWER_VALIDATION_CORPUS := $(PWD)/tests/fixtures/viewer-validation-corpus/$(shell cat tests/fixtures/FIXTURES_VERSION 2>/dev/null || echo v2)

ci-ui-validation:
# Tier-3 real-backend validation walk (RFC-086 / ADR-095). Drives the
# viewer against a running ``make serve`` stack + an on-disk corpus.
@@ -2099,7 +2155,7 @@ ci-ui-validation:
echo " 1) Start server pointed at the repo root:"; \
echo " make serve-for-validation"; \
echo " 2) Run validation against the synthetic corpus:"; \
- echo " make ci-ui-validation CORPUS=$(PWD)/tests/fixtures/viewer-validation-corpus"; \
+ echo " make ci-ui-validation CORPUS=$(VIEWER_VALIDATION_CORPUS)"; \
exit 2; \
fi
@echo "=== ci-ui-validation against $(CORPUS) ==="
@@ -2115,30 +2171,46 @@ serve-for-validation:
# target points at the repo root so ``tests/fixtures/viewer-validation-corpus``
# is reachable. Use a separate terminal: ``make serve-for-validation``.
@echo "Starting API + UI with SERVE_OUTPUT_DIR=$(PWD) (repo root)."
- @echo "Synthetic validation corpus is at:"
- @echo " $(PWD)/tests/fixtures/viewer-validation-corpus"
+ @echo "Synthetic validation corpus root (pass this as CORPUS=):"
+ @echo " $(VIEWER_VALIDATION_CORPUS)"
@SERVE_OUTPUT_DIR=$(PWD) $(MAKE) -j2 serve-api serve-ui

build-validation-index:
- # Build the FAISS index + topic_clusters.json against the in-repo
- # synthetic validation corpus so V2 (digest topic-band) and V4
- # (dashboard topic-cluster chip) can be exercised. Run this BEFORE
- # ``make serve-for-validation`` if you want V2/V4 to work — V1/V5
- # do not require it.
- @CORPUS=$(if $(CORPUS),$(CORPUS),$(PWD)/tests/fixtures/viewer-validation-corpus); \
- echo "=== Building FAISS index at $$CORPUS/search ==="; \
+ # Build ALL search artifacts the Tier-3 walk needs, against the in-repo
+ # synthetic validation corpus. Run this BEFORE ``make serve-for-validation``.
+ # Unlocks the index-dependent specs (V1/V5 — Library/Digest handoffs — do NOT
+ # need any of this; they pass on the path fix alone):
+ # 1. FAISS flat index (vectors.faiss + sidecars) — required for V3
+ # (semantic search → Show on graph). Without it V3 is SKIPPED
+ # (test.skip on ``!indexJson.available``).
+ # 2. LanceDB two-tier index (search/lance_index/) — the current search
+ # layer on top of FAISS (native BM25 + hybrid RRF). serve-api uses it
+ # when present (hybrid_enabled default on) and falls back to FAISS if
+ # absent, so V3 passes either way — but building it makes the synthetic
+ # corpus match prod's two-tier layout and exercises the real hybrid path.
+ # 3. topic_clusters.json — required for V2 (digest topic-band) and V4
+ # (dashboard topic-cluster chip). Without it the Intelligence tab shows
+ # "Topic clusters not yet built" → no chips → V4 fails.
+ # All three live under <corpus>/search/ and are gitignored (binary,
+ # embedding-model-hash-keyed, regenerable) — never committed.
+ @CORPUS=$(if $(CORPUS),$(CORPUS),$(VIEWER_VALIDATION_CORPUS)); \
+ echo "=== 1/3 Building FAISS flat index at $$CORPUS/search ==="; \
$(PYTHON) -m podcast_scraper.cli index \
--output-dir $$CORPUS \
--rebuild --vector-faiss-index-mode flat
- @CORPUS=$(if $(CORPUS),$(CORPUS),$(PWD)/tests/fixtures/viewer-validation-corpus); \
- echo "=== Building topic_clusters.json at $$CORPUS/search ==="; \
+ @CORPUS=$(if $(CORPUS),$(CORPUS),$(VIEWER_VALIDATION_CORPUS)); \
+ echo "=== 2/3 Building LanceDB two-tier index at $$CORPUS/search/lance_index ==="; \
+ $(PYTHON) -m podcast_scraper.cli index-two-tier \
+ --output-dir $$CORPUS
+ @CORPUS=$(if $(CORPUS),$(CORPUS),$(VIEWER_VALIDATION_CORPUS)); \
+ echo "=== 3/3 Building topic_clusters.json at $$CORPUS/search ==="; \
$(PYTHON) -m podcast_scraper.cli topic-clusters \
--output-dir $$CORPUS \
--threshold 0.35
@echo ""
- @echo "Done. Now run:"
+ @echo "Done — FAISS + LanceDB two-tier + topic_clusters built. Now run:"
@echo " make serve-for-validation (terminal 1)"
- @echo " make ci-ui-validation CORPUS=\$$PWD/tests/fixtures/viewer-validation-corpus (terminal 2)"
+ @echo " make ci-ui-validation CORPUS=$(VIEWER_VALIDATION_CORPUS) (terminal 2)"

ci-clean: clean-all format-check lint lint-markdown type security complexity deadcode docstrings spelling check-test-policy preload-ml-models test test-ui test-ui-e2e build-viewer coverage-enforce docs build
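
The `VIEWER_VALIDATION_CORPUS` comment in the hunk above explains why the walk must target the versioned subdirectory rather than its parent. A minimal Python sketch of the same resolution, for illustration only: the function names are hypothetical and not part of the repository, and the `feeds/*/metadata` check simply mirrors the layout that comment describes.

```python
from pathlib import Path


def resolve_validation_corpus(repo_root: Path, default_version: str = "v2") -> Path:
    """Return the versioned synthetic-corpus root, e.g. .../viewer-validation-corpus/v2."""
    fixtures = repo_root / "tests" / "fixtures"
    try:
        # Same file the Makefile reads with `cat tests/fixtures/FIXTURES_VERSION`.
        version = (fixtures / "FIXTURES_VERSION").read_text(encoding="utf-8").strip()
    except OSError:
        # Mirrors the Makefile's `|| echo v2` fallback when the file is missing.
        version = default_version
    return fixtures / "viewer-validation-corpus" / version


def has_episode_metadata(corpus_root: Path) -> bool:
    """True if at least one feeds/<feed>/metadata/ directory sits directly under the root."""
    return any(p.is_dir() for p in corpus_root.glob("feeds/*/metadata"))
```

Under these assumptions, `has_episode_metadata` is false for the version-less parent even when the versioned directory is populated, which is the "0 episodes" failure mode the Makefile comment warns about.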

4 changes: 4 additions & 0 deletions config/profiles/cloud_balanced.yaml
@@ -30,6 +30,10 @@
# mitigations. gpt-4o-[mini-]transcribe blocked on chunking (#286).
transcription_provider: openai
openai_transcription_model: whisper-1
# #913: OpenAI Whisper (verbose_json) is eligible for the local pyannote diarization
# pass + named screenplay, but opt-in (off by default for cloud transcription). To
# produce a diarized/named screenplay here, set `diarize: true` — needs an HF token
# for pyannote and adds a local pyannote pass + the screenplay-reformat cascade (#925).

# Audio preprocessing — named preset in config/profiles/audio/ (#634 Scope 1).
# One edit to speech_optimal_v1.yaml updates every profile that references it.
3 changes: 3 additions & 0 deletions config/profiles/cloud_thin.yaml
@@ -8,6 +8,9 @@

transcription_provider: openai
openai_transcription_model: whisper-1
# #913: diarization (pyannote pass + named screenplay) is available on this openai
# path but opt-in (off by default) — set `diarize: true` to enable (needs an HF
# token; see cloud_balanced note).

audio_preprocessing_profile: speech_optimal_v1

14 changes: 10 additions & 4 deletions config/profiles/cloud_with_dgx_primary.yaml
@@ -17,10 +17,16 @@ transcription:
fallback: openai

dgx_tailnet_host: dgx-llm-1.tail6d0ed4.ts.net
- # Hugging Face repo ID (Speaches takes HF format). Pre-#814 this was the
- # Ollama-style "whisper-large-v3" tag, which Speaches doesn't recognize →
- # every transcribe call would 404 and force the cloud Whisper fallback.
- dgx_whisper_model: Systran/faster-whisper-large-v3
+ # 2026-06-10: DGX Whisper temporarily switched from Speaches (:8000,
+ # faster-whisper) to an openai-whisper service on :8002 while the GB10
+ # Speaches/ctranslate2 path is sorted out. The :8002 service is
+ # OpenAI-API-compatible (GET /v1/models → "large-v3") and takes the short
+ # model id "large-v3" (NOT the HF "Systran/faster-whisper-large-v3" id).
+ # It is ~10-15× slower per episode than Speaches — fine for re-diarization,
+ # and the #946 duration-scaled timeout already covers it. Revert both lines
+ # to port 8000 / Systran/faster-whisper-large-v3 once Speaches is restored.
+ dgx_whisper_port: 8002
+ dgx_whisper_model: large-v3
transcription_fallback_provider: openai

openai_transcription_model: whisper-1
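
The profile comment above notes that the `:8002` service expects the short id `large-v3`, not the Hugging Face id. A minimal sketch of a pre-flight check, assuming the service returns the standard OpenAI list shape (`{"data": [{"id": ...}]}`) from `GET /v1/models`. The helper names and the sample payload are illustrative assumptions, not code or output from the repository.

```python
import json


def served_model_ids(models_payload: str) -> list[str]:
    """Extract model ids from an OpenAI-style `GET /v1/models` JSON body."""
    body = json.loads(models_payload)
    return [entry["id"] for entry in body.get("data", []) if "id" in entry]


def check_profile_model(models_payload: str, configured_model: str) -> None:
    """Raise ValueError if the profile's model id is not among the served ids."""
    ids = served_model_ids(models_payload)
    if configured_model not in ids:
        raise ValueError(f"model {configured_model!r} not served; available: {ids}")


# Illustrative payload only (not captured from the real service).
sample = '{"object": "list", "data": [{"id": "large-v3", "object": "model"}]}'
check_profile_model(sample, "large-v3")  # passes silently
```

Running the same check with `Systran/faster-whisper-large-v3` raises, which is the mismatch to expect if only the port is changed and the model line is left at the Speaches value.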
18 changes: 18 additions & 0 deletions docs/architecture/TESTING_STRATEGY.md
@@ -285,6 +285,24 @@ directory).
| **In `make test`?** | No | Yes |
| **CI** | Job **`viewer-e2e`** (`.github/workflows/python-app.yml`); required for docs publish path on main | `test-e2e`, `test-e2e-fast`, etc. |

The viewer's **own** (TypeScript) test stack has more layers than this single
browser tier; the operator run-map — every tier, command, and corpus — lives in
**[`web/gi-kg-viewer/TESTING.md`](https://github.com/chipi/podcast_scraper/blob/main/web/gi-kg-viewer/TESTING.md)**.
In short:

- **Vitest unit + component** (`src/**/*.test.ts`; `@vue/test-utils` for `.vue`
mounts) under a v8 **coverage gate** (#914 — thresholds in `vite.config.ts`,
enforced by the CI `viewer-unit` job), a UI-coverage track parallel to the
Python coverage gate.
- **Tier-3 real-corpus validation walk** (`make ci-ui-validation CORPUS=…` via
`playwright.validation.config.ts`) — drives a live `make serve` stack against a
real corpus or the in-repo **synthetic validation corpus**
(`tests/fixtures/viewer-validation-corpus/`; see
[VIEWER_VALIDATION_CORPUS.md](https://github.com/chipi/podcast_scraper/blob/main/tests/fixtures/VIEWER_VALIDATION_CORPUS.md)
and [`e2e/validation/README.md`](https://github.com/chipi/podcast_scraper/blob/main/web/gi-kg-viewer/e2e/validation/README.md)).
Per RFC-086, a bug it surfaces must land a **Tier-2 matrix row** under
`e2e/handoff-production/` before the fix PR merges.

**Python API for the viewer** (`GET /api/search`, `/api/explore`, `/api/corpus/*`,
`POST /api/index/rebuild`, etc.) is validated at two pytest layers:

4 changes: 4 additions & 0 deletions docs/ci/CODEQL_DISMISSALS.md
@@ -256,6 +256,10 @@ number, file, line, date, and a short comment.
| 1 | #355 | server/routes/corpus_media.py | 85 | 2026-06-06 | Type 1: ``verified`` from ``normpath_if_under_root(target, root_s)`` immediately before ``FileResponse``. Same shape as corpus_text_file #228. Dismissed ``gh api`` (PR #898) |
| 1 | #357 | server/routes/corpus_media.py | 53 | 2026-06-06 | Type 1: ``_safe_media_target_str`` sink — ``safe_relpath_under_corpus_root(base, norm)`` after the ``media/`` prefix + suffix-allowlist guards; route confines ``root`` via ``resolve_corpus_path_param``. Traversal tests pass. Dismissed ``gh api`` (PR #901) |
| 1 | #358 | server/routes/corpus_media.py | 115 | 2026-06-06 | Type 1: stem-extension resolve (``_resolve_existing_media``) — each candidate re-verified by ``normpath_if_under_root`` + ``realpath`` containment before ``isfile``/``FileResponse``. Same class as #355. Dismissed ``gh api`` (PR #901) |
| 1 | #360 | search/backends/lancedb_backend.py | 332 | 2026-06-11 | Type 1: ``meta_path`` via ``normpath_if_under_root`` after ``safe_resolve_directory`` + ``safe_relpath_under_corpus_root`` (constant ``index_meta.json``) before ``os.path.isfile`` in ``stored_schema_version`` — identical shape to ``read_index_meta`` #338/#342, corpus root route-confined by ``resolve_corpus_path_param``. Dismissed ``gh api`` (PR #969) |
| 1 | #361 | search/backends/lancedb_backend.py | 335 | 2026-06-11 | Type 1: same sanitizer chain, ``open(meta_path)`` sink in ``stored_schema_version`` (PR #969) |
| 1 | #362 | search/backends/lancedb_backend.py | 359 | 2026-06-11 | Type 1: same sanitizer chain, schema-version helper filesystem sink on the route-confined corpus root (PR #969) |
| 1 | #363 | search/backends/lancedb_backend.py | 341 | 2026-06-11 | Type 1: same sanitizer chain, ``stored_schema_version`` sink re-numbered after the fix line-shift; ``meta_path`` via ``normpath_if_under_root``. Same shape as #338/#342 (PR #969) |

## Still open (not yet dismissed)

6 changes: 6 additions & 0 deletions docs/guides/E2E_TESTING_GUIDE.md
@@ -30,6 +30,12 @@ file remains **pytest** E2E.
| **vs FastAPI unit tests** | `tests/unit/podcast_scraper/server/test_viewer_*.py` cover **`/api/*`** JSON contracts; use Playwright when behavior depends on the **SPA** |
| **vs Vitest** | `web/gi-kg-viewer/src/utils/*.test.ts` cover **pure TS logic** (parsing, merge, metrics); `make test-ui` (~150 ms, no browser). Use Playwright for **rendered UI behavior** |

**Full viewer test tier map** (Vitest unit + `@vue/test-utils` component, the
**#914 coverage gate**, the mocked Playwright e2e above, and the **Tier-3
real-corpus validation walk** `make ci-ui-validation` against a live `make serve`
stack): see **[`web/gi-kg-viewer/TESTING.md`](https://github.com/chipi/podcast_scraper/blob/main/web/gi-kg-viewer/TESTING.md)**
and the local **[`e2e/validation/README.md`](https://github.com/chipi/podcast_scraper/blob/main/web/gi-kg-viewer/e2e/validation/README.md)**.

### Debugging UI issues and interpreting failures

The surface map is the shared **contract for accessible names, regions, and user entry paths**. When
34 changes: 34 additions & 0 deletions docs/guides/SEMANTIC_SEARCH_GUIDE.md
@@ -35,6 +35,40 @@ two-tier **`RetrievalLayer`** (BM25 + dense vector → RRF, with compound dedup)
Record); the KG's value is the relational edges above, consumed by viewer surfaces (PRD-033 /
RFC-094), not by ranking.

### Telling which backend served a query (hybrid vs FAISS fallback)

Because hybrid silently falls back to FAISS, "search works" does **not** mean hybrid is live —
a stale/broken `lance_index`, `hybrid_enabled: false`, or an embed failure all degrade to FAISS
invisibly. Two signals on `GET /api/search` distinguish them:

- **Score shape (definitive):** the hybrid path returns **RRF** scores `1/(60+rank)` — the top
hit is ≈ `0.016`–`0.03`, always **< 0.1**. FAISS returns a cosine similarity, top hit ≈ `1.0`.
- **Response fields:** hybrid sets `source_tier` (`insight` / `segment` / `aux`) on every hit
plus top-level `lift_stats` and `query_type`; the FAISS fallback leaves these unset/empty.

The Tier-3 walk's **V6** spec (`e2e/validation/real-corpus.spec.ts`) asserts exactly this, so a
regression to FAISS fails CI loudly. (Note: prod corpora carrying a legacy `lance_native/` dir
rather than `lance_index/` run on the FAISS fallback — reindex with `make index-two-tier` to
move them onto hybrid.)
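
The two signals above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the V6 spec itself: hits are modelled as dicts with a `score` key (an assumed field name), `source_tier` and the `0.1` bound come from the text above, and `rrf_score` uses the standard RRF sum with `k = 60`.

```python
HYBRID_TIERS = {"insight", "segment", "aux"}


def rrf_score(ranks: list[int], k: int = 60) -> float:
    """Reciprocal rank fusion: sum of 1/(k + rank) over the ranked lists containing the hit."""
    return sum(1.0 / (k + rank) for rank in ranks)


def served_by_hybrid(hits: list[dict]) -> bool:
    """Heuristic: RRF-shaped top score (< 0.1) and a known source_tier on every hit."""
    if not hits:
        raise ValueError("cannot classify an empty result set")
    top_score = max(hit["score"] for hit in hits)
    all_tiered = all(hit.get("source_tier") in HYBRID_TIERS for hit in hits)
    return top_score < 0.1 and all_tiered
```

A hit ranked first in one list scores `1/61` (about 0.016), and one ranked first in both BM25 and dense lists scores `2/61` (about 0.033), so both sit well below the `0.1` bound, while a cosine-similarity top hit near `1.0` does not.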

### Metadata parity + schema versioning (lance ⇄ FAISS)

A hybrid hit must carry the **same consumer-facing metadata fields** a FAISS row carries, or
shared consumers break silently on the hybrid path only. Established contract:

- `publish_date` — the shared `since`/date filter drops any hit lacking it (this zeroed out the
digest topic-bands, which always pass a `since` bound, under lance).
- `source_id` — the viewer's "Show on graph" affordance resolves the graph node from it; missing
it = no handoff.

**Rule:** any new field a consumer reads off a search hit must be added to **both** backends
(`search/backend.py` docs + lance schema in `backends/lancedb_backend.py`, populated in
`two_tier_indexer.py` *and* `migration.py`), and you must **bump `LANCE_SCHEMA_VERSION`**. The
version is stamped into `search/lance_index/index_meta.json`; `lance_index_is_stale()` then makes
old indexes self-heal — the read path skips a stale index (→ FAISS) and reindex moments
(`build_two_tier_index`, `migrate_faiss_to_lance`, upgrade migration 0002) rebuild rather than
upsert into incompatible tables.
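
The self-heal rule can be illustrated with a short sketch. It is not the project's `lance_index_is_stale()` implementation: the `schema_version` key inside `index_meta.json`, the constant's value, and both function names are hypothetical stand-ins chosen for this example.

```python
import json
from pathlib import Path

# Hypothetical value standing in for the real LANCE_SCHEMA_VERSION constant.
CURRENT_SCHEMA_VERSION = 2


def index_is_stale(corpus_root: Path, current: int = CURRENT_SCHEMA_VERSION) -> bool:
    """Treat a missing, unreadable, or version-mismatched index_meta.json as stale."""
    meta_path = corpus_root / "search" / "lance_index" / "index_meta.json"
    try:
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return True
    if not isinstance(meta, dict):
        return True
    # "schema_version" is an assumed key name for the stamped version.
    return meta.get("schema_version") != current


def choose_backend(corpus_root: Path) -> str:
    """Read path: skip a stale index and fall back to FAISS, as described above."""
    return "faiss" if index_is_stale(corpus_root) else "lance"
```

The design point this mirrors is that a version bump alone is enough to route reads away from an old index, so no consumer ever sees rows that lack a newly required field.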

## When to use it

- Cross-episode questions: “What do my podcasts say about X?” without exact keywords.