Skip to content

feat: fold FAISS path in as canonical dense backend of retrieve (#36)#69

Closed
TKaltofen wants to merge 9 commits into
mloda-ai:mainfrom
TomKaltofen:feat/faiss-dense-retrieve-backend
Closed

feat: fold FAISS path in as canonical dense backend of retrieve (#36)#69
TKaltofen wants to merge 9 commits into
mloda-ai:mainfrom
TomKaltofen:feat/faiss-dense-retrieve-backend

Conversation

@TKaltofen

Copy link
Copy Markdown
Collaborator

Closes #36.

What

  • Dense backend: FaissDenseRetriever (retrieve_backend="faiss") folds the repo's existing FAISS retrieval path in behind the retrieve family contract: corpus and query embedded with the stage pipeline's deterministic HashEmbedder, in-memory IndexFlatIP, positive cosine scores best-first. The family now has lexical (bm25s, tfidf) and dense (faiss) backends under one contract. Uses the existing faiss extra.
  • Same row shape, same feature name, same rules: the retrieval stage also serves retrieved_passages ([{doc_id, text, score, rank}]) when index_path is present, with true cosine scores (the repo's embedders are unit-normalized, so FAISS squared L2 converts exactly) and the family's only-positive-scores rule, so a no-match query yields no passages on either path. The llm_response stage also serves generated_answer ({answer, citations}). Both stages yield the canonical name to an explicit connector selector, so mixed options cannot double-match.
  • Seam written down: the design doc's "Relationship to the stage pipeline" section now pins the seam (stages = build-your-own, connectors = bring an existing tool, the FAISS stage is the native dense path of retrieve, migration = option swap); connectors README and top-level README updated.

Verification

  • tests/connectors/retrieve/test_faiss_retriever.py: full inherited contract suite for the dense backend (20 tests).
  • tests/integration/test_stage_connector_parity.py: stage and connector answer the same query over the same documents with identical shape, ordering, and cosine scores; no-match parity; canonical-name and selector-key pinning; mixed-options yield.
  • tox green: 613 passed, 7 skipped; ruff format/check, mypy --strict, bandit all clean.

Review pass

Two independent reviews (Claude Opus agent, codex) ran on the first commit; the second commit addresses the accepted findings: contract-compatible stage scores with no-match filtering, blank metadata doc_id positional fallback, and yield-to-connector matching gates.

…structured, orchestrator) (#31)

feat: RAG connector families (retrieve, rerank, generate, graph_rag, structured, orchestrator)
* docs: add Phase 0 RAG connector survey and family map
#35) (#40)

* refactor: extract shared connectors/mixins.py and connectors/errors.py
…#33) (#41)

* docs: add connectors family-map README; link from top-level README

Phase 2 (#33): move the per-family connector prose from the top-level
README into rag_integration/feature_groups/connectors/README.md as a
family-map table (contract, backends, no-Docker concrete, pedigree) plus
per-family detail, cross-linked to docs/rag-connector-base-classes.md and
each family's contract suite. The top-level README now links there instead
of duplicating the detail (one source of truth).

* docs: clarify per-family selector keys and pedigree/dep wording in connectors README
…roups

Upstream mloda will validate PROPERTY_MAPPING defaults at class-definition
time (FeatureChainParser.validate_property_mapping_defaults called from
FeatureGroup.__init_subclass__). Add a repo-wide test that applies the same
invariant now: every strict default must be one of the key's accepted values.
Delegates to the upstream validator when the installed mloda provides it,
otherwise replicates the logic on FeatureChainParser helpers from 0.8.x.
Review findings: walk_packages silently ignores ImportError without onerror,
and the per-module except hid internal import regressions. Collect every
import failure and fail the test. Tighten the discovery floor to the current
count of 74 and add a validation_function case to the anti-vacuous guard.
Closes #37. The strategy issue (#25) proposed rag/ while PR #31 shipped
feature_groups/connectors/; this records the shipped path as canonical and
maps stale rag/ references. Also lists mixins.py and errors.py in the
package layout.
FaissDenseRetriever (retrieve_backend="faiss") runs the stage pipeline's
FAISS nearest-neighbor search behind the retrieve family contract: hash
embeddings, in-memory IndexFlatIP, positive cosine scores best-first.

The retrieval and llm_response stages now also emit the canonical
retrieved_passages / generated_answer shapes and serve those feature
names (gated on their defining options), so migrating between a stage
and a connector is an option swap, not a pipeline rewrite. Parity is
verified end to end by tests/integration/test_stage_connector_parity.py,
and the migration seam is written down in the design doc and READMEs.
… contract

Review follow-ups: the retrieval stage's canonical passages now carry
true cosine scores (1 - squared_L2 / 2 over the repo's unit-normalized
embeddings) and apply the family's only-positive-scores rule, so a
no-match query yields no passages on both paths; blank metadata doc_ids
fall back to the FAISS index position; and both stages yield the
canonical feature name to an explicit connector selector so mixed
options cannot double-match. Parity test extended accordingly.
@TKaltofen

Copy link
Copy Markdown
Collaborator Author

Opened against the wrong base repo; this branch's issue (#36) lives on the TomKaltofen fork. Reopening there.

@TKaltofen TKaltofen closed this Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant