feat: fold FAISS path in as canonical dense backend of retrieve (#36)#69
Closed
TKaltofen wants to merge 9 commits into
Closed
feat: fold FAISS path in as canonical dense backend of retrieve (#36)#69TKaltofen wants to merge 9 commits into
TKaltofen wants to merge 9 commits into
Conversation
…structured, orchestrator) (#31) feat: RAG connector families (retrieve, rerank, generate, graph_rag, structured, orchestrator)
…#33) (#41) * docs: add connectors family-map README; link from top-level README Phase 2 (#33): move the per-family connector prose from the top-level README into rag_integration/feature_groups/connectors/README.md as a family-map table (contract, backends, no-Docker concrete, pedigree) plus per-family detail, cross-linked to docs/rag-connector-base-classes.md and each family's contract suite. The top-level README now links there instead of duplicating the detail (one source of truth). * docs: clarify per-family selector keys and pedigree/dep wording in connectors README
…roups Upstream mloda will validate PROPERTY_MAPPING defaults at class-definition time (FeatureChainParser.validate_property_mapping_defaults called from FeatureGroup.__init_subclass__). Add a repo-wide test that applies the same invariant now: every strict default must be one of the key's accepted values. Delegates to the upstream validator when the installed mloda provides it, otherwise replicates the logic on FeatureChainParser helpers from 0.8.x.
Review findings: walk_packages silently ignores ImportError without onerror, and the per-module except hid internal import regressions. Collect every import failure and fail the test. Tighten the discovery floor to the current count of 74 and add a validation_function case to the anti-vacuous guard.
FaissDenseRetriever (retrieve_backend="faiss") runs the stage pipeline's FAISS nearest-neighbor search behind the retrieve family contract: hash embeddings, in-memory IndexFlatIP, positive cosine scores best-first. The retrieval and llm_response stages now also emit the canonical retrieved_passages / generated_answer shapes and serve those feature names (gated on their defining options), so migrating between a stage and a connector is an option swap, not a pipeline rewrite. Parity is verified end to end by tests/integration/test_stage_connector_parity.py, and the migration seam is written down in the design doc and READMEs.
… contract Review follow-ups: the retrieval stage's canonical passages now carry true cosine scores (1 - squared_L2 / 2 over the repo's unit-normalized embeddings) and apply the family's only-positive-scores rule, so a no-match query yields no passages on both paths; blank metadata doc_ids fall back to the FAISS index position; and both stages yield the canonical feature name to an explicit connector selector so mixed options cannot double-match. Parity test extended accordingly.
Collaborator
Author
|
Opened against the wrong base repo; this branch's issue (#36) lives on the TomKaltofen fork. Reopening there. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #36.
What
FaissDenseRetriever(retrieve_backend="faiss") folds the repo's existing FAISS retrieval path in behind theretrievefamily contract: corpus and query embedded with the stage pipeline's deterministicHashEmbedder, in-memoryIndexFlatIP, positive cosine scores best-first. The family now has lexical (bm25s,tfidf) and dense (faiss) backends under one contract. Uses the existingfaissextra.retrievalstage also servesretrieved_passages([{doc_id, text, score, rank}]) whenindex_pathis present, with true cosine scores (the repo's embedders are unit-normalized, so FAISS squared L2 converts exactly) and the family's only-positive-scores rule, so a no-match query yields no passages on either path. Thellm_responsestage also servesgenerated_answer({answer, citations}). Both stages yield the canonical name to an explicit connector selector, so mixed options cannot double-match.retrieve, migration = option swap); connectors README and top-level README updated.Verification
tests/connectors/retrieve/test_faiss_retriever.py: full inherited contract suite for the dense backend (20 tests).tests/integration/test_stage_connector_parity.py: stage and connector answer the same query over the same documents with identical shape, ordering, and cosine scores; no-match parity; canonical-name and selector-key pinning; mixed-options yield.toxgreen: 613 passed, 7 skipped; ruff format/check, mypy --strict, bandit all clean.Review pass
Two independent reviews (Claude Opus agent, codex) ran on the first commit; the second commit addresses the accepted findings: contract-compatible stage scores with no-match filtering, blank metadata doc_id positional fallback, and yield-to-connector matching gates.