Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
50 changes: 50 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,55 @@ feature = Feature(
| Deduplication | `ExactHashImageDeduplicator`, `PerceptualHashImageDeduplicator`, `DifferenceHashImageDeduplicator` |
| Embedding | `MockImageEmbedder`, `HashImageEmbedder`, `CLIPImageEmbedder` |

## Connector families

Alongside the build-your-own stage pipeline, the `connectors/` package wraps
whole external open-source RAG tools under one mloda surface, organized into six
families by query-contract shape (retrieve, rerank, generate, graph_rag,
structured, orchestrator). You swap backends by changing options, not by
rewriting a pipeline.

The two layers share one seam: the FAISS `retrieval` stage is the native dense
path of the `retrieve` family (`retrieve_backend="faiss"`), and a stage and its
connector counterpart emit the same passage / answer row shape under the same
canonical feature name, so migrating between them is an option swap. See
"Relationship to the stage pipeline" in
[`docs/rag-connector-base-classes.md`](docs/rag-connector-base-classes.md).

See [`feature_groups/connectors/README.md`](rag_integration/feature_groups/connectors/README.md)
for the family map (per-family contract, backends, no-Docker concrete, and
pedigree), runnable examples, and links to the contract suites. The design
rationale is in [`docs/rag-connector-base-classes.md`](docs/rag-connector-base-classes.md).

```python
from mloda.user import mlodaAPI, Feature, Options, PluginCollector
from mloda_plugins.compute_framework.base_implementations.python_dict.python_dict_framework import (
PythonDictFramework,
)
from rag_integration.feature_groups.connectors.retrieve import Bm25sRetriever

feature = Feature(
"retrieved_passages",
options=Options(context={
"retrieve_backend": "bm25s",
"query_text": "cat pet",
"corpus": [
{"doc_id": "d1", "text": "A cat is an independent and curious pet."},
{"doc_id": "d2", "text": "Cars need regular engine oil and maintenance."},
],
"top_k": 3,
}),
)
results = mlodaAPI.run_all(
[feature],
compute_frameworks={PythonDictFramework},
plugin_collector=PluginCollector.enabled_feature_groups({Bm25sRetriever}),
)
```

Install a family's backend with `uv sync --extra connectors` (or `rerank` /
`graph` / `structured` / `orchestrator`).

## Installation

Clone the repository and install with uv:
Expand All @@ -140,6 +189,7 @@ To install only specific extras, use `uv sync --extra <name>`:
| `faiss` | FAISS vector indexing (`faiss-cpu`) |
| `advanced` | Presidio, sentence-transformers, joblib, Pillow, FAISS|
| `eval` | BEIR benchmark datasets, pandas, numpy |
| `graph` | networkx graph-RAG backend (`NetworkxGraphRag`) |
| `dev` | tox, pytest, ruff, mypy, bandit |

## CLI
Expand Down
282 changes: 282 additions & 0 deletions docs/rag-connector-base-classes.md

Large diffs are not rendered by default.

29 changes: 29 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -36,11 +36,40 @@ eval = [
"pandas>=1.3.0",
"numpy>=1.21.0",
]
# Connector families: wrap external open-source RAG tools (see issue #25).
# retrieve family canonical backend: bm25s (MIT, numpy-only, zero-download).
connectors = [
"bm25s>=0.2.0",
]
# rerank family pedigree backend: FlashRank (Apache-2.0, ONNX, no torch). The
# canonical lexical reranker is pure-Python and needs no extra; this adds the
# neural cross-encoder backend (model downloads on first use).
rerank = [
"flashrank>=0.2.0",
]
# graph_rag family canonical backend: networkx (BSD, pure-Python, zero-download).
graph = [
"networkx>=3.0",
]
# structured family: sqlglot (MIT, pure-Python, zero-download) parses/validates
# the generated SQL; execution is on the stdlib sqlite3 module.
structured = [
"sqlglot>=25",
]
# orchestrator family canonical backend: Haystack 2.x (Apache-2.0). Its in-memory
# BM25 pipeline is zero-download (no model, no server), so it runs in CI.
orchestrator = [
"haystack-ai>=2.0",
]

[tool.setuptools.packages.find]
where = ["."]
include = ["rag_integration*"]

[tool.setuptools.package-data]
# Ship the orchestrator R2R fixture-stub's canned-response JSON in the wheel.
"rag_integration.feature_groups.connectors.orchestrator" = ["fixtures/*.json"]

[tool.pytest.ini_options]
testpaths = ["rag_integration", "tests"]
python_files = ["test_*.py"]
Expand Down
146 changes: 146 additions & 0 deletions rag_integration/feature_groups/connectors/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,146 @@
# Connector families

The `connectors/` package wraps whole external open-source RAG tools under one
mloda surface, organized into families by **query-contract shape**. Each family
is a thin `Base<Family>Connector` FeatureGroup plus one or more concrete
backends gated by a per-family selector option (`retrieve_backend`,
`rerank_backend`, `generate_backend`, `graph_backend`, `structured_backend`,
`orchestrator_backend`), with an inheritable contract-test suite so a new
backend's test is a handful of adapter methods.

This sits alongside the build-your-own stage pipeline (`../rag_pipeline/`): the
stages let you assemble a pipeline step by step, the connectors let you drop in
one external tool that subsumes several steps. You swap retrievers, rerankers,
or generators by changing options, not by rewriting a pipeline.

For the design (how families are cut, the full landscape survey, and the base
classes) see [`docs/rag-connector-base-classes.md`](../../../docs/rag-connector-base-classes.md).
The shared cross-cutting mixins and error types live in [`mixins.py`](mixins.py)
and [`errors.py`](errors.py).

## Family map

The canonical concrete per family is the zero-download, deterministic backend
that anchors the CI contract suite. Pedigree tags: `real-lib-inmem` (a real
library running in-process), `fixture-stub` (deterministic stand-in, no model
download or server). The full survey in the design doc also uses
`real-lib-server` and `research-prototype`.

| Family | Reader contract (in -> out) | No-Docker concrete | Other backends | Pedigree of the anchor | Contract suite |
|---|---|---|---|---|---|
| [`retrieve`](retrieve/) | `query_text + corpus + top_k -> ranked passages` (`retrieved_passages: [{doc_id, text, score, rank}]`) | `Bm25sRetriever` (`bm25s`, zero-download lexical) | `TfidfRetriever` (vector-space lexical), `FaissDenseRetriever` (dense FAISS, `faiss` extra) | real-lib-inmem | [`retrieve_contract.py`](../../../tests/connectors/retrieve/retrieve_contract.py) |
| [`rerank`](rerank/) | `query_text + candidates + top_k -> reordered passages` (`reranked_passages`) | `LexicalReranker` (token overlap, zero-download) | `FlashRankReranker` (ONNX cross-encoder, `rerank` extra, CI-skip on model download) | fixture-stub | [`rerank_contract.py`](../../../tests/connectors/rerank/rerank_contract.py) |
| [`generate`](generate/) | `query_text + passages -> answer + citations` (`generated_answer: {answer, citations}`), grounded by construction | `ExtractiveResponder` (stdlib sentence extraction) | `TemplateResponder` (multi-citation template) | fixture-stub | [`generate_contract.py`](../../../tests/connectors/generate/generate_contract.py) |
| [`graph_rag`](graph_rag/) | `query_text + nodes + edges + top_k -> ranked passages` (`graph_passages`); query overlap + one-hop neighbour bonus | `AdjacencyGraphRag` (stdlib adjacency map, zero-download) | `NetworkxGraphRag` (`networkx`, `graph` extra); parity test pins identical ranking | fixture-stub | [`graph_rag_contract.py`](../../../tests/connectors/graph_rag/graph_rag_contract.py) |
| [`structured`](structured/) | `question + table -> SQL -> typed rows` (`structured_rows: {sql, rows}`); in-mem SQLite, single-SELECT `sqlglot` guard | `RuleBasedSql` (rule-based NL->SQL, `structured` extra) | `AggregateSql` (adds avg/min/max/sum intents) | fixture-stub | [`structured_contract.py`](../../../tests/connectors/structured/structured_contract.py) |
| [`orchestrator`](orchestrator/) | `query_text + corpus + top_k -> answer + documents` (internals opaque) (`orchestrated_answer: {answer, documents}`) | `HaystackOrchestrator` (Haystack 2.x BM25, offline, `orchestrator` extra) | `R2RFixtureOrchestrator` (file-fixture REST stub) | real-lib-inmem | [`orchestrator_contract.py`](../../../tests/connectors/orchestrator/orchestrator_contract.py) |

## Families in detail

### `retrieve` -- `query_text + corpus + top_k -> ranked passages`

Holds the vector-store / lexical / late-interaction backends (FAISS, Chroma,
bm25s, ColBERT, ...). The anchor `Bm25sRetriever` (`retrieve_backend="bm25s"`) is
BM25 lexical retrieval via `bm25s`: zero-download, deterministic, numpy/scipy.
`TfidfRetriever` (`retrieve_backend="tfidf"`) ranks the same corpus by TF-IDF
cosine similarity using the repo's deterministic embedder, also zero-download.
`FaissDenseRetriever` (`retrieve_backend="faiss"`, `--extra faiss`) is the
canonical **dense** backend: the same FAISS nearest-neighbor search the stage
pipeline's `retrieval` stage runs, folded in behind this contract (cosine over
the repo's deterministic hash embeddings, in-memory `IndexFlatIP`).

The FAISS `retrieval` stage and this family are one world: the stage serves the
same `retrieved_passages` shape from a pre-built on-disk index, so migrating
between stage and connector is an option swap, not a pipeline rewrite. See
"Relationship to the stage pipeline" in the
[design doc](../../../docs/rag-connector-base-classes.md) and the parity test
in [`tests/integration/test_stage_connector_parity.py`](../../../tests/integration/test_stage_connector_parity.py).

```python
from mloda.user import mlodaAPI, Feature, Options, PluginCollector
from mloda_plugins.compute_framework.base_implementations.python_dict.python_dict_framework import (
PythonDictFramework,
)
from rag_integration.feature_groups.connectors.retrieve import Bm25sRetriever

feature = Feature(
"retrieved_passages",
options=Options(context={
"retrieve_backend": "bm25s",
"query_text": "cat pet",
"corpus": [
{"doc_id": "d1", "text": "A cat is an independent and curious pet."},
{"doc_id": "d2", "text": "Cars need regular engine oil and maintenance."},
],
"top_k": 3,
}),
)
results = mlodaAPI.run_all(
[feature],
compute_frameworks={PythonDictFramework},
plugin_collector=PluginCollector.enabled_feature_groups({Bm25sRetriever}),
)
```

### `rerank` -- `query_text + candidates + top_k -> reordered passages`

Takes *candidates* in (already retrieved), not a corpus. `LexicalReranker`
(`rerank_backend="lexical"`) is pure-Python token overlap, zero-download.
`FlashRankReranker` (`rerank_backend="flashrank"`, `--extra rerank`) adds a real
ONNX cross-encoder; its model downloads on first use, so its test runs locally
and is skipped on CI.

### `generate` -- `query_text + passages -> answer + citations`

Returns prose plus citations, grounded by construction (every citation is one of
the supplied passages). `ExtractiveResponder` (`generate_backend="extractive"`)
does pure-Python sentence extraction with a single citation. `TemplateResponder`
(`generate_backend="template"`) selects top query-relevant sentences across
passages into a fixed template and cites every passage it drew from. LLM-backed
generators are pedigree backends for later.

### `graph_rag` -- `query_text + nodes + edges + top_k -> ranked passages`

Scores nodes by query overlap plus a one-hop neighbour bonus: a passage
connected to a relevant one is surfaced even with no query-term overlap.
`AdjacencyGraphRag` (`graph_backend="adjacency"`) applies the scoring over a
hand-built adjacency map with stdlib only. `NetworkxGraphRag`
(`graph_backend="networkx"`, `--extra graph`) does the same over `networkx`; a
parity test pins identical ranking, showing the contract is not tied to one
graph library.

### `structured` -- `question + table -> SQL -> typed rows`

Answers a natural-language question over a relational table. `RuleBasedSql`
(`structured_backend="rule_based"`, `--extra structured`) does rule-based NL->SQL
executed on stdlib `sqlite3`, with `sqlglot` validating the generated SQL is a
single top-level `SELECT`. Values are always bound parameters and identifiers
whitelisted. `AggregateSql` (`structured_backend="aggregate"`) adds aggregation
intents (avg/min/max/sum) on top of the count/filter/list intents.

### `orchestrator` -- `query_text + corpus + top_k -> answer + documents`

Wraps a whole external RAG framework as one connector (bring your existing
pipeline); the internals are the framework's. `HaystackOrchestrator`
(`orchestrator_backend="haystack"`, `--extra orchestrator`) runs a real Haystack
2.x in-memory BM25 pipeline, zero-download (no model, no server) so it runs in
CI. `R2RFixtureOrchestrator` (`orchestrator_backend="r2r"`) models a
server-shaped tool over a static JSON fixture, surfacing only canned documents
that are in the supplied corpus. Other server-shaped tools can follow the same
fixture-stub pattern.

## How a backend is selected

Each base gates on its selector option in `match_feature_group_criteria` (named
per family above; note `graph_rag` uses `graph_backend`, not
`graph_rag_backend`); backends declare disjoint selector values, so at most one
ever claims a given `Options`. An unknown backend matches nothing. The
base owns the cross-backend contract (option extraction, validation, assembly);
a concrete backend implements only its one ranking / generation hook.

## Install

```bash
uv sync --extra connectors # or --extra rerank / graph / structured / orchestrator
uv sync --extra faiss # dense retrieve backend (FaissDenseRetriever)
```
12 changes: 12 additions & 0 deletions rag_integration/feature_groups/connectors/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
"""Connector families: wrap external open-source RAG tools under one mloda surface.

Each family is a thin ``Base<Family>Connector`` FeatureGroup plus one or more
concrete backends, paired with an inheritable contract-test suite. Unlike the
stage pipeline under ``rag_pipeline/`` (build-your-own RAG from chained
FeatureGroups), a connector exposes a whole external retrieval/rerank/generate
tool through a single feature. See issue #25 for the family taxonomy and the
selection rationale.

Family axis is the query-contract shape; the canonical concrete per family is
the zero-download, deterministic backend that anchors the CI contract suite.
"""
36 changes: 36 additions & 0 deletions rag_integration/feature_groups/connectors/errors.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
"""Typed errors shared by the connector families.

All subclass ``ValueError`` (via :class:`ConnectorError`), so callers and
contract tests that catch ``ValueError`` keep working. Messages stay at the
raise site, where the per-family wording differs.
"""

from __future__ import annotations


class ConnectorError(ValueError):
"""Base for every connector-family validation / rejection error."""


class MissingOptionError(ConnectorError):
"""A required option is absent."""


class InvalidOptionError(ConnectorError):
"""An option has the wrong type or an unusable value."""


class DuplicateDocIdError(ConnectorError):
"""Two entries share an effective ``doc_id``."""


class RankingContractError(ConnectorError):
"""A backend ``_rank`` result violates the ranking contract."""


class GroundingError(ConnectorError):
"""An answer cites or surfaces something not in the supplied input."""


class SqlSafetyError(ConnectorError):
"""Backend SQL is unsafe or not a single bare ``SELECT``."""
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
"""The ``generate`` connector family: query + passages -> answer + citations."""

from __future__ import annotations

from rag_integration.feature_groups.connectors.generate.base import BaseGenerateConnector
from rag_integration.feature_groups.connectors.generate.extractive_responder import ExtractiveResponder
from rag_integration.feature_groups.connectors.generate.template_responder import TemplateResponder

__all__ = ["BaseGenerateConnector", "ExtractiveResponder", "TemplateResponder"]
21 changes: 21 additions & 0 deletions rag_integration/feature_groups/connectors/generate/_text.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
"""Shared text helpers for the ``generate`` family's no-LLM responders.

Both deterministic baselines (extractive and template) tokenize and
sentence-split the same way; this private module holds the single copy so the
two backends cannot drift apart. The helpers are intentionally tiny: an
ASCII-lowercase token set and a punctuation-based sentence splitter.
"""

from __future__ import annotations

import re

_TOKEN_RE = re.compile(r"[a-z0-9]+")

SENTENCE_RE = re.compile(r"[^.!?]+[.!?]?")
"""Punctuation-based sentence splitter (over-splits abbreviations)."""


def tokenize(text: str) -> set[str]:
"""Return the set of distinct lowercase ``[a-z0-9]+`` tokens in ``text``."""
return set(_TOKEN_RE.findall(text.lower()))
Loading
Loading