Problem
The current BM25 path_channel and content_channel retrieval loads all scoped chunks for the user namespace into Python memory before ranking. The scope is filtered by user, namespace, active document revision, exclusions, optional data_type, and optional signal_paths, but it is not first reduced by query terms in SQL.
This means large namespaces can pay roughly O(active scoped chunks) memory and CPU per BM25 channel per retrieve step. In workflow/agentic retrieval, decomposed queries may repeat this cost across multiple retrieve steps.
Current Behavior
Current path/content BM25 flow:
Postgres loads scoped chunks where search text is non-empty
-> Python filters rows by token intersection
-> rank_bm25 builds corpus in memory
-> Python BM25 scores and sorts
-> RRF fuses channels
term_channel already pushes query-token LIKE filters into SQL, but path_channel and content_channel do not currently use the existing content_search_tsv / path_search_tsv GIN-indexed fields as a candidate prefilter.
Proposed Direction
Use Postgres full-text search as a broad recall prefilter, then keep Python BM25 as the final channel reranker.
Target flow:
Build query tokens with the same tokenizer used by BM25
-> Postgres FTS OR prefilter on content_search_tsv/path_search_tsv
-> LIMIT to a generous candidate pool
-> Python BM25 reranks only candidate rows
-> existing RRF and agentic workflow remain unchanged
Important: the Postgres FTS predicate should match ANY query token, not require all tokens. Current Python BM25 admits rows with any query-token intersection, so an AND-style tsquery could hurt recall.
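A minimal sketch of what such an OR tsquery builder might look like. The name build_or_tsquery is hypothetical, and the input is assumed to be the token list returned by tokenize_query_for_ranker(query). Restricting tokens to lowercase letters and digits is an assumed safety rule, not something the current code is known to do. The resulting string would be passed to to_tsquery as a bound parameter, never interpolated into SQL.

```python
import re
from typing import Iterable, List, Optional

# Assumption: only plain alphanumeric tokens are treated as safe to place in a
# tsquery string; anything containing operators or punctuation is dropped.
_SAFE_TOKEN = re.compile(r"[a-z0-9]+")


def build_or_tsquery(tokens: Iterable[str]) -> Optional[str]:
    """Join safe, de-duplicated tokens with the tsquery OR operator ('|').

    Returns None when no safe token survives, so the caller can fall back.
    """
    kept: List[str] = []
    for token in tokens:
        token = token.strip().lower()
        if _SAFE_TOKEN.fullmatch(token) and token not in kept:
            kept.append(token)
    if not kept:
        return None
    return " | ".join(kept)
```

Using '|' rather than '&' keeps the "match ANY query token" behaviour described above: a chunk containing only one of the query tokens still becomes a candidate.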
Suggested Implementation Notes
- Add a helper to build a safe OR tsquery from tokenize_query_for_ranker(query).
- Use content_search_tsv for content_channel and path_search_tsv for path_channel.
- Use Postgres ranking only for candidate ordering, not final ranking.
- Add a configurable candidate limit, for example default RETRIEVAL_POSTGRES_FTS_CANDIDATE_LIMIT=2000.
- Fall back conservatively if query tokenization produces no safe tsquery or if FTS returns no candidates.
- Preserve existing public API response shape, RRF behavior, citations, result assembly, and workflow/agentic routing.
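The notes above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the chunk table name, the selected columns, the 'simple' text search configuration, the %(name)s parameter style, and the select_candidates helper are all hypothetical. The fallback shown (reverting to the existing full scoped load) is one reading of "fall back conservatively".

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple[int, str]  # hypothetical row shape: (chunk_id, search_text)

# Hypothetical candidate query for content_channel. Only content_search_tsv and
# the document.* scope columns come from the issue text; other names and the
# 'simple' configuration are assumptions. Exclusion, data_type and signal_paths
# filters are omitted for brevity but would remain in the WHERE clause.
CONTENT_CANDIDATE_SQL = """
SELECT chunk.id, chunk.content_search_text
FROM chunk
JOIN document ON document.current_job_result_id = chunk.job_result_id
WHERE document.user_id = %(user_id)s
  AND document.namespace = %(namespace)s
  AND document.status = 'active'
  AND chunk.content_search_tsv @@ to_tsquery('simple', %(tsquery)s)
ORDER BY ts_rank(chunk.content_search_tsv, to_tsquery('simple', %(tsquery)s)) DESC
LIMIT %(candidate_limit)s
"""


def select_candidates(
    tsquery: Optional[str],
    fetch_fts: Callable[[str, int], List[Row]],
    fetch_scoped: Callable[[], List[Row]],
    candidate_limit: int = 2000,
) -> Tuple[List[Row], bool]:
    """Return (rows, fallback_used) for one BM25 channel."""
    if tsquery:
        rows = fetch_fts(tsquery, candidate_limit)
        if rows:
            # Postgres ordering only decides who gets into the pool;
            # Python BM25 still produces the final ranking afterwards.
            return rows[:candidate_limit], False
    # Assumed fallback: the existing full scoped load.
    return fetch_scoped(), True
```

Passing the two fetchers in as callables keeps the sketch independent of any particular database driver and makes the fallback path easy to exercise in tests.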
Acceptance Criteria
- path_channel and content_channel no longer load every scoped chunk before BM25 ranking for normal non-empty queries.
- SQL uses the existing active document scope:
  - document.user_id
  - document.namespace
  - document.status = active
  - document.current_job_result_id = chunk.job_result_id
  - exclude_document_ids
  - exclude_sections
  - data_type chunk filters
  - signal_paths / filter_mode
- Prefilter uses OR semantics over query tokens to preserve current BM25 recall behavior.
- Python BM25 remains the final per-channel ranker after prefiltering.
- Candidate pool size is bounded and configurable.
- Add logs/metrics per channel:
  - scoped/candidate count where available
  - candidate limit
  - ranked count
  - duration
  - fallback used
- Add contract tests covering:
  - content channel FTS prefilter
  - path channel FTS prefilter
  - exclusions still apply
  - data_type filters still apply
  - fallback behavior
  - large synthetic corpus ranks only bounded candidates
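As an illustration of the per-channel logs/metrics criterion, one possible shape for the record is sketched below. The ChannelMetrics class, the log_channel_metrics helper, the logger name, and the field names are all hypothetical; only the list of fields comes from the criteria above.

```python
import logging
from dataclasses import asdict, dataclass
from typing import Optional

logger = logging.getLogger("retrieval.bm25")  # hypothetical logger name


@dataclass
class ChannelMetrics:
    """One record per BM25 channel per retrieve step (hypothetical shape)."""

    channel: str            # e.g. "content_channel" or "path_channel"
    candidate_limit: int    # configured upper bound on the candidate pool
    ranked_count: int       # rows actually scored by Python BM25
    duration_ms: float      # wall-clock time for prefilter plus rerank
    fallback_used: bool     # True when the FTS prefilter was bypassed
    candidate_count: Optional[int] = None  # only "where available"


def log_channel_metrics(metrics: ChannelMetrics) -> dict:
    """Emit the record as a single log line and return it as a dict."""
    payload = asdict(metrics)
    logger.info("bm25_channel_metrics %s", payload)
    return payload
```

A contract test for the bounded-candidates criterion could then assert that ranked_count never exceeds candidate_limit whenever fallback_used is False.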
Follow-Up Options
If this is still not enough, evaluate Postgres BM25 extensions or an external search engine. This issue should stay scoped to using Postgres FTS as a candidate generator while preserving Python BM25 reranking.