Optimize BM25 retrieval candidate loading with Postgres FTS prefilter #195

Description

@suguanYang

Problem

The current BM25 path_channel and content_channel retrieval loads all scoped chunks for the user's namespace into Python memory before ranking. The scope is filtered by user, namespace, active document revision, exclusions, optional data_type, and optional signal_paths, but SQL does not first narrow it by query terms.

This means large namespaces can pay roughly O(active scoped chunks) memory and CPU per BM25 channel per retrieve step. In workflow/agentic retrieval, decomposed queries may repeat this cost across multiple retrieve steps.

Current Behavior

Current path/content BM25 flow:

Postgres loads scoped chunks where search text is non-empty
-> Python filters rows by token intersection
-> rank_bm25 builds corpus in memory
-> Python BM25 scores and sorts
-> RRF fuses channels

term_channel already pushes query-token LIKE filters into SQL, but path_channel and content_channel do not yet use the existing GIN-indexed content_search_tsv / path_search_tsv columns as a candidate prefilter.

Proposed Direction

Use Postgres full-text search as a broad recall prefilter, then keep Python BM25 as the final channel reranker.

Target flow:

Build query tokens with the same tokenizer used by BM25
-> Postgres FTS OR prefilter on content_search_tsv/path_search_tsv
-> LIMIT to a generous candidate pool
-> Python BM25 reranks only candidate rows
-> existing RRF and agentic workflow remain unchanged
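
The prefilter step of the target flow could be sketched as a parameterized query builder. This is an illustration under assumptions, not the project's actual query: the table names `chunk` and `document`, the `'simple'` text search configuration, the literal `'active'`, and the function name `build_candidate_sql` are all hypothetical. Only content_search_tsv, path_search_tsv, and the scope columns come from this issue. Exclusions, data_type, and signal_paths filters are left out for brevity and would be ANDed into the same WHERE clause.

```python
# Hypothetical sketch: table names, the 'simple' config, and the helper name
# are assumptions made for illustration only.
CHANNEL_TSV_COLUMNS = {
    "content_channel": "content_search_tsv",
    "path_channel": "path_search_tsv",
}


def build_candidate_sql(channel: str) -> str:
    """Return a parameterized candidate query for one BM25 channel."""
    # The column name comes from a fixed whitelist, never from user input.
    tsv = CHANNEL_TSV_COLUMNS[channel]
    return f"""
        SELECT chunk.*
        FROM chunk
        JOIN document
          ON document.current_job_result_id = chunk.job_result_id
        WHERE document.user_id = %(user_id)s
          AND document.namespace = %(namespace)s
          AND document.status = 'active'
          AND chunk.{tsv} @@ to_tsquery('simple', %(tsquery)s)
        -- Postgres rank only orders the candidate pool; Python BM25 ranks last.
        ORDER BY ts_rank(chunk.{tsv}, to_tsquery('simple', %(tsquery)s)) DESC
        LIMIT %(candidate_limit)s
    """
```

The `%(name)s` placeholders follow the psycopg named-parameter style, so the tsquery string and the limit are bound by the driver rather than interpolated into the SQL text.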

Important: the Postgres FTS predicate should match ANY query token, not require all tokens. Current Python BM25 admits rows with any query-token intersection, so an AND-style tsquery could hurt recall.
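
A minimal sketch of an OR-style tsquery builder, assuming the tokens have already been produced by tokenize_query_for_ranker(query). The helper name `build_or_tsquery`, the `max_terms` cap, and the ASCII-only safety rule are assumptions for illustration. A real implementation would need to agree with however content_search_tsv and path_search_tsv were tokenized.

```python
import re
from typing import Iterable, Optional

# Conservative assumption: only plain ASCII word tokens are treated as safe.
_SAFE_TOKEN = re.compile(r"[a-z0-9_]+")


def build_or_tsquery(tokens: Iterable[str], max_terms: int = 32) -> Optional[str]:
    """Join safe, de-duplicated tokens with '|' so ANY token can match.

    Returns None when no token is safe, which signals the caller to fall back.
    """
    seen = set()
    terms = []
    for token in tokens:
        token = token.lower()
        if token in seen or not _SAFE_TOKEN.fullmatch(token):
            continue
        seen.add(token)
        # Quoting each lexeme keeps tsquery operators out of the term itself.
        terms.append(f"'{token}'")
        if len(terms) >= max_terms:
            break
    return " | ".join(terms) if terms else None
```

For example, `build_or_tsquery(["Foo", "bar", "foo", "a&b"])` returns `'foo' | 'bar'`: the duplicate and the token containing an operator character are dropped, and the remaining terms are ORed rather than ANDed.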

Suggested Implementation Notes

  • Add a helper to build a safe OR tsquery from tokenize_query_for_ranker(query).
  • Use content_search_tsv for content_channel and path_search_tsv for path_channel.
  • Use Postgres ranking only for candidate ordering, not final ranking.
  • Add a configurable candidate limit, for example RETRIEVAL_POSTGRES_FTS_CANDIDATE_LIMIT with a default of 2000.
  • Fall back conservatively if query tokenization produces no safe tsquery or if FTS returns no candidates.
  • Preserve existing public API response shape, RRF behavior, citations, result assembly, and workflow/agentic routing.
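
The limit and fallback notes above could be sketched as follows. This assumes that "fall back" means reverting to the current full scoped load, which the issue does not spell out. The function names and the injected `fetch_fts` / `fetch_all_scoped` callables are hypothetical stand-ins for the real database access code.

```python
import os
from typing import Callable, List, Mapping, Optional, Tuple

# Default taken from the example value suggested in this issue.
DEFAULT_CANDIDATE_LIMIT = 2000


def candidate_limit_from_env(env: Mapping[str, str] = os.environ) -> int:
    """Read the candidate limit, using the default for missing or bad values."""
    raw = env.get("RETRIEVAL_POSTGRES_FTS_CANDIDATE_LIMIT", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_CANDIDATE_LIMIT
    return value if value > 0 else DEFAULT_CANDIDATE_LIMIT


def load_candidates(
    tsquery: Optional[str],
    fetch_fts: Callable[[str, int], List],
    fetch_all_scoped: Callable[[], List],
    limit: int,
) -> Tuple[List, bool]:
    """Return (rows, fallback_used) for one BM25 channel."""
    if tsquery is None:
        # No safe tsquery could be built from the query tokens.
        return fetch_all_scoped(), True
    rows = fetch_fts(tsquery, limit)
    if not rows:
        # FTS found nothing; assumed fallback is the existing scoped load.
        return fetch_all_scoped(), True
    return rows, False
```

Returning the `fallback_used` flag alongside the rows keeps the decision visible to the caller, so it can be logged per channel without the loader knowing about metrics.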

Acceptance Criteria

  • path_channel and content_channel no longer load every scoped chunk before BM25 ranking for normal non-empty queries.
  • SQL uses the existing active document scope:
    • document.user_id
    • document.namespace
    • document.status = active
    • document.current_job_result_id = chunk.job_result_id
    • exclude_document_ids
    • exclude_sections
    • data_type chunk filters
    • signal_paths / filter_mode
  • Prefilter uses OR semantics over query tokens to preserve current BM25 recall behavior.
  • Python BM25 remains the final per-channel ranker after prefiltering.
  • Candidate pool size is bounded and configurable.
  • Add logs/metrics per channel:
    • scoped/candidate count where available
    • candidate limit
    • ranked count
    • duration
    • fallback used
  • Add contract tests covering:
    • content channel FTS prefilter
    • path channel FTS prefilter
    • exclusions still apply
    • data_type filters still apply
    • fallback behavior
    • large synthetic corpus ranks only bounded candidates
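
The per-channel logs/metrics criterion could be met with a small record like the one below. This is a sketch only: the `ChannelStats` class, its field names, the logger name, and the log message format are assumptions, chosen to mirror the fields listed above.

```python
import logging
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical logger name; the project's real logging setup may differ.
logger = logging.getLogger("retrieval.bm25")


@dataclass
class ChannelStats:
    channel: str              # "path_channel" or "content_channel"
    candidate_limit: int      # configured upper bound on the candidate pool
    candidate_count: int      # rows returned by the prefilter (or fallback)
    ranked_count: int         # rows actually scored by Python BM25
    duration_ms: float
    fallback_used: bool
    scoped_count: Optional[int] = None  # recorded only where available


def log_channel_stats(stats: ChannelStats) -> dict:
    """Emit one structured log line per channel and return the payload."""
    payload = asdict(stats)
    logger.info("bm25_channel_stats %s", payload)
    return payload
```

A contract test for the bounded-candidate criterion could then assert that `ranked_count` never exceeds `candidate_limit` when `fallback_used` is false.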

Follow-Up Options

If this is still not enough, evaluate Postgres BM25 extensions or an external search engine. This issue should stay scoped to using Postgres FTS as a candidate generator while preserving Python BM25 reranking.
