feat(mcp): add Prompts API and ragflow_list_datasets tool for intelligent retrieval routing#13440

Open
Louisym wants to merge 4 commits into infiniflow:main from Louisym:my-branch

Conversation


Louisym commented Mar 6, 2026

Summary

This PR makes the RAGFlow MCP Server "white-box" to connected LLM clients by implementing
the MCP Prompts specification and adding a dataset discovery tool.

Before: clients saw a single opaque ragflow_retrieval tool with no guidance on which
knowledge bases exist, how to tune parameters, or how to recover from failures.

After: clients can request a live SOP prompt (ragflow_retrieval_skill) that injects:

  • A real-time snapshot of all available knowledge bases (id, name, description, doc/chunk counts)
  • Parameter tuning guide for RAGFlow's hybrid retrieval (Dense + BM25 + RRF)
  • Routing decision rules (precise KB targeting vs. global fallback)
  • Self-healing rules (empty results → lower threshold; 401 → check API key; 500 → retry once)
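The four modules above can be sketched as a single prompt-assembly function. This is an illustrative sketch only, not the PR's actual implementation; the function name, module text, and intent-to-threshold mapping are assumptions:

```python
def build_retrieval_skill_prompt(datasets: list[dict], intent: str = "auto") -> str:
    """Assemble a 4-module SOP prompt with a live KB snapshot (illustrative)."""
    # Module 1: real-time snapshot of available knowledge bases
    snapshot = "\n".join(
        f"- {d['id']} | {d['name']} | {d.get('description', '')} "
        f"({d.get('document_count', 0)} docs, {d.get('chunk_count', 0)} chunks)"
        for d in datasets
    )
    # Module 2: parameter tuning guide for hybrid retrieval (Dense + BM25 + RRF)
    tuning = (
        "Tune similarity_threshold and top_k to the question; hybrid retrieval "
        "fuses dense vectors and BM25 rankings via RRF."
    )
    # Module 3: routing rules -- the intent argument changes the threshold guidance
    threshold = {"precise": 0.4, "broad": 0.1}.get(intent, 0.2)  # assumed values
    routing = (
        "Route to a single KB when the question clearly names its domain; "
        f"otherwise search all KBs. Start with threshold {threshold}."
    )
    # Module 4: self-healing rules
    healing = (
        "Empty results -> lower the threshold; "
        "HTTP 401 -> check the API key; HTTP 500 -> retry once."
    )
    return "\n\n".join([
        "## Available knowledge bases\n" + snapshot,
        "## Parameter tuning\n" + tuning,
        "## Routing rules\n" + routing,
        "## Self-healing\n" + healing,
    ])
```

Because the snapshot is rebuilt from the dataset list on every call, the prompt stays current as knowledge bases are added or removed.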

Changes

mcp/server/server.py

  • New tool: ragflow_list_datasets — lets the LLM proactively refresh the KB list and
    get full metadata (name, document_count, chunk_count, embedding_model, language)
  • New: @app.list_prompts() — exposes ragflow_retrieval_skill prompt with optional
    intent argument (precise / broad / auto)
  • New: @app.get_prompt() — dynamically assembles a 4-module SOP prompt with a live
    KB snapshot on every call
  • Refactor: list_datasets() — extended return fields and extracted shared helpers
    to eliminate code duplication; also populates the dataset metadata cache as a side effect
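A minimal sketch of what the tool's metadata normalization might look like. The helper names mirror those mentioned in the commit message, but the field handling and serialization shown here are assumptions, not the PR's actual code:

```python
import json


def _normalize_dataset_item(raw: dict) -> dict:
    """Project a raw RAGFlow dataset record onto the metadata fields that
    ragflow_list_datasets returns (field defaults are illustrative)."""
    return {
        "id": raw.get("id"),
        "name": raw.get("name"),
        "description": raw.get("description") or "",
        "document_count": raw.get("document_count", 0),
        "chunk_count": raw.get("chunk_count", 0),
        "embedding_model": raw.get("embedding_model", ""),
        "language": raw.get("language", ""),
    }


def list_datasets_payload(raw_items: list[dict]) -> str:
    """Serialize the normalized datasets as the JSON array the tool returns."""
    return json.dumps([_normalize_dataset_item(r) for r in raw_items],
                      ensure_ascii=False)
```

Sharing one normalizer between the tool and the prompt snapshot is what lets `list_datasets()` populate the metadata cache as a side effect without duplicating field logic.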

test/benchmark/mcp_skill_eval.py (new)

  • A/B benchmark script comparing global search (baseline) vs. precise KB routing (with Skill)
  • Integrates ragas (context_precision, context_recall) with OpenAI gpt-4o as judge LLM
  • Reuses existing test/benchmark/ infrastructure (HttpClient, dataset helpers, metrics)
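The shape of the A/B comparison can be sketched as a small aggregation step over per-question metric scores (the function name and report shape are illustrative; the actual script delegates scoring to ragas):

```python
from statistics import mean


def ab_delta(baseline_scores: list[float], routed_scores: list[float]) -> dict:
    """Aggregate per-question metric scores for baseline vs routed retrieval
    and report the relative improvement, as in the results table below."""
    b, r = mean(baseline_scores), mean(routed_scores)
    return {
        "baseline": round(b, 3),
        "routing": round(r, 3),
        "delta_pct": round((r - b) / b * 100, 1),
    }
```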

test/benchmark/wixqa_eval.py (new)

  • WixQA-based benchmark using Wix/WixQA
    (200 expert-written + 200 simulated customer support Q&A pairs)
  • Two-KB routing strategy: questions routed to the KB owning their source articles
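The two-KB routing strategy reduces to a lookup with a global fallback. A hedged sketch (names are illustrative; the mapping from source article to KB is built by the benchmark setup):

```python
def route_to_kbs(source_article: str, article_to_kb: dict[str, str],
                 all_kbs: list[str]) -> list[str]:
    """Search only the KB that owns the question's source article; fall back
    to searching every KB (the baseline behavior) when the owner is unknown."""
    kb = article_to_kb.get(source_article)
    return [kb] if kb else list(all_kbs)
```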

Benchmark Results

Evaluated on 100 questions from Wix/WixQA
across 2 knowledge bases. Judged by gpt-4o via ragas.

| Metric | Baseline (Both KBs) | With Routing | Delta |
|---|---|---|---|
| Context Precision | 0.543 | 0.554 | +2.0% |
| Context Recall | 0.287 | 0.300 | +4.5% |
| Avg Latency (ms) | 403.1 | 366.2 | -9.2% |
| Errors | 0 | 0 | |

Latency percentiles:

| Stat | Baseline | Routing |
|---|---|---|
| p50 | 357.0 ms | 338.4 ms |
| p90 | 516.4 ms | 479.1 ms |
| p95 | 632.9 ms | 531.2 ms |
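For reference, percentiles like the ones above can be computed with the nearest-rank convention; this is one common approach, and the benchmark script may use a different interpolation:

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are <= it."""
    xs = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(xs)))
    return xs[k - 1]
```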

Routing consistently reduces noise (precision ↑), improves information coverage (recall ↑),
and cuts latency at every percentile by eliminating unnecessary KB searches.

Note: gains are intentionally conservative — the two WixQA KBs share the same domain
(Wix support articles), making routing harder than typical multi-domain deployments.
In heterogeneous KB setups (e.g., HR + Legal + Engineering), the precision delta is
expected to be significantly larger.

Test Plan

  • Start MCP Server in self-host mode, connect via MCP Inspector
  • Verify list_prompts returns ragflow_retrieval_skill with intent argument
  • Verify get_prompt returns a prompt containing current KB list
  • Verify intent=precise vs intent=broad produces different threshold guidance
  • Verify ragflow_list_datasets tool returns full metadata JSON array
  • Run benchmark: OPENAI_API_KEY=xxx uv run python test/benchmark/wixqa_eval.py --base-url http://127.0.0.1:9380 --api-key ragflow-xxx

…gent retrieval routing

Add MCP Prompts support to the RAGFlow MCP server, giving connected LLM
clients white-box visibility into the retrieval process via the standardized
list_prompts and get_prompt handlers.

Changes:
- mcp/server/server.py: implement list_prompts and get_prompt handlers
  exposing a ragflow_retrieval_skill prompt that assembles a 4-step SOP
  (dataset listing -> intent analysis -> retrieval -> answer synthesis)
  with dynamic dataset injection via ragflow_list_datasets tool call;
  add ragflow_list_datasets tool returning full KB metadata
- test/__init__.py: make test/ a proper Python package
- test/benchmark/mcp_skill_eval.py: A/B benchmark comparing global
  search (baseline) vs MCP-skill-guided routing using ragas metrics
- test/benchmark/wixqa_eval.py: WixQA-based benchmark (Wix/WixQA on
  HuggingFace, 100 questions across 2 KBs); routing improves context
  precision +2.0%, recall +4.5%, and reduces avg latency by 9.2%
- wixqa_benchmark_report.md: benchmark results report
@dosubot dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files), 🌈 python (Pull requests that update Python code), and 💞 feature (Feature request, pull request that fulfills a new feature) labels on Mar 6, 2026
Louisym added 3 commits March 10, 2026 18:34
…gent retrieval routing

- Implement list_prompts / get_prompt handlers exposing ragflow_retrieval_skill prompt
- Dynamically inject live KB snapshot, parameter tuning guide, routing rules, and self-healing rules
- Add ragflow_list_datasets tool for LLM-driven KB discovery
- Refactor list_datasets() with shared _fetch_datasets_raw() + _normalize_dataset_item() helpers
- Add ragas>=0.2.0 and openai>=1.0.0 to test dependency group
- Add test/benchmark/mcp_skill_eval.py: A/B benchmark (global search vs precise routing)
  evaluated on WixQA dataset with gpt-4o as ragas judge
