
Member

@cristiand391 cristiand391 commented Oct 2, 2025

What does this PR do?

Adds:

Discoverability tests

Adds a tool prediction scorer based on Sentry's, with some updates:

  • updated the system prompt to avoid over-indexing on eval expectations (this was causing too many false positives with score 1)
  • enforced that expected tools are available so this influences the final scoring (will be blocked in code later)
  • updated the scorer to send the full tool metadata (Sentry's sends only the tool name and the first description line), so scoring can take the available parameters and examples into account
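
To illustrate the last two points, here is a minimal sketch, not the code in this PR. It assumes a hypothetical `ToolMetadata` shape and hypothetical helper names (`findMissingTools`, `describeTools`, `assertExpectedToolsAvailable`). It shows one way to serialize full tool metadata for a judging model, and one way the "expected tools are available" rule could be blocked in code rather than left to the prompt.

```typescript
// Hypothetical shape: none of these names come from the PR.
interface ToolMetadata {
  name: string;
  description: string;
  // Parameter name -> human-readable description.
  parameters: Record<string, string>;
}

// Return the expected tool names that are absent from the available tools.
function findMissingTools(expected: string[], available: ToolMetadata[]): string[] {
  const names = new Set(available.map((tool) => tool.name));
  return expected.filter((name) => !names.has(name));
}

// Serialize the full metadata (not just the name and first description line)
// so a judging model can also weigh parameters.
function describeTools(tools: ToolMetadata[]): string {
  return tools
    .map((tool) => {
      const params = Object.entries(tool.parameters)
        .map(([name, desc]) => `  - ${name}: ${desc}`)
        .join("\n");
      return `${tool.name}\n${tool.description}\nParameters:\n${params}`;
    })
    .join("\n\n");
}

// Guard to run before any model-based scoring: a prediction cannot be judged
// fairly if an expected tool was never offered.
function assertExpectedToolsAvailable(expected: string[], available: ToolMetadata[]): void {
  const missing = findMissingTools(expected, available);
  if (missing.length > 0) {
    throw new Error(`Expected tools not available: ${missing.join(", ")}`);
  }
}
```

Failing fast on a missing tool keeps a misconfigured eval from being reported as a low prediction score.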

Test failure example:
[Screenshot 2025-10-02 at 17 25 46]

E2E eval tests

These tests run a full agent loop with the DX MCP tools. See TESTING.md and the tests under packages/mcp/test/evals/e2e/ for more examples.
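
For readers unfamiliar with the term, the sketch below shows what a "full agent loop" means in general: the model is asked for its next turn, any requested tool is executed, and the result is fed back until the model produces a final answer. It is a generic illustration, not the PR's test harness. `runAgentLoop`, `ModelTurn`, and the scripted model in the usage lines are all hypothetical, and a real E2E eval would call an actual LLM and the real MCP tools instead.

```typescript
// All names here are hypothetical stand-ins, not the PR's test helpers.
type ModelTurn =
  | { kind: "tool_call"; tool: string; args: Record<string, string> }
  | { kind: "final"; text: string };

type Tool = (args: Record<string, string>) => string;

interface LoopResult {
  answer: string;
  toolsCalled: string[];
}

// Ask the model for its next turn, run any requested tool, append the result
// to the history, and stop when the model gives a final answer.
function runAgentLoop(
  nextTurn: (history: string[]) => ModelTurn,
  tools: Record<string, Tool>,
  prompt: string,
  maxSteps = 5,
): LoopResult {
  const history: string[] = [prompt];
  const toolsCalled: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const turn = nextTurn(history);
    if (turn.kind === "final") {
      return { answer: turn.text, toolsCalled };
    }
    const tool = tools[turn.tool];
    if (tool === undefined) {
      throw new Error(`Unknown tool: ${turn.tool}`);
    }
    toolsCalled.push(turn.tool);
    history.push(`${turn.tool} -> ${tool(turn.args)}`);
  }
  throw new Error(`No final answer after ${maxSteps} steps`);
}

// Usage with a scripted stand-in for the model: one tool call, then an answer.
const scriptedModel = (history: string[]): ModelTurn =>
  history.length === 1
    ? { kind: "tool_call", tool: "echo_tool", args: { text: "hi" } }
    : { kind: "final", text: history[history.length - 1] ?? "" };

const loopResult = runAgentLoop(
  scriptedModel,
  { echo_tool: (args) => args.text ?? "" },
  "say hi",
);
```

An eval can then assert on both `loopResult.toolsCalled` and `loopResult.answer`, which is what separates this kind of test from one that only checks tool prediction.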

Others:

  • updated the GHA workflow to run only on Linux (E2E tool tests cover cross-OS compatibility; eval tests aren't OS-dependent)
  • split the eval tests into 2 types; each has its own vitest config and package.json script and runs in a separate CI job
  • set up the Gemini API key in the repo

What issues does this PR fix or reference?

@W-19811634@
@W-19812303@

@cristiand391 cristiand391 changed the title from "test: add tool predicion scorer for light evals" to "test: add tool prediction scorer for light evals" Oct 2, 2025
@cristiand391 cristiand391 marked this pull request as ready for review October 20, 2025 19:08
@cristiand391 cristiand391 changed the title from "test: add tool prediction scorer for light evals" to "test: add eval test setup W-19811634 W-19812303" Oct 20, 2025
@cristiand391 cristiand391 requested a review from a team as a code owner October 27, 2025 19:52
3 participants