test: add eval test setup W-19811634 W-19812303 #273

cristiand391 · 2025-10-02T20:22:49Z

What does this PR do?

Adds:

Discoverability tests

Adds a tool prediction scorer based on Sentry's, with some updates:

Updated the system prompt to avoid over-indexing on eval expectations (this was causing too many false positives with a score of 1).
Enforces that the expected tools are available, so that availability influences the final score (this will be blocked in code later).
The scorer now sends the full tool metadata (Sentry's sends only the tool name and the first line of the description), so scoring can take the available parameters and examples into account.
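To make the last two points concrete, here is a minimal sketch of how an availability check and a full-metadata judge prompt could look. It is not the code in this PR: `ToolMetadata`, `ExpectedToolCall`, `assertExpectedToolsAvailable`, `buildJudgePrompt`, and the prompt wording are all hypothetical names and text chosen for illustration.

```typescript
// Hypothetical shapes for illustration; not the PR's actual types.
interface ToolMetadata {
  name: string;
  description: string;
  // JSON-schema-like parameter description, passed to the judge in full.
  inputSchema: Record<string, unknown>;
}

interface ExpectedToolCall {
  name: string;
  arguments?: Record<string, unknown>;
}

// Fails fast when an eval case expects a tool that is not in the available set,
// instead of leaving that check to the judge model.
function assertExpectedToolsAvailable(
  expected: ExpectedToolCall[],
  available: ToolMetadata[],
): void {
  const names = new Set(available.map((t) => t.name));
  const missing = expected.filter((e) => !names.has(e.name)).map((e) => e.name);
  if (missing.length > 0) {
    throw new Error(`Expected tools not available: ${missing.join(', ')}`);
  }
}

// Builds the judge prompt from the full metadata of every tool
// (name, whole description, parameters) rather than name + first description line.
function buildJudgePrompt(task: string, available: ToolMetadata[]): string {
  const toolList = available
    .map(
      (t) =>
        `- ${t.name}\n  description: ${t.description}\n  parameters: ${JSON.stringify(t.inputSchema)}`,
    )
    .join('\n');
  return [
    'Predict which tools an assistant would call for the task below.',
    'Judge only from the tool metadata provided.',
    '',
    `Task: ${task}`,
    '',
    'Available tools:',
    toolList,
  ].join('\n');
}
```

In a scorer built this way, `assertExpectedToolsAvailable` would run before the judge model is called, so a misconfigured eval case fails loudly rather than producing a misleading score.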

test failure example:

E2E eval tests

These run a full agent loop with the DX MCP tools. See TESTING.md for more examples, and the tests under packages/mcp/test/evals/e2e/.
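For readers unfamiliar with the term, the following is a rough sketch of what a "full agent loop" means: the model is called repeatedly, each requested tool is executed, and the result is fed back until the model produces a final answer. It does not reflect the actual test harness in this PR. The model is a stub so the sketch runs offline, and `runAgentLoop`, `Model`, `ModelTurn`, `ToolCall`, and `ToolHandler` are hypothetical names.

```typescript
// All names are hypothetical; a real eval would call an LLM and the MCP server here.
type ToolCall = { tool: string; args: Record<string, unknown> };
type ModelTurn =
  | { kind: 'tool_call'; call: ToolCall }
  | { kind: 'final'; text: string };
type Model = (transcript: string[]) => Promise<ModelTurn>;
type ToolHandler = (args: Record<string, unknown>) => Promise<string>;

// Runs the loop until the model returns a final answer or maxSteps is reached.
// Returns the answer plus every tool call made, so a test can assert on both.
async function runAgentLoop(
  model: Model,
  tools: Record<string, ToolHandler>,
  prompt: string,
  maxSteps = 5,
): Promise<{ answer: string; calls: ToolCall[] }> {
  const transcript = [`user: ${prompt}`];
  const calls: ToolCall[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const turn = await model(transcript);
    if (turn.kind === 'final') {
      return { answer: turn.text, calls };
    }
    const handler = tools[turn.call.tool];
    if (!handler) {
      throw new Error(`Model requested unknown tool: ${turn.call.tool}`);
    }
    calls.push(turn.call);
    // Feed the tool result back so the next model turn can use it.
    const result = await handler(turn.call.args);
    transcript.push(`tool(${turn.call.tool}): ${result}`);
  }
  throw new Error(`No final answer after ${maxSteps} steps`);
}
```

An E2E eval in this style would then assert on `calls` (was the right tool invoked, with sensible arguments?) and on `answer`, which is what distinguishes it from the discoverability tests that only predict tool choice.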

Others:

Updated the GHA workflow to run only on Linux (the E2E tool tests cover cross-OS compatibility; eval tests aren't OS-dependent).
Split the eval tests into 2 types; each has its own vitest config and package.json script, and runs in a separate CI job.
Set up the Gemini API key in the repo.

What issues does this PR fix or reference?

@W-19811634@
@W-19812303@


jfeingold35 and others added 6 commits September 26, 2025 09:45

@W-18964528@ Ported (non-working) code from old branch

a8ddbe5

@W-18964528@ Updated dependency

f853668

@W-18964528@ Implemented speculative E2E test for describe_code_analyzer_rule

cc4549c

@W-18964528@ Added test coverage for run_code_analyzer

1ab1873

@W-18964528@ Added GHA for yarn:eval script

894aed7

test: add tool predicion scorer for light evals

85825a0

cristiand391 changed the title ~~test: add tool predicion scorer for light evals~~ test: add tool prediction scorer for light evals Oct 2, 2025

jfeingold35 force-pushed the jf/W-18964528-3 branch from 894aed7 to 81e0bf8 Compare October 8, 2025 17:15

cristiand391 and others added 6 commits October 10, 2025 09:08

chore: refactor + split eval tests

172c5c3

Merge remote-tracking branch 'origin/jf/W-18964528-3' into cd/light-evals

60df300

add TESTING.md

3e53cf7

[skip ci]

chore: use mcp binf

2cfddac

chore: update workflow + updates

0a4502c

update TESTING.md

ec6b95a

[skip ci]

cristiand391 marked this pull request as ready for review October 20, 2025 19:08

cristiand391 changed the title ~~test: add tool prediction scorer for light evals~~ test: add eval test setup W-19811634 W-19812303 Oct 20, 2025

cristiand391 added 3 commits October 20, 2025 16:20

chore: remove unused files

4a4db8c

[skip ci]

chore: update CI workflow + tool pred. scorer loads all tools

7230418

[skip ci]

Merge remote-tracking branch 'origin/jf/W-18964528-3' into cd/light-evals

7a39790

cristiand391 requested a review from a team as a code owner October 27, 2025 19:52


3 participants
