[8.13] [Obs AI Assistant] Update evaluation framework (elastic#176914) (elastic#177441)

# Backport

This will backport the following commits from `main` to `8.13`:
- [[Obs AI Assistant] Update evaluation framework (elastic#176914)](elastic#176914)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

### Source commit

- Author: Dario Gieselaar <[email protected]>
- SHA: ebb2c9d083bf2fe80923ca4fb191d4bf61e9b1eb
- Committed: 2024-02-21T13:32:42Z
- Source pull request: [[Obs AI Assistant] Update evaluation framework](https://github.com/elastic/kibana/pull/176914) (number 176914)
- Labels: `release_note:skip`, `Team:obs-ux-infra_services`, `v8.13.0`, `v8.14.0`
- Source branch: `main` (label `v8.14.0`, state `MERGED`)
- Suggested target branch: `8.13` (label `v8.13.0`, state `NOT_CREATED`)

### Source commit message

[Obs AI Assistant] Update evaluation framework (elastic#176914)

The following changes were made to the evaluation framework:
- Add support for screen context in the evaluation framework.
- Remove the checks against the LLM using mathematical operations in STATS aggregations. This is now supported by ES|QL.
- Allow specifying a different connector for evaluation (e.g. run with Claude, evaluate with GPT-4).

The `query` functions were improved as well:
- For the `visualize_query` function, store `userOverrides` in the function response as data, rather than in the function request. `userOverrides` is a big chunk of data and I think it makes more sense to hide it from the LLM (ideally we'd just have the changed property but that's probably hard).
- Use `execute_query` instead of `visualize_query` if the user just wants to see results, and not visualize the data.
- Add the ES|QL instructions as a user message, instead of a system message, to get the LLM to pay more attention to them in relation to the user message.
- Make sure `query` is also used for converting queries.
- Fix a bug where, when multiple visualizations were available in the conversation, editing any of them always resulted in the first visualization being updated.
- Store `columns` in `data` rather than `content` to prevent it being sent over to the LLM.

One bugfix in the Bedrock/Claude adapter:
- Catch, parse and throw errors in the Bedrock stream (which come through as an object, not an error).

Some APM changes:
- Remove the APM-specific addition to the system message to have more consistent performance. (I've not seen evidence of a degradation in performance when calling APM-specific functions but would like a second opinion).
- Add ES|QL queries to screen context for APM. This allows the Assistant to generate e.g. breakdowns of data that is on the page.
- Add a `variance` scenario that generates data with some variations according to a seasonal pattern. This is to get more realistic charts.

---------

Co-authored-by: almudenasanz <[email protected]>
Co-authored-by: Kibana Machine <[email protected]>

Co-authored-by: Dario Gieselaar <[email protected]>
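
To illustrate the Bedrock/Claude adapter bugfix mentioned in the commit message (errors arriving in the stream as plain objects rather than thrown errors), here is a minimal sketch. It is not the code from this commit: the chunk shapes and the names `CompletionChunk`, `ErrorChunk`, `isErrorChunk`, `parseChunk`, and `collectCompletions` are all hypothetical, and the real Bedrock payloads may look different.

```typescript
// Hypothetical chunk shapes, for illustration only.
interface CompletionChunk {
  completion: string;
}

interface ErrorChunk {
  message: string;
}

type StreamChunk = CompletionChunk | ErrorChunk;

// Assumption: an error payload is an object with a string `message`
// and no `completion` field.
function isErrorChunk(chunk: StreamChunk): chunk is ErrorChunk {
  return !('completion' in chunk) && typeof (chunk as ErrorChunk).message === 'string';
}

// Parse one serialized chunk; turn an error-shaped object into a thrown Error
// so callers do not silently treat it as a normal completion.
function parseChunk(raw: string): CompletionChunk {
  const parsed = JSON.parse(raw) as StreamChunk;
  if (isErrorChunk(parsed)) {
    throw new Error(parsed.message);
  }
  return parsed;
}

// Concatenate the completion text of all chunks, failing on the first error chunk.
function collectCompletions(rawChunks: string[]): string {
  return rawChunks.map((raw) => parseChunk(raw).completion).join('');
}
```

The point of the sketch is only the control flow: the error object is detected after parsing and rethrown as an `Error`, so that downstream consumers of the stream see a failure instead of an empty or malformed completion.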