[QA] Score tests improvements #935
base: main
Conversation
for more information, see https://pre-commit.ci
…y/vizro into score_tests_improvements
I like these enhancements, Alexey. There are a few comments, but other than that, it's all good.
Wow, I really like this whole score tests logic you created! 💯 I think it's ready to be checked in while we touch the vizro-ai code, and it can help track the performance more easily.
I just have some minor questions and suggestions. Overall it's really cool!
…ests_improvements
for more information, see https://pre-commit.ci
…y/vizro into score_tests_improvements
Huge thanks for all of your comments!
# temporary for development
pull_request:
  branches:
    - main
Don't forget to remove this before merging.
@@ -38,11 +40,18 @@ def setup_test_environment():
    chromedriver_autoinstaller.install()


# If len() is 0, it means that nothing was entered for this score in config,
# in this case in should be 1.0.
- # in this case in should be 1.0.
+ # in this case it should be 1.0.
def score_calculator(score_name):
    return statistics.mean(score_name) if len(score_name) != 0 else 1.0
A few questions here:

- Intuitively, the `score_name` variable name looks more like a string. Do you think that renaming it to something like `def score_calculator(metric_scores: list[int]):` makes more sense?
- Also, is `score_name` always a list of ints?
- Why do you assign `1.0` if `len(score_name) != 0`?
- Why is the float score `1.0` mixed with integers in this file? Does it make sense to align it?
- Yes, thanks for the suggestion.
- Yes.
- I've tried to explain it in the comment to the function: "If len() is 0, it means that nothing was entered for this score in config, in this case it should be 1.0." E.g. if a page doesn't have any `controls`, the calculation of `controls_types` will return an empty `[]`, and in this case we need to replace it with a score value. If it were 0, it would affect the total score and make it lower. By setting the value to 1.0 we assume that everything is correct.
- Makes sense, I will use `int` everywhere (a sketch of the updated helper is below).
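
For reference, a minimal sketch of what the renamed helper could look like after these changes (the `metric_scores` name follows the suggestion above; the empty-list default of 1.0 is the behaviour described in the function comment):

```python
import statistics


def score_calculator(metric_scores: list[int]) -> float:
    # An empty list means nothing was configured for this metric, so we
    # default to 1.0 instead of 0 to avoid dragging the total score down.
    return statistics.mean(metric_scores) if metric_scores else 1.0
```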
Great, thanks for addressing this and for the explanation!
Great improvement! 🎉
Did a light review, looks good! I think this is going in the right direction! Left a few small comments. ⭐ 💯
In general I think my main comments are:

- Make adding tests as easy as possible (ideally just another prompt + expectation pair, nothing else).
- Make adding and removing models etc. as easy as possible (ideally just some test parametrization or so).
- Make the complexity of the dashboard a column in the report, and allow individual names for newly added tests (so that in the future we can have something like `easy_1`, `easy_2`, `easy_abc`, etc.).

Other than that, I think it's exciting!
@@ -51,7 +51,7 @@ prep-release = [
pypath = "hatch run python -c 'import sys; print(sys.executable)'"
test = "pytest tests {args}"
test-integration = "pytest -vs --reruns 1 tests/integration --headless {args}"
-test-score = "pytest -vs --reruns 1 tests/score --headless {args}"
+test-score = "pytest -vs tests/score --headless {args}"
Do you think it would make sense to add a comment above on where to enter the API key? When I run the above command, all tests fail, but only because the API key does not work.
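
One way to make this failure mode clearer would be a fail-fast check in `conftest.py`. A rough sketch, assuming the tests read the key from an `OPENAI_API_KEY` environment variable (the actual variable name used by the score tests may differ):

```python
import os

import pytest


@pytest.fixture(scope="session", autouse=True)
def require_api_key():
    # Abort the whole session with a clear message instead of letting every
    # score test fail with an opaque authentication error.
    if not os.environ.get("OPENAI_API_KEY"):
        pytest.exit("Set OPENAI_API_KEY before running `hatch run test-score`.", returncode=1)
```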
complex_prompt = """
<Page 1>
Show me 1 table on the first page that shows tips and sorted by day
Using export button I want to export data to csv
Is that even possible @lingyielia? I am not sure the JSON schema for the button actually includes possible custom actions.
["gpt-4o-mini"], | ||
ids=["gpt-4o-mini"], | ||
[ | ||
"gpt-4o-mini", |
What about gpt-4o (not mini)?
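
If the full model is added, the parametrization could simply grow the list. An illustrative sketch (the exact model identifiers available in CI are an assumption):

```python
import pytest

MODEL_NAMES = ["gpt-4o-mini", "gpt-4o"]


@pytest.mark.parametrize("model_name", MODEL_NAMES, ids=MODEL_NAMES)
def test_easy_dashboard(model_name):
    # Placeholder body: the real test would build and score a dashboard
    # generated with `model_name`.
    assert model_name in MODEL_NAMES
```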
@pytest.mark.medium_dashboard
@pytest.mark.parametrize("model_name", ["gpt-4o-mini"], ids=["gpt-4o-mini"])
def test_medium_dashboard(dash_duo, model_name):
Why do we need individual tests for easy, medium, and complex dashboards? Should this not be another parameter to a single test?
In general, as I have mentioned before, I think it would be good not to have three specific dashboards, but rather any number of dashboards that belong to a tier (three tiers are fine). In the future we could then easily add more by simply adding a new prompt + expectation pair; that should be the aim (see the sketch below).
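
A rough sketch of the single-test shape this suggests, where each dashboard case is just a tier + prompt + expectation entry; the prompts, expectations, and names here are illustrative placeholders rather than the PR's actual code:

```python
import pytest

DASHBOARD_CASES = [
    # Adding a new test would mean adding one more prompt + expectation pair here.
    pytest.param("easy", "Show me one page with a tips table", 1, id="easy_1"),
    pytest.param("medium", "Two pages: a tips table and a bar chart with a filter", 2, id="medium_1"),
]


@pytest.mark.parametrize("tier, prompt, expected_pages", DASHBOARD_CASES)
def test_dashboard_score(tier, prompt, expected_pages):
    # Placeholder assertion: the real test would generate the dashboard from
    # `prompt`, score it, and compare against the expectation for this tier.
    assert expected_pages >= 1
```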
…ests_improvements
Description

- numpy lib
- Reference to potential complexity prompts improvements -> #935 (comment)