
[QA] Score tests improvements #935

Open · wants to merge 17 commits into main

Conversation

@l0uden (Contributor) commented Dec 24, 2024

Description

  • Rewrote the score calculation using the numpy library (see the sketch below)
  • Added the prompt text to the report
  • Added the Anthropic provider for the easy dashboard creation (it failed to build the medium and complex dashboards)
  • Added a complex prompt so that the score is less than 1.0

Reference to potential improvements of the complexity prompts -> #935 (comment)
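
A minimal sketch of how a numpy-based score aggregation can work; the function, metric names, and weighting scheme below are illustrative assumptions, not the exact code in this PR:

```python
import numpy as np

# Illustrative only: metric names and weights are assumptions, not the PR's configuration.
def total_score(metric_scores: dict[str, list[float]], weights: dict[str, float]) -> float:
    """Aggregate per-metric scores into a weighted total with numpy."""
    per_metric = {
        # An empty list means the metric was not configured for this dashboard,
        # so it is treated as fully correct (1.0) rather than dragging the total down.
        name: float(np.mean(scores)) if scores else 1.0
        for name, scores in metric_scores.items()
    }
    return float(np.average([per_metric[name] for name in weights], weights=list(weights.values())))
```

For example, `total_score({"chart_types": [1, 0, 1], "controls": []}, {"chart_types": 0.5, "controls": 0.5})` gives (0.67 + 1.0) / 2 ≈ 0.83.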

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

    • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.
    • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorized to submit this contribution on behalf of the original creator(s) or their licensees.
    • I certify that the use of this contribution as authorized by the Apache 2.0 license does not violate the intellectual property rights of anyone else.
    • I have not referenced individuals, products or companies in any commits, directly or indirectly.
    • I have not added data or restricted code in any commits, directly or indirectly.

@github-actions github-actions bot added the Vizro-AI 🤖 Issue/PR that addresses Vizro-AI package label Dec 24, 2024
@petar-qb (Contributor) left a comment

I like these enhancements, Alexey. There are a few comments, but other than that it's all good.

@lingyielia (Contributor) left a comment

Wow, I really like this whole score tests logic you created! 💯 I think it's ready to be used as a check whenever we touch the vizro-ai code, and it can help track performance more easily.

I just have some minor questions and suggestions. Overall it's really cool!

@l0uden (Contributor, Author) commented Jan 16, 2025

Huge thanks for all of your comments!
It is ready for review again

Comment on lines +11 to +14:

    # temporary for development
    pull_request:
      branches:
        - main

Don't forget to remove this before merging.

@@ -38,11 +40,18 @@ def setup_test_environment():
chromedriver_autoinstaller.install()


# If len() is 0, it means that nothing was entered for this score in config,
# in this case in should be 1.0.

Suggested change:

    - # in this case in should be 1.0.
    + # in this case it should be 1.0.

Comment on lines 45 to 47:

    def score_calculator(score_name):
        return statistics.mean(score_name) if len(score_name) != 0 else 1.0


A few questions here:

  1. Intuitively, the score_name variable name looks more like a string. Do you think that renaming it to something like
     def score_calculator(metric_scores: list[int]): makes more sense?

  2. Also, is score_name always a list of ints?

  3. Why do you assign 1.0 if len(score_name) == 0?

  4. Why is the float score 1.0 mixed with the integer 1 in this file? Does it make sense to align them?

l0uden (Author) replied:

  1. Yes, thanks for the suggestion.

  2. Yes.

  3. I've tried to explain it in the comment above the function: "If len() is 0, it means that nothing was entered for this score in config, in this case it should be 1.0." E.g. if a page doesn't have any controls, the calculation of controls_types will return an empty [], and in this case we need to replace it with a score value. If it were 0, it would affect the total score and make it lower. By setting the value to 1.0 we assume that everything is correct (see the sketch below).

  4. Makes sense, I will use int everywhere.
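
For reference, a small sketch of the behaviour described in point 3, using the renamed parameter from point 1 (whether the final code keeps statistics or switches to numpy is not shown here):

```python
import statistics

def score_calculator(metric_scores: list[int]) -> float:
    # An empty list means nothing was entered for this score in the config
    # (e.g. a page with no controls), so it counts as fully correct.
    return statistics.mean(metric_scores) if metric_scores else 1.0
```

With this, `score_calculator([])` returns 1.0, while `score_calculator([1, 0, 1])` returns roughly 0.67.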

The reviewer replied:

Great, thanks for addressing these and for the explanation!

@lingyielia lingyielia self-requested a review January 17, 2025 18:07
@lingyielia (Contributor) left a comment

Great improvement! 🎉

@maxschulz-COL (Contributor) left a comment

Did a light review, looks good! I think this is going in the right direction! Left a few small comments. ⭐ 💯

In general I think my main comments are:

  1. Make adding tests as easy as possible (ideally just another prompt + expectation pair, nothing else).
  2. Make adding and removing models etc. as easy as possible (ideally just some test parametrization or so).
  3. Make the complexity of the dashboard a column in the report, and allow individual names for newly added tests, so that in the future we can have something like easy_1, easy_2, easy_abc, etc. (see the sketch below).

Other than that, I think it's exciting!
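
A hedged sketch of what comments 1 and 3 could look like in practice: each test is just a named prompt + expectation pair under a complexity tier. The class, field names, and example cases below are illustrative, not code from this PR:

```python
from dataclasses import dataclass

@dataclass
class DashboardCase:
    name: str         # e.g. "easy_1", "easy_abc"; shows up in the report
    complexity: str   # "easy" | "medium" | "complex"; could become a report column
    prompt: str
    expected: dict    # expectation used for scoring

# Adding a new test would then just mean appending another entry here.
CASES = [
    DashboardCase("easy_1", "easy", "One page with a scatter chart of tips.", {"pages": 1, "charts": 1}),
    DashboardCase("medium_1", "medium", "Two pages, one filter per page, one table.", {"pages": 2, "controls": 2}),
]
```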

@@ -51,7 +51,7 @@ prep-release = [
 pypath = "hatch run python -c 'import sys; print(sys.executable)'"
 test = "pytest tests {args}"
 test-integration = "pytest -vs --reruns 1 tests/integration --headless {args}"
-test-score = "pytest -vs --reruns 1 tests/score --headless {args}"
+test-score = "pytest -vs tests/score --headless {args}"

Do you think it would make sense to add a comment above on where to enter the API key? When I run the above command, all tests fail, but only because the API key does not work.
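
For illustration, a small guard like the following could make that failure mode obvious up front; the fixture and the OPENAI_API_KEY variable name are assumptions, not necessarily what the suite actually reads:

```python
import os

import pytest


# Hypothetical session fixture: fail fast with a clear message instead of letting
# every score test error out on authentication.
@pytest.fixture(autouse=True, scope="session")
def _require_api_key():
    if not os.environ.get("OPENAI_API_KEY"):  # assumed variable name
        pytest.exit("Set OPENAI_API_KEY before running `hatch run test-score`.")
```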

    complex_prompt = """
    <Page 1>
    Show me 1 table on the first page that shows tips and sorted by day
    Using export button I want to export data to csv

Is that even possible, @lingyielia? I am not sure the JSON schema for the button actually includes possible custom actions.

["gpt-4o-mini"],
ids=["gpt-4o-mini"],
[
"gpt-4o-mini",

What about gpt-4o (not mini)?


    @pytest.mark.medium_dashboard
    @pytest.mark.parametrize("model_name", ["gpt-4o-mini"], ids=["gpt-4o-mini"])
    def test_medium_dashboard(dash_duo, model_name):

Why do we need individual tests for easy, medium and complex dashboards? Should this not be another parameter to a single test?

In general, I have mentioned this before: I think it would be good not to have three specific dashboards, but rather any number of dashboards that belong to a tier (three tiers are fine). In the future we could then easily add more by simply adding a new prompt + expectation pair; that should be the aim (see the sketch below).
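
A rough sketch of what a single parametrized test could look like, assuming a case list along the lines of the registry sketched earlier (all names, prompts, and expectations here are placeholders):

```python
import pytest

# Illustrative case list; in practice it could be built from a registry like the
# DashboardCase sketch above.
DASHBOARD_CASES = [
    pytest.param("easy", "easy prompt ...", {"pages": 1}, id="easy_1"),
    pytest.param("medium", "medium prompt ...", {"pages": 2}, id="medium_1"),
    pytest.param("complex", "complex prompt ...", {"pages": 3}, id="complex_1"),
]


@pytest.mark.parametrize("model_name", ["gpt-4o-mini"], ids=["gpt-4o-mini"])
@pytest.mark.parametrize("complexity, prompt, expected", DASHBOARD_CASES)
def test_dashboard(dash_duo, model_name, complexity, prompt, expected):
    # One test body for all tiers; `complexity` is carried along so it can be
    # written to the score report as its own column.
    ...
```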
