[QA] Score tests improvements #935
base: main
Conversation
for more information, see https://pre-commit.ci
…y/vizro into score_tests_improvements
I like these enhancements, Alexey. There are a few comments, but other than that, it's all good.
Wow, I really like this whole score tests logic you created! 💯 I think it's ready to be checked in while we touch the vizro-ai code, and it can help track the performance more easily.
I just have some minor questions and suggestions. Overall it's really cool!
…ests_improvements
for more information, see https://pre-commit.ci
…y/vizro into score_tests_improvements
Huge thanks for all of your comments!
# temporary for development
pull_request:
  branches:
    - main
Don't forget to remove this before merging.
@@ -38,11 +40,18 @@ def setup_test_environment():
    chromedriver_autoinstaller.install()


# If len() is 0, it means that nothing was entered for this score in config,
# in this case in should be 1.0.
- # in this case in should be 1.0.
+ # in this case it should be 1.0.
def score_calculator(score_name):
    return statistics.mean(score_name) if len(score_name) != 0 else 1.0
A few questions here:

- Intuitively, the `score_name` variable name looks more like a string. Do you think that renaming it to something like `def score_calculator(metric_scores: list[int]):` makes more sense?
- Also, is `score_name` always a list of ints?
- Why do you assign `1.0` if `len(score_name) != 0`?
- Why is the float score `1.0` mixed with integers in this file? Does it make sense to align it?
- Yes, thanks for the suggestion.
- Yes.
- I've tried to explain it in the comment to the function: "If len() is 0, it means that nothing was entered for this score in config, in this case it should be 1.0." E.g. if a page doesn't have any `controls`, the calculation of `controls_types` will return an empty `[]`, and in this case we need to replace it with a score value. If it were 0, it would affect the total score and make it lower. By setting the value to 1.0 we assume that everything is correct.
- Makes sense, I will use `int` everywhere (a sketch of the updated helper is below).
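
For reference, a minimal sketch of what the renamed helper could look like after these changes (the `metric_scores` name follows the suggestion above; the empty-list default of 1.0 is the behaviour described in the function comment):

```python
import statistics


def score_calculator(metric_scores: list[int]) -> float:
    # An empty list means nothing was configured for this metric, so we
    # default to 1.0 instead of 0 to avoid dragging the total score down.
    return statistics.mean(metric_scores) if metric_scores else 1.0
```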
Great, thanks for addressing this and for the explanation!
Great improvement! 🎉
Did a light review, looks good! I think this is going in the right direction! Left a few small comments. ⭐ 💯
In general I think my main comments are:

- Make adding tests as easy as possible (ideally just another prompt + expectation pair, nothing else).
- Make adding and removing models etc. as easy as possible (ideally just some test parametrization or so).
- Make the complexity of the dashboard a column in the report, and allow individual names for newly added tests (so that in the future we can have something like `easy_1`, `easy_2`, `easy_abc`, etc.).

Other than that, I think it's exciting!
@@ -51,7 +51,7 @@ prep-release = [
pypath = "hatch run python -c 'import sys; print(sys.executable)'"
test = "pytest tests {args}"
test-integration = "pytest -vs --reruns 1 tests/integration --headless {args}"
-test-score = "pytest -vs --reruns 1 tests/score --headless {args}"
+test-score = "pytest -vs tests/score --headless {args}"
Do you think it would make sense to add a comment above on where to enter the API key? When I run the above command, all tests fail, but only because the API key does not work.
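
One way to make this failure mode clearer would be a fail-fast check in `conftest.py`. A rough sketch, assuming the tests read the key from an `OPENAI_API_KEY` environment variable (the actual variable name used by the score tests may differ):

```python
import os

import pytest


@pytest.fixture(scope="session", autouse=True)
def require_api_key():
    # Abort the whole session with a clear message instead of letting every
    # score test fail with an opaque authentication error.
    if not os.environ.get("OPENAI_API_KEY"):
        pytest.exit("Set OPENAI_API_KEY before running `hatch run test-score`.", returncode=1)
```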
complex_prompt = """
<Page 1>
Show me 1 table on the first page that shows tips and sorted by day
Using export button I want to export data to csv
Is that even possible @lingyielia? I am not sure the JSON schema for the button actually includes possible custom actions.
["gpt-4o-mini"], | ||
ids=["gpt-4o-mini"], | ||
[ | ||
"gpt-4o-mini", |
What about gpt-4o (not mini)?
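
If the full model is added, the parametrization could simply grow the list. An illustrative sketch (the exact model identifiers available in CI are an assumption):

```python
import pytest

MODEL_NAMES = ["gpt-4o-mini", "gpt-4o"]


@pytest.mark.parametrize("model_name", MODEL_NAMES, ids=MODEL_NAMES)
def test_easy_dashboard(model_name):
    # Placeholder body: the real test would build and score a dashboard
    # generated with `model_name`.
    assert model_name in MODEL_NAMES
```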
@pytest.mark.medium_dashboard
@pytest.mark.parametrize("model_name", ["gpt-4o-mini"], ids=["gpt-4o-mini"])
def test_medium_dashboard(dash_duo, model_name):
Why do we need individual tests for easy, medium, and complex dashboards? Should this not be another parameter to a single test?
In general, as I have mentioned before, I think it would be good not to have three specific dashboards, but rather any number of dashboards that belong to a tier (three tiers are fine). In the future we could then easily add more by simply adding a new prompt + expectation pair; that should be the aim (see the sketch below).
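
A rough sketch of the single-test shape this suggests, where each dashboard case is just a tier + prompt + expectation entry; the prompts, expectations, and names here are illustrative placeholders rather than the PR's actual code:

```python
import pytest

DASHBOARD_CASES = [
    # Adding a new test would mean adding one more prompt + expectation pair here.
    pytest.param("easy", "Show me one page with a tips table", 1, id="easy_1"),
    pytest.param("medium", "Two pages: a tips table and a bar chart with a filter", 2, id="medium_1"),
]


@pytest.mark.parametrize("tier, prompt, expected_pages", DASHBOARD_CASES)
def test_dashboard_score(tier, prompt, expected_pages):
    # Placeholder assertion: the real test would generate the dashboard from
    # `prompt`, score it, and compare against the expectation for this tier.
    assert expected_pages >= 1
```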
…ests_improvements
Description

- numpy lib
- Reference to potential complexity prompts improvements -> #935 (comment)