Summarize test run rule evaluations in a separate table. #868
base: feat/worker-index-creation
Conversation
Force-pushed from e58f76f to 8b64106.
All clear and cool.
The only thing I have to say is: if we're doing this, we might as well complete it and also do the pre-accounting of:
- sanction check rule execution stats (these go in the same stats table as the other rules; see my inline comment about the stable rule id)
- decisions-by-status stats (this one is less painful than the decisions/rules stats, but it's still liable to get significantly slow at high decision volumes)
WDYT?
func (w *TestRunSummaryWorker) Work(ctx context.Context, job *river.Job[models.TestRunSummaryArgs]) error {
Do we also want to define a timeout on the job? I'd say something in the ballpark of 20 seconds to a minute would be reasonable.
What would the success condition be for the next run once we hit the timeout?
Except for transient errors, it is fair to say that a run hitting the timeout will never succeed, since the next run would have even more data to process, and so on.
That raises a question this PR has not addressed yet: monitoring. How do we raise alerts if a test run summary lags too far behind?
Good question.
By default, what would happen now is: if the job consistently fails, we would get notified on Sentry (at least I think so; it's worth double-checking whether a timeout specifically triggers a Sentry alarm in the job logger middleware).
Your remark about the "doom loop", where one failure means subsequent runs are also expected to fail, is valid, and a good reason to have a longish timeout: at least something longer than a duration that seems realistic to wait for (a few minutes may do the job).
That being said, I still think it's good to set the timeout explicitly: not doing so implicitly inherits the default job timeout, which in this case is not set at the client level.
I set a 2-minute timeout for now. All timings will have to be defined, though (intervals and timeouts).
models/scenario_test_run_summary.go
Outdated
type ScenarioTestRunSummary struct {
	Id           string
	RuleName     string
	RuleStableId string
A warning here about the stable id if we also do stats for sanction check rules: there is a bit of a dirty hack in the TestRunStatsByRuleExecution usecase method, which computes a fake (UUID) "stable rule id" on the fly for sanction checks in the live/test version. This is fine as long as we only compute the stats in real time, but we need to add a proper stable rule id on the sanction check if we also want to precompute them.
Totally agree, let's talk about it today. I'll start looking at how it works.
I implemented the decision stats in the latest commit. I will have a look at sanctions stats later today.
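To make the "fake stable rule id" concern concrete, here is a hypothetical sketch of how an on-the-fly placeholder id could be derived deterministically, so that live and test versions at least agree with each other. The fakeStableRuleId helper, its namespace string, and the hashing scheme are all assumptions for illustration; the real usecase method may build its placeholder differently, and precomputation would still need a real persisted stable id.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// fakeStableRuleId derives a deterministic, UUID-shaped identifier for a
// sanction check from its scenario iteration id. Hypothetical helper, not
// the repo's actual implementation.
func fakeStableRuleId(scenarioIterationId string) string {
	sum := sha256.Sum256([]byte("sanction_check:" + scenarioIterationId))
	// Format the first 16 bytes of the hash in the 8-4-4-4-12 UUID layout.
	return fmt.Sprintf("%x-%x-%x-%x-%x", sum[0:4], sum[4:6], sum[6:8], sum[8:10], sum[10:16])
}

func main() {
	// The same iteration always yields the same id across live and test runs.
	fmt.Println(fakeStableRuleId("iter-123") == fakeStableRuleId("iter-123"))
}
```

The point of the deterministic derivation is that repeated real-time computations line up; the limitation, as noted above, is that nothing ties this id to a stored row, which is what precomputed summaries would require.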
Force-pushed from 614b846 to afc4725.
Force-pushed from abb4456 to 6572443.
Force-pushed from afc4725 to b140861.
This is a prototype that uses the worker setup from #855 to periodically save test run rule execution statistics (for both the live and phantom versions) into a summary table, for faster querying.
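The pre-aggregation the PR describes can be sketched as grouping raw rule executions by (version, stable rule id, outcome) and counting, which is the shape a summary-table row would hold. The RuleExecution struct, field names, and outcome values below are hypothetical stand-ins for the repo's actual models:

```go
package main

import "fmt"

// RuleExecution is a hypothetical stand-in for one rule evaluation record.
type RuleExecution struct {
	Version      string // "live" or "phantom"
	RuleStableId string
	Outcome      string // e.g. "hit", "no_hit", "error"
}

type summaryKey struct{ Version, RuleStableId, Outcome string }

// summarize pre-aggregates raw rule executions into per-key counts,
// the work a periodic summary job would persist instead of recomputing
// the aggregation at query time.
func summarize(execs []RuleExecution) map[summaryKey]int {
	out := make(map[summaryKey]int)
	for _, e := range execs {
		out[summaryKey{e.Version, e.RuleStableId, e.Outcome}]++
	}
	return out
}

func main() {
	execs := []RuleExecution{
		{"live", "r1", "hit"},
		{"live", "r1", "hit"},
		{"phantom", "r1", "no_hit"},
	}
	fmt.Println(summarize(execs)[summaryKey{"live", "r1", "hit"}]) // → 2
}
```

In practice the real job would run this aggregation in SQL over the decisions tables and upsert the counts, but the grouping key is the same idea.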