Summarize test run rule evaluations in a separate table. #868

Draft
apognu wants to merge 4 commits into feat/worker-index-creation from feat/summarize-test-run-statistics

Conversation

Contributor

@apognu apognu commented Feb 21, 2025

This is a prototype that uses the worker setup from #855 to periodically save test run rule execution statistics (for both the live and phantom versions) into a summary table, for faster querying.
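
For illustration, here is the rough shape such a worker could take with river. This is a minimal sketch, not the final code: the args and worker names mirror the signatures in the diff, while the SummaryWriter interface, the job kind, the interval value and the periodic registration are placeholders.

// Minimal sketch, assuming river's worker and periodic-job APIs; SummaryWriter,
// the job kind and the interval below are illustrative placeholders.
package worker

import (
	"context"
	"time"

	"github.com/riverqueue/river"
)

type TestRunSummaryArgs struct{}

// Kind identifies the job type in river's jobs table.
func (TestRunSummaryArgs) Kind() string { return "test_run_summary" }

// SummaryWriter stands in for whatever repository persists the summary rows.
type SummaryWriter interface {
	SaveTestRunSummaries(ctx context.Context) error
}

type TestRunSummaryWorker struct {
	river.WorkerDefaults[TestRunSummaryArgs]
	summaries SummaryWriter
}

// Work aggregates live and phantom rule execution statistics into the summary table.
func (w *TestRunSummaryWorker) Work(ctx context.Context, job *river.Job[TestRunSummaryArgs]) error {
	return w.summaries.SaveTestRunSummaries(ctx)
}

// periodicJobs enqueues the summary job on a fixed interval (value still to be decided).
func periodicJobs() []*river.PeriodicJob {
	return []*river.PeriodicJob{
		river.NewPeriodicJob(
			river.PeriodicInterval(5*time.Minute),
			func() (river.JobArgs, *river.InsertOpts) { return TestRunSummaryArgs{}, nil },
			&river.PeriodicJobOpts{RunOnStart: true},
		),
	}
}

The periodic jobs would then be passed through river.Config's PeriodicJobs field when the client is created.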

@apognu apognu added the enhancement and go labels on Feb 21, 2025
@apognu apognu self-assigned this Feb 21, 2025
@apognu apognu force-pushed the feat/summarize-test-run-statistics branch from e58f76f to 8b64106 on February 21, 2025 16:25
Contributor

@Pascal-Delange Pascal-Delange left a comment

All clear and cool.
The only thing I have to say is, if we're doing this, we might as well complete it and also do the pre-accounting of:

  • sanction check rule execution stats (these go in the same stats table as the other rules; see a comment inline about the stable rule id)
  • decisions by status stats (this one is less painful than the decisions/rules stats, but it's still liable to get significantly slow in the face of high decision volumes)

WDYT?

}
}

func (w *TestRunSummaryWorker) Work(ctx context.Context, job *river.Job[models.TestRunSummaryArgs]) error {
Contributor

Do we also want to define a timeout on the job? I'd say something in the ballpark of 20s to a minute would be reasonable.

Contributor Author

Once we hit the timeout, what would make a subsequent run succeed?

Except for transient errors, it is fair to say that a run hitting the timeout will never succeed: the next run would have even more data to process, and so on and so forth.

That raises a question this PR has not addressed yet: monitoring. How do we raise alerts if a test run summary lags too far behind?

Contributor

Good question.
By default, what would happen now is that if the job consistently fails, we would get notified on Sentry (at least I think so; it's worth double-checking whether a timeout specifically triggers a Sentry alert in the job logger middleware).
Your remark about the "doom loop", where a run that fails once is expected to keep failing, is valid, and a good reason to pick a long-ish timeout: at least something longer than a duration it seems realistic to wait for (a few minutes may do the job).

Contributor

That being said, I still think it's good to set the timeout explicitly, as not doing so would just implicitly inherit the default job timeout from the client level, which (in this case) is not set there.

Contributor Author

I set a 2-minute timeout for now. All of the timings (intervals and timeouts) will still have to be settled, though.
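
For reference, with river a per-worker timeout is typically set by overriding the Timeout method provided by river.WorkerDefaults; a sketch of what the 2-minute value could look like (the exact placement is in the commit, and the "time" and river imports are assumed):

// Sketch only: overrides the default job timeout for this worker.
func (w *TestRunSummaryWorker) Timeout(*river.Job[models.TestRunSummaryArgs]) time.Duration {
	return 2 * time.Minute
}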

type ScenarioTestRunSummary struct {
	Id           string
	RuleName     string
	RuleStableId string
Contributor

A warning here about the stable id if we also do stats for sanction check rules: there is a bit of a dirty hack in the TestRunStatsByRuleExecution usecase method that computes a fake (UUID) "stable rule id" on the fly for sanction checks in the live/test versions. This is fine as long as we only compute them in real time, but we will need to add a proper stable rule id to the sanction check if we also want to precompute them.

Contributor Author

apognu commented Feb 25, 2025

> All clear and cool. The only thing I have to say is, if we're doing this, we might as well complete it and also do the pre-accounting of:
>
> * sanction check rule execution stats (these go in the same stats table as the other rules; see a comment inline about the stable rule id)
> * decisions by status stats (this one is less painful than the decisions/rules stats, but it's still liable to get significantly slow in the face of high decision volumes)
>
> WDYT?

Totally agree, let's talk about it today, I'll start looking at how it works.

Contributor Author

apognu commented Feb 25, 2025

I implemented the decision stats in the latest commit. I will have a look at sanctions stats later today.

@apognu apognu force-pushed the feat/summarize-test-run-statistics branch from 614b846 to afc4725 on February 25, 2025 12:25
@apognu apognu force-pushed the feat/worker-index-creation branch from abb4456 to 6572443 on February 25, 2025 12:53
@apognu apognu force-pushed the feat/summarize-test-run-statistics branch from afc4725 to b140861 on February 25, 2025 12:55