Summarize test run rule evaluations in a separate table. #868

Draft
apognu wants to merge 4 commits into feat/worker-index-creation from feat/summarize-test-run-statistics

Conversation

Contributor

@apognu apognu commented Feb 21, 2025

This is a prototype that uses the worker setup from #855 to periodically save test run rule execution statistics (for both the live and phantom versions) into a summary table, for faster querying.
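
For illustration, here is the rough shape such a worker could take with river. This is a minimal sketch, not the final code: the args and worker names mirror the signatures in the diff, while the SummaryWriter interface, the job kind, the interval value and the periodic registration are placeholders.

// Minimal sketch, assuming river's worker and periodic-job APIs; SummaryWriter,
// the job kind and the interval below are illustrative placeholders.
package worker

import (
	"context"
	"time"

	"github.com/riverqueue/river"
)

type TestRunSummaryArgs struct{}

// Kind identifies the job type in river's jobs table.
func (TestRunSummaryArgs) Kind() string { return "test_run_summary" }

// SummaryWriter stands in for whatever repository persists the summary rows.
type SummaryWriter interface {
	SaveTestRunSummaries(ctx context.Context) error
}

type TestRunSummaryWorker struct {
	river.WorkerDefaults[TestRunSummaryArgs]
	summaries SummaryWriter
}

// Work aggregates live and phantom rule execution statistics into the summary table.
func (w *TestRunSummaryWorker) Work(ctx context.Context, job *river.Job[TestRunSummaryArgs]) error {
	return w.summaries.SaveTestRunSummaries(ctx)
}

// periodicJobs enqueues the summary job on a fixed interval (value still to be decided).
func periodicJobs() []*river.PeriodicJob {
	return []*river.PeriodicJob{
		river.NewPeriodicJob(
			river.PeriodicInterval(5*time.Minute),
			func() (river.JobArgs, *river.InsertOpts) { return TestRunSummaryArgs{}, nil },
			&river.PeriodicJobOpts{RunOnStart: true},
		),
	}
}

The periodic jobs would then be passed through river.Config's PeriodicJobs field when the client is created.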

@apognu apognu added the enhancement and go labels on Feb 21, 2025
@apognu apognu self-assigned this Feb 21, 2025
@apognu apognu force-pushed the feat/summarize-test-run-statistics branch from e58f76f to 8b64106 on February 21, 2025 16:25
Contributor

@Pascal-Delange Pascal-Delange left a comment

All clear and cool.
The only thing I have to say is, if we're doing this, we might as well complete it and also do the pre-accounting of:

  • sanction check rule execution stats (these go in the same stats table as the other rules; see a comment inline about the stable rule id)
  • decisions by status stats (this one is less painful than the decisions/rules stats, but it's still liable to get significantly slow in the face of high decision volumes)

WDYT?

}
}

func (w *TestRunSummaryWorker) Work(ctx context.Context, job *river.Job[models.TestRunSummaryArgs]) error {
Contributor

Do we also want to define a timeout on the job? I'd say something in the ballpark of 20s to a minute would be reasonable.

Contributor Author

Once we hit the timeout, what would make a subsequent run succeed?

Except for transient errors, it is fair to say that a run hitting the timeout will never succeed: the next run would have even more data to process, and so on and so forth.

That raises a question this PR has not addressed yet: monitoring. How do we raise alerts if a test run summary lags too far behind?

Contributor

Good question.
By default, what would happen now is that if the job consistently fails, we would get notified on Sentry (at least I think so; it's worth double-checking whether a timeout specifically triggers a Sentry alert in the job logger middleware).
Your remark about the "doom loop", where a run that fails once is expected to keep failing, is valid, and a good reason to pick a long-ish timeout: at least something longer than a duration it seems realistic to wait for (a few minutes may do the job).

Contributor

That being said, I still think it's good to set the timeout explicitly, as not doing so would just implicitly inherit the default job timeout from the client level, which (in this case) is not set there.

Contributor Author

I set a 2-minute timeout for now. All of the timings (intervals and timeouts) will still have to be settled, though.
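
For reference, with river a per-worker timeout is typically set by overriding the Timeout method provided by river.WorkerDefaults; a sketch of what the 2-minute value could look like (the exact placement is in the commit, and the "time" and river imports are assumed):

// Sketch only: overrides the default job timeout for this worker.
func (w *TestRunSummaryWorker) Timeout(*river.Job[models.TestRunSummaryArgs]) time.Duration {
	return 2 * time.Minute
}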

type ScenarioTestRunSummary struct {
	Id           string
	RuleName     string
	RuleStableId string
Contributor

A warning here about the stable id if we also do stats for sanction check rules: there is a bit of a dirty hack in the TestRunStatsByRuleExecution usecase method that computes a fake (UUID) "stable rule id" on the fly for sanction checks in the live/test versions. This is fine as long as we only compute them in real time, but we will need to add a proper stable rule id to the sanction check if we also want to precompute them.

Contributor Author

apognu commented Feb 25, 2025

> All clear and cool. The only thing I have to say is, if we're doing this, we might as well complete it and also do the pre-accounting of:
>
> * sanction check rule execution stats (these go in the same stats table as the other rules; see a comment inline about the stable rule id)
> * decisions by status stats (this one is less painful than the decisions/rules stats, but it's still liable to get significantly slow in the face of high decision volumes)
>
> WDYT?

Totally agree, let's talk about it today, I'll start looking at how it works.

Contributor Author

apognu commented Feb 25, 2025

I implemented the decision stats in the latest commit. I will have a look at sanctions stats later today.

@apognu apognu force-pushed the feat/summarize-test-run-statistics branch from 614b846 to afc4725 on February 25, 2025 12:25
@apognu apognu force-pushed the feat/worker-index-creation branch from abb4456 to 6572443 on February 25, 2025 12:53
@apognu apognu force-pushed the feat/summarize-test-run-statistics branch from afc4725 to b140861 on February 25, 2025 12:55