-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add alerting models, task architecture RFC #216
base: main
Are you sure you want to change the base?
Conversation
- **AlertCheck**: Logs each evaluation of an alert, including results and notification statuses. | ||
|
||
2. **Task Execution**: | ||
- Use Celery to handle periodic checks and notifications. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just checking whether it should be temporal rather than celery (i don't even know how we'd choose 🙈)
alert_id_groups = [alert_ids[i:i+10] for i in range(0, len(alert_ids), 10)] | ||
|
||
group_of_chains = group( | ||
chain(check_alert.chunks(group, 10)) for group in alert_id_groups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we do implement this it definitely needs a comment - i think i understand what it's doing but i bet i'm wrong 🙈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my concern as well, it looks a bit difficult, and that's why I suggest for + limit_concurrency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already have a set of scheduled tasks which try and keep the cache warm and we're not monitoring them well (IMHO as the person who left them not well monitored last time they were changed)
this is effectively the same mechanism for a different reason
we could overlap these mechanisms so that caching overall gets better too
but we should definitely learn from the past and make sure that we know how often these tasks run, when they fail, and have appropriate alerting in place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth noting that it might be not a good idea to use cache for alerts, users probably want the freshest value to be tested, at most 5-10 minutes of staleness.
I should look into it actually, because alerts that use cached stale values are not very good alerts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly I don't think we should have a cache task processing X per minute and an alert task processing X for per minute, since we're going to overload processing infrastructure
or at least we shouldn't ignore that we have both
the more i think about it i think this might need limits - free customers can have 2 alerts, paid can have 10, (fill in numbers as you see fit)
we can always allow more, but can't restrict them (AWS has soft and hard limits to protect service, but you can ask them to increase soft limits)
(similarly we should consider if this is going to be in the open source or EE licensed portion of the code)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the more i think about it i think this might need limits - free customers can have 2 alerts, paid can have 10, (fill in numbers as you see fit)
Yeah, I agree with that, Mixpanel allows 50 alerts per client, probably there's a reason for that.
2 and 10 alerts seems a bit small though:) maybe anything above some number should be paid?
(similarly we should consider if this is going to be in the open source or EE licensed portion of the code)
I'd prefer keeping it in the open source, at least while I'm participating - I might require an extra approval from my employer to contribute to EE directories .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth noting that it might be not a good idea to use cache for alerts, users probably want the freshest value to be tested, at most 5-10 minutes of staleness.
We overhauled the query runners cache recently, so to get fresh data (and fill the cache) you can just use the execution mode CALCULATE_BLOCKING_ALWAYS
in the alert task.
alert_check = AlertCheck.objects.create( | ||
alert=alert, | ||
calculated_value=calculated_value, | ||
threshold_met=threshold_met, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we're adding thresholds... i'd love to be able to add thresholds in the UI that don't alert just so I can get some nice visuals on insights without needing alerts
that has the benefit we can ship it as soon as its made and start benefitting and getting user feedback on the UX
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh, that's a nice idea... So basically a threshold in the UI could be an alert that doesn't notify anyone (yet) and doesn't even need to be checked/calculated. But you could easily turn on notifications for that threshold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, i want to see the threshold on the graph because that's useful but don't need an alert for every threshold
e.g. measuring web vitals, i want to colour the graphs because its hard to remember the 4 different thresholds (well, 8 because they have good, medium, poor) but I don't need an alert when they change cos they don't always indicate something i need to immediately react to
it's so cool we're working on this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for writing this!
- `created`: Time of the check. | ||
- `calculated_value`: Result of the check. | ||
- `anomaly_condition`: Copy of the condition for the alert in case original got updated. | ||
- `threshold_met`: Whether the condition was met. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe is_anomaly? My first impression was that threshold_met=true means the value is below the threshold.
Also, we might consider other anomaly conditions, not only threshold.
On the second thought, maybe it shouldn't be a boolean, but an enum. If there's an error in calculation, I don't know what boolean value can be assigned here. So what about check_result = ANOMALY | NORMAL_VALUE | ERROR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe trigger_condition
and trigger_condition_met
??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trigger_condition
🤷 sounds good
trigger_condition_met
I'd prefer having an enum here, IMO it will need more values than 2. You think check_result is not good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding "anomaly": Is there any background or reasoning for this term? I think it evokes other expectations (anomaly detection) and is not quite what we have here with strict set thresholds. I would actually vote to use another term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the second thought, maybe it shouldn't be a boolean, but an enum. If there's an error in calculation, I don't know what boolean value can be assigned here.
We need to decide whether to send out a notification (= threshold met) yes or no. That is boolean.
Errors can be stored separately (see error_message
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any background or reasoning for this term?
It just popped in my head. What term would you prefer? I would say a value that is above or below a user-defined threshold is an anomaly.
We need to decide whether to send out a notification (= threshold met) yes or no.
- What if there's a problem with the alert or insight configuration, for example the user misconfigured something, we should notify users about that, no? So whether we send out a notification might depend not only on meeting the threshold.
- If there's an error with the calculation, what value would the "threshold_met" have? I would say it's neither true nor false, and I wouldn't like to have an optional boolean as it doesn't help readability.
- In the future alerts might get other anomaly detection mechanisms, so it won't be only about thresholds. That's why I think it's better not mention "threshold" here.
- Enums are cheap, IMO provide better readability, and are extensible, why do you think it's better to use a boolean?
Sorry, missed this comment last time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've built a couple of similar matchers in the past and they always evolve to support &&
and ||
and i always ended up returning something like {matches: bool, explanation: string }
- so familiarity with something similar is altering my opinion here (doesn't mean i'm right 😊)
i'd definitely have error
as a separate case, if there's an error the condition was neither met nor not met
in other parts of posthog we ended up having error_count
and last_error_date
so that we could decide whether to carry on trying something
tbh there's a predicate condition and a predicate result
maybe it's better to have it that generic
this might not always be an anomaly to the user - "tell me when an order value is about $1k" is a happy thing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but i should leave you @webjunkie to it 😊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there's an error the condition was neither met nor not met
That worries me a bit, I usually find difficult to interpret optional booleans. My rule of thumb is that if something is slightly more difficult than yes/no then I default to an enum - it's more explicit and more extensible. I'm not advocating for different error types in the check_result. I'm just suggesting we use an enum to express this absence of the outcome more explicitly, have a required enum with three states.
And still "Enums are cheap, IMO provide better readability, and are extensible". It's better to have a door open to extensions, especially with databases, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One other thing that comes to mind, which might influence the modeling here: We currently would alert each time we check, if the threshold is continued to be breached, right?
If we find the threshold is met, we might check previous AlertCheck
s to see if this was already the cases there and if a notification has been sent already.
Regarding the boolean discussion: I'm happy to talk about a specific alternative if you may present one.
this might not always be an anomaly to the user - "tell me when an order value is about $1k" is a happy thing
This was also my thinking.. anomaly is a too strong word, and we can leave it more generic.. But naming is hard 🤷🏻
) | ||
|
||
if threshold_met: | ||
send_notification.s(alert_check.id).delay() # Launch notification task based on check object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder maybe we don't extract it into a separate task? The only profit I see is that if a notification fails and we want to retry it, we won't calculate the metric value again. I'd say profit is negligible assuming that notifications fail rarely 🤷
I'm not saying notifications should be in the same function with alerts though, I might as well extract it for readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only profit I see is that if a notification fails and we want to retry it, we won't calculate the metric value again
importantly
we won't calculate the metric value again.
that's a really valuable thing
some queries are inherently very slow, if we can avoid running them too often we should
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we absolutely should be able to retry a notification N times but only calculate the value once (or at least <N times since we don't want to retry for 6 hours without checking if the alert has recovered 🙈)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other reason I put this is that we can easily launch other things based on the trigger, e.g. hand it off to CDP, launch a Slack notification task, and so on. In this way it's prepared to be pluggable.
**Groups**: Allow multiple tasks to be executed in parallel, providing the ability to run concurrent checks for | ||
different alerts. By creating groups of tasks, we can limit the number of parallel operations to control system load | ||
and ensure efficient resource utilization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder maybe limit_concurrency (that you introduced) would work here? Basically that's all we need - limit the number of tasks in flight, no?
If we add limit_concurrency and simply process all alerts in a for
, IMO it would have the same effect as if we used chains + groups + chunks, but it would be simpler, or do I miss something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the trick here (however we do it) is to limit the number of running tasks such that if the number of celery pods scale the max number of tasks does not.
e.g. 10 per worker is fine with 4 pods, then under load we scale to 100 pods and we're running 1000 concurrent not 40
(maybe teaching everyone to suck eggs 🙈)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the trick here (however we do it) is to limit the number of running tasks such that if the number of celery pods scale the max number of tasks does not.
IIUC the limit_concurrency implementation, it looks at the number of tasks in flight and drops the new one if it's above the limit, so I'd say it will keep the limit if we increase the number of pods, right?
maybe teaching everyone to suck eggs 🙈
I haven't worked with Celery before, so not at all! What an odd expression though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What an odd expression though.
🤣
i have a bad habit of speaking in idioms - sorry!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The limit concurrency implementation I did recently acts more as a safeguard. The reason is that if you put 100 tasks in the queue and they hit the limit, they will start to retry and retry but only do so a couple of times and then fail. So this system isn't good for "slam stuff in the queue and wait for it to be done".
Setting up groups and chains allows us to control that there are 10 groups launched (= 10 running at the same time) and they sequentially check the alerts. That's a simple way I would say to control how much is done at a time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right this would also break the chain 🤔
I also thought the same if we can limit worker concurrency... the problem is, this concurrency setting is for one worker. We might spin up several pods though. And they also might autoscale to even more. Unfortunately we currently cannot control the worker amount and autoscaling granular per worker type.
I can think about it more, but maybe the case that alerts fail from time to time is not that severe for a first version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose there's a way to create a dedicated pod with exact number of CPUs for this queue. But yeah, I see it's not that simple. Let's try the chains and see how it goes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed possible, but someone would need to dig into the configs and templates in the other repo 😓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed that our infra team pushed an unrelated adjustment to auto scaling, which now allows for exact worker/pod configuration of maximum instances. I'll test this with some other tasks and we might end up just using the same mechanism here then after all 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, cool!
What do you say we keep the mechanism with chains for now (to unblock development of other features and to merge that old PR) and if the configuration works out I'll adjust it?
alert_id_groups = [alert_ids[i:i+10] for i in range(0, len(alert_ids), 10)] | ||
|
||
group_of_chains = group( | ||
chain(check_alert.chunks(group, 10)) for group in alert_id_groups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my concern as well, it looks a bit difficult, and that's why I suggest for + limit_concurrency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth noting that it might be not a good idea to use cache for alerts, users probably want the freshest value to be tested, at most 5-10 minutes of staleness.
I should look into it actually, because alerts that use cached stale values are not very good alerts.
No description provided.