-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Observability] [AAD] Streamline the method of saving group information in alert document #183248
Comments
Pinging @elastic/obs-ux-management-team (Team:obs-ux-management) |
We have a different set of context variables for "Summary of alerts" action frequency. We don't have separate
|
We can introduce two fields - one for search use case, one for iterating over.
|
Update: @benakansara corrected me that this is only the case in context.*, not at the alerts as data level. |
The current state of observability rules for the following two fields:
|
As described by @maryam-saeidi in above comment, atm we are storing group info in Using this field in search could result in false positives, as seen in one of the examples below. If user filters alerts with |
Based on discussion offline with @jasonrhodes and @maryam-saeidi, and considering search returning false positives with current approach, I think we have two options to streamline saving group in alert document:
With this approach -
The downside which was also captured in RFC is mapping explosion. However, as @dgieselaar mentioned on slack - The chance of a mapping explosion seems practically non-existing because a grouping key is explicitly set by the user and not generated from data (only the values are) Even if there are 100s of rules configured by user, the set of group by fields would be quite limited (with overlapping group by fields between rules)
With this approach -
The downside with keeping only one My proposal would be option 1) with dynamic mapping enabled. |
@elastic/response-ops (@ymao1 @pmuellr ) Our team is revisiting the alert grouping field topic that we discussed a year ago (document). We are considering the possibility of using flatten fields with enabled dynamic mapping as mentioned above (second option in the document) considering the fact that now we have a setting to ignore dynamic fields above the field limit and with assuming the number of fields that a user would select for group by fields is probably manageable. Any feedback/concerns about selecting this approach? |
@maryam-saeidi small clarification: |
Example of index mapping
|
Dang, poking around the Kibana codebase, I can see security rules are using fields Not clear to me yet, but if there's still a plan to use |
I believe security had us change some things to Also, at the time, I believe KQL didn't support |
@pmuellr do you have specific objections around a dynamically mapped object which only accepts strings under
Do you have a link to the code where this happens? |
@benakansara I think |
I don't have much experience with dynamically mapped objects, which is my main objection :-). Convince me!
I have more experience with What are the other downsides of |
I just did a search of the Kibana codebase in vscode for In any case, using a new field(s) for this seems wise :-) |
Happy to!
I really think this is a non-issue. Someone would programmatically need to generate random-ish grouping keys and create rules with them. Now maybe someone will do that... but I don't think the chance of that happening is bigger than let's say someone indexing into the
It doesn't show up in field caps so you cannot do things like autocomplete on them or verify the existence of a field ahead of time (which we need for ES|QL for instance). |
@dgieselaar I agree. |
Thanks @maryam-saeidi and @benakansara — I talked to Mary yesterday and said I was leaning toward using the dynamic mapping for this field because the benefits seem good and the risk of having a customer group alerts by hundreds of different fields seems small. To help bring some clarity to this conversation, I had a look at what our current situation is. It looks like for an index like {
"settings": {
"index": {
"mapping": {
"total_fields": {
"limit": "2500"
},
"ignore_malformed": "true"
}
}
}
} When I do a search on that same index to see current fields:
I get 2123. So I think we need to ask whether we feel comfortable with that amount of room when introducing a dynamic field, especially if there could be other reasons these mappings could grow beyond just this one field. Can we bump this number up without cuasing issues? Would it be conceivable for a customer to group alerts by, say, 100 different fields? 200? How confident are we there? I think it's reasonable to assume that number likely would never approach 1000, but we should also know the story of what would happen if a customer did exceed this limit by grouping in an unexpected way. |
How do we end up with over 2000 fields? Are we using statically mapped ECS fields instead of ecs@mappings which uses dynamic templates? |
@jasonrhodes Thanks for sharing the information about the current limitations. I was wondering if it would be an option to enable dynamic mapping on a cluster similar to one of ours (like QA or any cluster in which we have a lot of alerting rules) and see how many fields will be added. This would give a sample for the question: "Would it be conceivable for a customer to group alerts by, say, 100 different fields? 200? ". We can also use that instance to test manually and see this approach in action. By adding dynamic mapping, we will save the mappings of ECS fields two times in this case, one at the root level and one for this group by field which does not have an issue by default, just to keep it in mind in case there are possible improvements to reuse mappings at different levels. (I am not sure if there is such a feature.) Another topic to discuss is whether using dynamic mapping can cause an issue regarding the type of field that will be added dynamically. Would it be possible that the type that is added dynamically does not match the actual type of the field? Can it be an issue? If a user adds a group by field mistakenly, would that be a problem besides having an extra unused mapping? Is there a possibility that users add many group by fields mistakenly? If yes, What is the process of correction? (Similar to the question, "what would happen if a customer did exceed this limit by grouping in an unexpected way.") What would happen if the user changed the shape of their data and renamed the list of groups by fields by migrating to a new set of field names? Can we have a clean-up process for such a case? And, since we have |
@dgieselaar I think it is related to using the |
@maryam-saeidi are we required to use it? @pmuellr Is this legacy, or can we switch to ecs@mappings? We had this issue years ago when RAC started, and we had very long discussions where IIRC we agreed not to map all ECS fields by default, precisely because of this reason. |
@dgieselaar That is currently still the way we introduce ECS mappings into the alerts documents. I will bring up addressing this issue with the team. |
@ymao1 yes please, it would be great if we can address it on the short term - it seems counterproductive to have discussions about adding dozens of fields when we're needlessly creating explicit mappings for around 2000 of them of which we only use a handful by default. |
Feels like this conversation is beginning to spin its wheels a bit, putting us at risk of still not having a consistent and usable way to do what we're hoping to do. @ymao1 (cc @kobelb) it would probably be helpful to have a realistic assessment of whether that linked issue sounds viable in the short term. Observability folks, if we continue to include ~2000 ECS fields in the alerts index mappings, would we still be comfortable introducing another dynamically mapped field for |
AFAIK they don't show up in _field_caps, meaning we also cannot validate the presence of the field before running the query - which is (unfortunately) a necessity for ES|QL. They also don't support anything other than keywords and a subset of queries, which I expect to be mostly fine (that is, until someone uses a boolean or a long for grouping :) ). FWIW, I don't think we should block this based on the mappings issue. Ideally we just go with dynamically mapped objects, and if the field limit becomes an issue, we have something that will have a much bigger impact (switching to ecs@mapping) rather than us having to use workarounds like flattened fields. |
It should be fairly simple to run some queries against our overview cluster for a rough idea of how many group by fields are in use there. I defer to @andrewvc on the rest of these questions. I don't have a great sense of how big the risk is if we move forward with a dynamically mapped field in this kind of shared index space, but I'm also a bit nervous we could overthink things and spend too much time being defensive about it, when most of these questions might already be unanswered with regard to what happens in these same scenarios if the (mostly irrelevant) ECS mappings grow, etc. |
A possible alternative is using a painless script to filter out false positives. Example painless query
|
Coming back around to this discussion, thanks for the ping @maryam-saeidi ! My current thought is dynamic mapping for some objects in the mappings is probably ok, though we should obviously think it through:
Guessing that for these grouping fields, we'll be fine. The values are field names, right? So they'd basically become the same key under this new object. I'd guess within a project/deployment, the cardinality is fine. The problems you might expect would be with something like date-based field names, seems unlikely folks would be creating "random" key values. If we do this, I'm sure we'll do more of this :-). So it would be good to have some experience of what happens when we hit the limits. I suspect it's kinda silent, assuming we can set a limit and have ES continue, presumably ignoring new "fields". And thus an SDH-generator. |
Beyond the mappings, there was some thought about how the context variables would be accessed, and I think that's a good thing to think about as well. Seems like we would actually want an "ordered dictionary" kind of collection, since I think the current shape doesn't tell you the "order" of the grouping. But since neither JS nor ES support that, do we need a separate array of the keys in grouping order, so someone could iterate over them that way? JS actually does sort of support ordering of properties in objects, but I'd like to not depend on that, as you can lose the orderings in different ways. I'd like to see some mustache template examples accessing these fields. I think the relevance of the mustache fields is < the mappings; the mappings are hugely important, the mustache fields - we can improve over time, or probably find potentially verbose solutions to whatever shape they are given. But still something to think about. |
I think the concern is that the number of dynamic fields underneath the group object could grow unbounded and we could reach the total fields limit at which point the only resolution would be to increase the total fields limit which would require a new release. This is something we used to deal with in the rule registry when the index resources were installed as needed. Since we started installing alert resources on Kibana startup, we are able to catch these issues during development time (reaching the total fields limit). However, if the number of groups directly correlates with the number of alerts that could be generated during a single execution, we do already cap that value (default 1000) so maybe that's ok? Or does the possible groups correlate with all possible groups that have been cumulatively generated by the rule, in which case it could grow unbounded? I don't have a good sense of this here. |
Ya, I guess we will need some estimate on the cardinality, and then increase the current max we have by that amount.
I believe the coorelation is how many fields they group by, over all rules in a single "index". So, if they only ever grouped by the same 3 fields over all their rules, there would be 3 new fields. |
@ymao1 @pmuellr if we run into the field limit we have #168497. That issue has been open for a year, and I'd like to use any objections around this as a forcing function to actually get that ticket done - the value of getting that over the finish line seems immense and immediately makes this discussion much simpler. I remember I spent a lot of time during RAC arguing about why statically mapping ECS fields in all AAD indices is a bad idea. The conclusion back then was that only Security AAD indices would do this so I'm not sure why this has now been applied to all AAD indices, including the Observability ones. I am pushing on this because it does not make sense to me to have a discussion about adding a few dozen of fields (tops) when we have the opportunity to cut back the amount of mapped fields by two orders of magnitude. @pmuellr your questions make sense, however, they are problems that exist today. I don't expect this feature to materially change the amount of mapped fields. For reference, the cardinality of |
@dgieselaar I'll move that issue back into triage and we'll see if we can prioritize it. |
Apologies for missing the previous pings. I'm +1 on reducing the current field usage along the lines @dgieselaar proposes and also on the dynamic mappings. I think it's the best balance of flexibility and performance. @pmuellr @ymao1 do we have any rough estimates in terms of effort / time to deliver here? |
Hi everyone, I created a PoC to test dynamic mapping and did some tests and here are my findings:
In general, I see the main issue with hitting the limit is not being able to search the new fields, but the alerts will still be generated, and we can see the data in the alert flyout (if we enable
|
We discussed this in a ResponseOps call last week, and concur. They are similar but different, and I think we will get a bit more experience with the dyanmic fields by starting with just the dynamic groups. Seems like we will want to add something to the framework alert writer, to catch the |
I'm catching up with the issue and one question I don't see asked is why we are not using the alert ECS mappings at the root level to accomplish this story? I'm sure there's a reason for it but it's not clear to me after reading the use cases on the GitHub issue..
The root fields seem to work for this use case, and I'm assuming the values are already populated at the root level today?
Maybe is this requirement that needs something special. Is there something where we only want auto complete on the group by fields? We structured the alert documents in a way that this data can be surfaced at the root and then leverage this structure for maintenance windows and conditional actions. It would feel inconsistent if we provided multiple sources to accomplish the same thing. |
Yes, that is correct; we already saved the group by keyword ECS fields at the root level. The issue is related to handling groups that are not ECS fields. For example, in Otel data, we have k8s.cluster.name instead of orchestrator.cluster.name. |
Thanks @maryam-saeidi. I discussed with the team, and we feel comfortable if we find a way to implement this change for
If you're good within those constraints, we'd be happy to have you or someone else prototype this for the team to review. |
I'm +1 on what @mikecote proposes, I'm curious about how we'd enforce the guardrail, if we have a place where we can easily do that validation that's great, I'm just curious where it'd go. |
@mikecote FWIW, Mary already put up a POC here: #199298. Is your ask to include some kind of guardrails in that POC? FWIW, in SLOs the fields are called |
Echoing what Dario mentioned above, I did a PoC, and I tried to find an option to limit the number of dynamic mappings for a field but didn't see such an option. Any proposal on how we can achieve that?
Good point! If we have a similar definition of saving group information in |
I don't have an idea at this time. I think that would be the last piece left for the PoC, to find a way to guardrail so we guarantee only a limited number of fields get mapped. Is that something you could take time to research? Would be curious to see what options exist on how this could be done in Kibana side or Elasticsearch side? |
Regarding adding a guardrail, I checked with the ES team and didn't find any option. I created a ticket to request adding this feature: elastic/elasticsearch#118223 |
@maryam-saeidi there's a similar issue: elastic/elasticsearch#113275. I would suggest to talk to Felix/Joe about what the state is of that one, but at least it seems like it's the same issue, but perhaps with a different suggested solution. To clarify, what you're asking for is a limit of dynamically mapped fields per object, e.g. |
@dgieselaar Yes, exactly. I will check the ticket you shared and talk to Felix/Joe about it. |
@maryam-saeidi We should also consider the anomaly detection rule type. I'm not sure if that uses grouping, but the partition and by fields should be stored in kibana.alert.group as well so we can also match anomaly detection alerts to streams/entities. |
@maryam-saeidi another question: why do we not show the index threshold rule type? it also supports grouping keys no? (I will admit that I don't really understand the difference between the ES Query rule type and the Index Threshold rule type) |
I think the main point of this ticket is to come up with a solution, and we can incrementally add it to all the observability rules that we see fit. So, I would focus on finding the solution first and adding it to a rule if we see it necessary should not be an issue.
That's another big question we can discuss separately, and Jason and I are aware of this challenge. (The topic of whether the distinction between stack and observability alerting rules makes sense). |
@maryam-saeidi Let's just add them to the list here, or do you see a good reason not to? The reason I brought that up is because it initially listed only Observability rule types and it led to confusion w/ the Response Ops team. |
Adding to the list is not an issue, but we need to check which fields we want to bring and what the implications are (in terms of having them under the grouping fields when a rule has a different field for it). To be clear, I am not opposed to it; I am just saying it needs a bit more discussion, and I would rather have a dedicated ticket for it. |
Summary of the discussion: We decided to move forward with this approach considering the following:
We’d also like to see telemetry collected on this to better understand the cardinality we’re dealing with, which will help us determine if/when we should implement a circuit breaker. (We can possibly implement this logic sooner if this ES feature request gets prioritized.) |
Currently we have
kibana.alert.instance.id
in all alerts that saves comma separated group values in the alert document. We would like to have a field that provides information in the form of {field, value} pair, and allows for individual {field, value} to be searchable/queryable in the alert document. The requirement of this field is discussed in the RFC here.Based on the discussion in above RFC, the Custom threshold rule saves group information in AAD with
kibana.alert.group
field which is an array of{ field: field-name, value: field-value }
.We need to streamline the method of saving group information in AAD across all Observability rules.
Use cases
Rules where group info should be saved in its dedicated field in alert document
kibana.alert.group
arraykibana.alert.group
arraykibana.alert.group
arraykibana.alert.group
arrayNeeds more discussion
Acceptance criteria
The text was updated successfully, but these errors were encountered: