[Observability] [AAD] Streamline the method of saving group information in alert document #183248

Open
Tracked by #183516
benakansara opened this issue May 13, 2024 · 56 comments
Labels: Feature:Alerting, Team:obs-ux-management (Observability Management User Experience Team)

@benakansara
Contributor

benakansara commented May 13, 2024

Currently we have kibana.alert.instance.id in all alerts, which saves comma-separated group values in the alert document. We would like a field that provides this information as {field, value} pairs and allows each individual {field, value} pair to be searchable/queryable in the alert document. The requirement for this field is discussed in the RFC here.

Based on the discussion in the above RFC, the Custom threshold rule saves group information in AAD in the kibana.alert.group field, which is an array of { field: field-name, value: field-value }.

We need to streamline the method of saving group information in AAD across all Observability rules.

Use cases

  • The field should be searchable/queryable reliably without false positives
  • Auto-suggestion on KQL bar should suggest this field
  • Use in action template of "Summary of alerts" action frequency (described in comment below) without relying on index

Rules where group info should be saved in its dedicated field in alert document

  • ES Query rule - currently does not save group information
  • Custom threshold rule - currently has kibana.alert.group array
  • Metric threshold rule - currently has kibana.alert.group array
  • Log threshold rule - currently has kibana.alert.group array
  • SLO burn rate rule - currently has kibana.alert.group array
  • Inventory threshold rule

Needs more discussion

  • APM Latency threshold rule
  • APM Failed transaction rate threshold rule
  • APM Error count rule
  • Synthetics monitor status
  • Anomaly detection

Acceptance criteria

  • Have same field with same structure to save group information in alert document across all Observability rules
@benakansara benakansara added Feature:Alerting Team:obs-ux-management Observability Management User Experience Team labels May 13, 2024
@elasticmachine
Contributor

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

@benakansara
Contributor Author

We have a different set of context variables for the "Summary of alerts" action frequency. We don't have separate context.group or context.groupByKeys context variables. Instead, we can access all AAD fields when creating the action message. This would be one of the use cases for using group info from AAD. I used the alerts.all.data variable to build the alert action message. In this case, we need to rely on the array index if the AAD field has an array-like structure.

{{#alerts.all.data}}
Host name: {{kibana.alert.group.0.value}}
Container ID: {{kibana.alert.group.1.value}}
{{/alerts.all.data}}

@benakansara
Contributor Author

We can introduce two fields - one for search use case, one for iterating over.

  • kibana.alert.group as an array [ {field: field-name, value: field-value} ]
  • kibana.alert.groupByKeys as an object
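
To make the two shapes concrete, here is a minimal TypeScript sketch of how the array form could be converted into the object form. The type names and the `toGroupByKeys` helper are hypothetical and are not Kibana APIs; only the field names come from this thread, and the sample values are made up.

```typescript
// Hypothetical types; only the field names come from the discussion above.
interface GroupEntry {
  field: string;
  value: string;
}

type GroupByKeys = { [field: string]: string };

// Convert the array form (kibana.alert.group) into the object form
// (kibana.alert.groupByKeys). If a field repeats, the last entry wins.
function toGroupByKeys(entries: GroupEntry[]): GroupByKeys {
  const result: GroupByKeys = {};
  for (const entry of entries) {
    result[entry.field] = entry.value;
  }
  return result;
}

// Made-up sample data for illustration.
const sampleGroup: GroupEntry[] = [
  { field: 'host.name', value: 'host-1' },
  { field: 'container.id', value: 'container-a' },
];

const sampleGroupByKeys = toGroupByKeys(sampleGroup);
// sampleGroupByKeys['host.name'] is 'host-1', with no dependence on array position.
```

With the object form, a template or query can address a group value by its field name instead of by its position in the array.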

@jasonrhodes
Member

jasonrhodes commented Jun 6, 2024

The only problem I see there is I think kibana.alert.group is already used in some rule types, as a string -- other than that I am +1 on this idea, generally.

Update: @benakansara corrected me that this is only the case in context.*, not at the alerts as data level.

@maryam-saeidi
Member

maryam-saeidi commented Oct 1, 2024

The current state of observability rules for the following two fields:

  • kibana.alert.group as an array [ {field: field-name, value: field-value} ]
  • kibana.alert.groupByKeys as an object
    • This field does not exist atm

Rule type | kibana.alert.group
APM rule | No group by fields
Inventory | No group by fields, has a predefined list of fields that can be selected from
Metric threshold |
Custom threshold |
Log threshold |
SLO burn rate |
ES Query |
Anomaly detection | Didn't see a group field there, maybe we should check what fields exist in an anomaly job

@benakansara
Contributor Author

As described by @maryam-saeidi in above comment, atm we are storing group info in kibana.alert.group as an array in some of the rules.

Using this field in search could result in false positives, as seen in one of the examples below.

If a user filters alerts with the KQL filter kibana.alert.group.field: "service.name" and kibana.alert.group.value: *product* in the alerts search bar, they would expect to see only services with "product" in their name, but this returns any document where "product" matches any of the kibana.alert.group.value values. In the example below, using otel-demo data, I created a rule with group by on service.name and transaction.name, and the above filter returns services without "product" in their name because some transactions have "product" in their transaction names.

Image
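
The false positive follows from how Elasticsearch indexes arrays of plain objects: unless the field is mapped as `nested`, the association between `field` and `value` inside each array element is lost and each sub-field becomes an independent multi-valued field. The sketch below is a toy in-memory model of that behavior, not a real query; the helper names and sample values are made up for illustration.

```typescript
interface GroupEntry {
  field: string;
  value: string;
}

// Mimics `kibana.alert.group.field: "<field>" and kibana.alert.group.value: *<text>*`
// against the flattened form: the two conditions are checked independently,
// so they may be satisfied by different array elements.
function matchesFlattened(entries: GroupEntry[], field: string, text: string): boolean {
  let fieldMatches = false;
  let valueMatches = false;
  for (const entry of entries) {
    if (entry.field === field) fieldMatches = true;
    if (entry.value.indexOf(text) !== -1) valueMatches = true;
  }
  return fieldMatches && valueMatches;
}

// What the user actually means: both conditions must hold for the same entry.
function matchesPairwise(entries: GroupEntry[], field: string, text: string): boolean {
  for (const entry of entries) {
    if (entry.field === field && entry.value.indexOf(text) !== -1) return true;
  }
  return false;
}

// Hypothetical alert: the service name has no "product" in it,
// but the transaction name does.
const alertGroup: GroupEntry[] = [
  { field: 'service.name', value: 'frontend' },
  { field: 'transaction.name', value: 'GET /api/products' },
];

const flattenedHit = matchesFlattened(alertGroup, 'service.name', 'product'); // true: false positive
const pairwiseHit = matchesPairwise(alertGroup, 'service.name', 'product'); // false: intended result
```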

@benakansara
Contributor Author

benakansara commented Oct 1, 2024

Based on discussion offline with @jasonrhodes and @maryam-saeidi, and considering search returning false positives with current approach, I think we have two options to streamline saving group in alert document:

  1. kibana.alert.groupByKeys or kibana.alert.groupings field as object type with dynamic mapping enabled

With this approach -

  • we won't need two separate fields - one for KQL bar and the other for searching
  • KQL auto suggestion will include field name kibana.alert.group.host.name as opposed to current kibana.alert.group.value
  • querying is possible

The downside which was also captured in RFC is mapping explosion. However, as @dgieselaar mentioned on slack - The chance of a mapping explosion seems practically non-existing because a grouping key is explicitly set by the user and not generated from data (only the values are)

Even if there are 100s of rules configured by user, the set of group by fields would be quite limited (with overlapping group by fields between rules)

  2. kibana.alert.groupByKeys or kibana.alert.groupings field as flattened type

With this approach -

  • querying is possible

The downside of keeping only one flattened field is that we won't have KQL auto-suggestion. We would also need to keep the existing kibana.alert.group array field, which works for KQL auto-suggestion but can lead to misleading results in some cases.

My proposal would be option 1) with dynamic mapping enabled.

@maryam-saeidi
Member

@elastic/response-ops (@ymao1 @pmuellr )

Our team is revisiting the alert grouping field topic that we discussed a year ago (document). We are considering using flattened fields with dynamic mapping enabled, as mentioned above (second option in the document), given that we now have a setting to ignore dynamic fields above the field limit, and assuming the number of fields a user would select as group by fields is probably manageable.

Any feedback/concerns about selecting this approach?

@benakansara
Contributor Author

benakansara commented Oct 1, 2024

@maryam-saeidi small clarification:
for search use case - If we use flattened type, we don't need dynamic mapping. If we keep field type object, we would need dynamic mapping
(updated my earlier comment to reflect this)

@benakansara
Contributor Author

Example of index mapping

  1. object type with dynamic mapping
"mappings": {
  "properties": {
    "kibana.alert.groupByKeys": {
      "dynamic": true,
      "properties": {}
    }
  }
}
  2. flattened type
"mappings": {
  "properties": {
    "kibana.alert.groupByKeys": {
      "type": "flattened"
    }
  }
}

@pmuellr
Member

pmuellr commented Oct 1, 2024

Dang, poking around the Kibana codebase, I can see security rules are using fields kibana.alert.group.id and kibana.alert.group.index, both as aliases of other security-specific fields.

Not clear to me yet, but if there's still a plan to use kibana.alert.group (maybe not?, still reading comments above, thought I should point this out tho), this seems problematic in that we'd have mapping issues searching across o11y and security alerts indices.

@pmuellr
Member

pmuellr commented Oct 1, 2024

I believe security had us change some things to flattened to make some searches easier/possible, but I think there are also some down-sides; maybe the values are always treated as strings? Feels like that wouldn't be an issue if we just want to track a group name and value that is always a string. Are there grouping values that could be dates, numbers, etc?

Also, at the time, I believe KQL didn't support nested fields, so that wasn't an option. Maybe it does today? Would we even need KQL support - basically for UX? I believe nested fields also aren't supported in ES|QL at the moment, which may be another good reason to not use nested.

@dgieselaar
Member

@pmuellr do you have specific objections around a dynamically mapped object which only accepts strings under kibana.alert.grouping, similar to how SLOs use slo.grouping? I think flattened comes with other downsides that I'd like us to avoid (e.g. it not showing up in field caps IIUC).

Dang, poking around the Kibana codebase, I can see security rules are using fields kibana.alert.group.id and kibana.alert.group.index, both as aliases of other security-specific fields.

Do you have a link to the code where this happens?

@dgieselaar
Member

@benakansara I think groupByKeys gives the impression that it's a set of the grouping keys (ie an array of strings), instead of the field-value pairs that are a result of the grouping key (ie a plain key-value object).

@pmuellr
Member

pmuellr commented Oct 1, 2024

do you have specific objections around a dynamically mapped object which only accepts strings

I don't have much experience with dynamically mapped objects, which is my main objection :-). Convince me!

The chance of a mapping explosion seems practically non-existing because a grouping key is explicitly set by the user and not generated from data (only the values are)

I have more experience with flattened than dynamic at this point, and seems practically non-existing doesn't give me great feels :-)

What are the other downsides of flattened? Maybe can't be used in aggs or other contexts where dynamic can? I know there are (or used to be) limitations in the value types (always treated as keyword?), which wouldn't be a problem in this case, unless we allow grouping by non-keyword types like numeric / date ...

@pmuellr
Member

pmuellr commented Oct 1, 2024

Dang, poking around the Kibana codebase, I can see security rules are using fields kibana.alert.group.id and kibana.alert.group.index, both as aliases of other security-specific fields.

Do you have a link to the code where this happens?

I just did a search of the Kibana codebase in vscode for kibana.alert.group - it shows both kibana.alert.group with field/value and id/index variants. This doesn't seem right to me, but I haven't investigated further.

In any case, using a new field(s) for this seems wise :-)

@dgieselaar
Member

@pmuellr:

I don't have much experience with dynamically mapped objects, which is my main objection :-). Convince me!

Happy to!

I have more experience with flattened than dynamic at this point, and seems practically non-existing doesn't give me great feels :-)

I really think this is a non-issue. Someone would programmatically need to generate random-ish grouping keys and create rules with them. Now maybe someone will do that... but I don't think the chance of that happening is bigger than let's say someone indexing into the .alerts index directly.

What are the other downsides of flattened? Maybe can't be used in aggs or other contexts where dynamic can? I know there are (or used to be) limitations in the value types (always treated as keyword?), which wouldn't be a problem in this case, unless we allow grouping by non-keyword types like numeric / date ...

It doesn't show up in field caps so you cannot do things like autocomplete on them or verify the existence of a field ahead of time (which we need for ES|QL for instance).

@benakansara
Contributor Author

I think groupByKeys gives the impression that it's a set of the grouping keys (ie an array of strings), instead of the field-value pairs that are a result of the grouping key (ie a plain key-value object).

@dgieselaar I agree. groupings or group sounds better to me. The idea of naming it groupByKeys comes from the fact that we have a context variable called context.groupByKeys and it would be good to have consistent naming (we couldn't use context.group as it was already present with string type).

@jasonrhodes
Member

Thanks @maryam-saeidi and @benakansara — I talked to Mary yesterday and said I was leaning toward using the dynamic mapping for this field because the benefits seem good and the risk of having a customer group alerts by hundreds of different fields seems small. To help bring some clarity to this conversation, I had a look at what our current situation is. It looks like for an index like .internal.alerts-observability.metrics.alerts-default-000001, the field limit has been upped to 2500 as-is:

{
  "settings": {
    "index": {
      "mapping": {
        "total_fields": {
          "limit": "2500"
        },
        "ignore_malformed": "true"
      }
    }
  }
}

When I do a search on that same index to see current fields:

curl -s -XGET "/.alerts-observability.metrics*/_field_caps?fields=*" | jq '.fields|length'

I get 2123. So I think we need to ask whether we feel comfortable with that amount of room when introducing a dynamic field, especially if there could be other reasons these mappings could grow beyond just this one field. Can we bump this number up without causing issues? Would it be conceivable for a customer to group alerts by, say, 100 different fields? 200? How confident are we there? I think it's reasonable to assume that number likely would never approach 1000, but we should also know the story of what would happen if a customer did exceed this limit by grouping in an unexpected way.
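
The headroom arithmetic above can be sketched as a small TypeScript helper. It is a rough estimate only: it treats the `_field_caps` count as the number counted against `index.mapping.total_fields.limit`, which is an approximation, and the function names are made up for illustration.

```typescript
// Remaining room under index.mapping.total_fields.limit (never negative).
function remainingFieldHeadroom(totalFieldsLimit: number, currentFieldCount: number): number {
  return Math.max(0, totalFieldsLimit - currentFieldCount);
}

// Whether a given number of new group-by fields would still fit under the limit.
function fitsWithinLimit(
  totalFieldsLimit: number,
  currentFieldCount: number,
  newFieldCount: number
): boolean {
  return newFieldCount <= remainingFieldHeadroom(totalFieldsLimit, currentFieldCount);
}

// Numbers taken from the comment above: limit 2500, 2123 fields in use.
const fieldHeadroom = remainingFieldHeadroom(2500, 2123); // 377
const fits100 = fitsWithinLimit(2500, 2123, 100); // true
const fits200 = fitsWithinLimit(2500, 2123, 200); // true
const fits1000 = fitsWithinLimit(2500, 2123, 1000); // false
```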

@dgieselaar
Member

How do we end up with over 2000 fields? Are we using statically mapped ECS fields instead of ecs@mappings which uses dynamic templates?

@maryam-saeidi
Member

@jasonrhodes Thanks for sharing the information about the current limitations. I was wondering if it would be an option to enable dynamic mapping on a cluster similar to one of ours (like QA or any cluster in which we have a lot of alerting rules) and see how many fields will be added. This would give a sample for the question: "Would it be conceivable for a customer to group alerts by, say, 100 different fields? 200? ". We can also use that instance to test manually and see this approach in action.

By adding dynamic mapping, we will save the mappings of ECS fields twice in this case: once at the root level and once under this group by field. This is not an issue by default; it is just something to keep in mind in case there are possible improvements to reuse mappings at different levels. (I am not sure if there is such a feature.)

Another topic to discuss is whether using dynamic mapping can cause an issue regarding the type of field that will be added dynamically. Would it be possible that the type that is added dynamically does not match the actual type of the field? Can it be an issue?

If a user adds a group by field mistakenly, would that be a problem besides having an extra unused mapping? Is there a possibility that users add many group by fields mistakenly? If yes, what is the process of correction? (Similar to the question, "what would happen if a customer did exceed this limit by grouping in an unexpected way.") What would happen if the user changed the shape of their data and renamed their group by fields by migrating to a new set of field names? Can we have a clean-up process for such a case?

And, since we have fieldsForAAD that limits the fields we show in the alert table and auto-suggestions in the KQL bar, how would it work with dynamic mapping? (Would it work as expected if we add kibana.alert.groupings.* or something similar to that list?)

@maryam-saeidi
Member

How do we end up with over 2000 fields? Are we using statically mapped ECS fields instead of ecs@mappings which uses dynamic templates?

@dgieselaar I think it is related to using the .alerts-ecs-mappings component template for alerting indices, which has the mapping of all ECS fields statically.

@dgieselaar
Member

@maryam-saeidi are we required to use it? @pmuellr Is this legacy, or can we switch to ecs@mappings? We had this issue years ago when RAC started, and we had very long discussions where IIRC we agreed not to map all ECS fields by default, precisely because of this reason.

@ymao1
Contributor

ymao1 commented Oct 8, 2024

@dgieselaar That is currently still the way we introduce ECS mappings into the alerts documents. I will bring up addressing this issue with the team.

@dgieselaar
Member

@ymao1 yes please, it would be great if we can address it in the short term - it seems counterproductive to have discussions about adding dozens of fields when we're needlessly creating explicit mappings for around 2000 of them, of which we only use a handful by default.

@jasonrhodes
Member

Feels like this conversation is beginning to spin its wheels a bit, putting us at risk of still not having a consistent and usable way to do what we're hoping to do. @ymao1 (cc @kobelb) it would probably be helpful to have a realistic assessment of whether that linked issue sounds viable in the short term.

Observability folks, if we continue to include ~2000 ECS fields in the alerts index mappings, would we still be comfortable introducing another dynamically mapped field for kibana.alert.grouping? Can someone confirm the specific downsides of using the flattened type for our various use-cases with this grouping field?

@dgieselaar
Member

dgieselaar commented Oct 8, 2024

Can someone confirm the specific downsides of using the flattened type for our various use-cases with this grouping field?

AFAIK they don't show up in _field_caps, meaning we also cannot validate the presence of the field before running the query - which is (unfortunately) a necessity for ES|QL. They also don't support anything other than keywords and a subset of queries, which I expect to be mostly fine (that is, until someone uses a boolean or a long for grouping :) ).

FWIW, I don't think we should block this based on the mappings issue. Ideally we just go with dynamically mapped objects, and if the field limit becomes an issue, we have something that will have a much bigger impact (switching to ecs@mappings) rather than us having to use workarounds like flattened fields.

@jasonrhodes
Member

It should be fairly simple to run some queries against our overview cluster for a rough idea of how many group by fields are in use there.

I defer to @andrewvc on the rest of these questions. I don't have a great sense of how big the risk is if we move forward with a dynamically mapped field in this kind of shared index space, but I'm also a bit nervous we could overthink things and spend too much time being defensive about it, when most of these questions might already be unanswered with regard to what happens in these same scenarios if the (mostly irrelevant) ECS mappings grow, etc.

@maryam-saeidi
Member

maryam-saeidi commented Oct 31, 2024

A possible alternative is using a painless script to filter out false positives.

Example painless query
GET .alerts*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "should": [
                    {
                      "query_string": {
                        "fields": [
                          "kibana.alert.group.value"
                        ],
                        "query": "container-name"
                      }
                    }
                  ],
                  "minimum_should_match": 1
                }
              },
              {
                "bool": {
                  "should": [
                    {
                      "match_phrase": {
                        "kibana.alert.group.field": "host.name"
                      }
                    }
                  ],
                  "minimum_should_match": 1
                }
              }
            ]
          }
        },
        {
          "script": {
            "script": {
              "source": "for (int i=0; i < doc['kibana.alert.group.field'].length; i++) { if (doc['kibana.alert.group.field'][i] == params.groupName && doc['kibana.alert.group.value'][i] == params.groupValue) return true;} return false;",
              "params": {
                "groupName": "host.name",
                "groupValue": "container-name"
              }
            }
          }
        }
      ]
    }
  }
}
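
For readers less familiar with Painless, the loop in the script filter can be mirrored in TypeScript as below. This is only an illustration of the logic, with made-up sample values. It also shows the assumption the script depends on: the two arrays must stay index-aligned. That is worth verifying, since Elasticsearch doc values for multi-valued keyword fields are, as far as I know, returned sorted rather than in source order.

```typescript
// Mirrors the Painless loop: walk two parallel arrays and require that the
// field name and the value match at the same index.
function matchesGroup(
  fieldNames: string[],
  fieldValues: string[],
  groupName: string,
  groupValue: string
): boolean {
  const count = Math.min(fieldNames.length, fieldValues.length);
  for (let i = 0; i < count; i++) {
    if (fieldNames[i] === groupName && fieldValues[i] === groupValue) return true;
  }
  return false;
}

// Arrays in source order: host.name pairs with zeta-host.
const sourceFields = ['host.name', 'service.name'];
const sourceValues = ['zeta-host', 'alpha-service'];
const alignedMatch = matchesGroup(sourceFields, sourceValues, 'host.name', 'zeta-host'); // true

// If each array were sorted independently, the pairing would shift:
// host.name would now line up with alpha-service.
const sortedValues = sourceValues.slice().sort();
const misalignedMatch = matchesGroup(sourceFields, sortedValues, 'host.name', 'alpha-service'); // true, but wrong pairing
```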

@maryam-saeidi maryam-saeidi self-assigned this Nov 4, 2024
@pmuellr
Member

pmuellr commented Nov 5, 2024

Coming back around to this discussion, thanks for the ping @maryam-saeidi !

My current thought is dynamic mapping for some objects in the mappings is probably ok, though we should obviously think it through:

  • what is the expected cardinality of the fields?
  • how are we currently dealing with the max # of fields index.mapping.total_fields.limit, and how does that change?
  • how are we currently dealing with the max # of dynamic fields index.mapping.total_fields.ignore_dynamic_beyond_limit, and how does that change?
  • how do we find out when we hit a limit?
  • what do we do when we hit a limit?
  • what happens when a limit is hit, and causes an SDH, and how do we figure out that's the problem?

Guessing that for these grouping fields, we'll be fine. The values are field names, right? So they'd basically become the same key under this new object. I'd guess within a project/deployment, the cardinality is fine. The problems you might expect would be with something like date-based field names, seems unlikely folks would be creating "random" key values.

If we do this, I'm sure we'll do more of this :-). So it would be good to have some experience of what happens when we hit the limits. I suspect it's kinda silent, assuming we can set a limit and have ES continue, presumably ignoring new "fields". And thus an SDH-generator.

@mikecote @ymao1 thoughts?

@pmuellr
Member

pmuellr commented Nov 5, 2024

Beyond the mappings, there was some thought about how the context variables would be accessed, and I think that's a good thing to think about as well. Seems like we would actually want an "ordered dictionary" kind of collection, since I think the current shape doesn't tell you the "order" of the grouping. But since neither JS nor ES support that, do we need a separate array of the keys in grouping order, so someone could iterate over them that way? JS actually does sort of support ordering of properties in objects, but I'd like to not depend on that, as you can lose the orderings in different ways. I'd like to see some mustache template examples accessing these fields.
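
One way to picture the "object plus a separate array of keys" idea is the TypeScript sketch below. The `OrderedGrouping` shape and the `groupingOrder` field are hypothetical and only illustrate the suggestion; nothing like this has been agreed in the thread.

```typescript
// Hypothetical shape: a key-value object for lookups, plus an explicit
// array that records the group-by order.
interface OrderedGrouping {
  grouping: { [field: string]: string };
  groupingOrder: string[];
}

interface FieldValuePair {
  field: string;
  value: string;
}

// Produce field/value pairs in grouping order without relying on
// JavaScript object property ordering.
function toOrderedPairs(source: OrderedGrouping): FieldValuePair[] {
  const pairs: FieldValuePair[] = [];
  for (const field of source.groupingOrder) {
    const value = source.grouping[field];
    if (value !== undefined) {
      pairs.push({ field: field, value: value });
    }
  }
  return pairs;
}

// Made-up sample: the rule groups by container.id first, then host.name.
const orderedSample: OrderedGrouping = {
  grouping: { 'host.name': 'host-1', 'container.id': 'container-a' },
  groupingOrder: ['container.id', 'host.name'],
};

const orderedPairs = toOrderedPairs(orderedSample);
// orderedPairs[0].field is 'container.id', orderedPairs[1].field is 'host.name'
```

An array of pairs like this is also straightforward to iterate in a template, while the object form stays available for direct lookups by field name.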

I think the relevance of the mustache fields is < the mappings; the mappings are hugely important, the mustache fields - we can improve over time, or probably find potentially verbose solutions to whatever shape they are given. But still something to think about.

@ymao1
Contributor

ymao1 commented Nov 6, 2024

what is the expected cardinality of the fields?
how are we currently dealing with the max # of fields index.mapping.total_fields.limit, and how does that change?
how are we currently dealing with the max # of dynamic fields index.mapping.total_fields.ignore_dynamic_beyond_limit, and how does that change?
how do we find out when we hit a limit?
what do we do when we hit a limit?
what happens when a limit is hit, and causes an SDH, and how do we figure out that's the problem?

I think the concern is that the number of dynamic fields underneath the group object could grow unbounded and we could reach the total fields limit at which point the only resolution would be to increase the total fields limit which would require a new release. This is something we used to deal with in the rule registry when the index resources were installed as needed. Since we started installing alert resources on Kibana startup, we are able to catch these issues during development time (reaching the total fields limit).

However, if the number of groups directly correlates with the number of alerts that could be generated during a single execution, we do already cap that value (default 1000) so maybe that's ok? Or do the possible groups correlate with all possible groups that have been cumulatively generated by the rule, in which case it could grow unbounded? I don't have a good sense of this here.

@pmuellr
Member

pmuellr commented Nov 6, 2024

I think the concern is that the number of dynamic fields underneath the group object could grow unbounded and we could reach the total fields limit at which point the only resolution would be to increase the total fields limit which would require a new release. This is something we used to deal with in the rule registry when the index resources were installed as needed. Since we started installing alert resources on Kibana startup, we are able to catch these issues during development time (reaching the total fields limit).

Ya, I guess we will need some estimate on the cardinality, and then increase the current max we have by that amount.

However, if the number of groups directly correlates with the number of alerts that could be generated during a single execution, we do already cap that value (default 1000) so maybe that's ok? Or do the possible groups correlate with all possible groups that have been cumulatively generated by the rule, in which case it could grow unbounded? I don't have a good sense of this here.

I believe the correlation is with how many fields they group by, over all rules in a single "index". So, if they only ever grouped by the same 3 fields over all their rules, there would be 3 new fields.
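
As a rough illustration of that point, the number of new dynamic fields tracks the distinct group-by field names across all rules, not the number of alerts or group values. The TypeScript sketch below uses a made-up `RuleSummary` shape and sample rules. It does not model intermediate objects (for example `host` for `host.name`), which may also count toward the field limit.

```typescript
// Hypothetical summary of a rule; only the group-by field names matter here.
interface RuleSummary {
  ruleName: string;
  groupBy: string[];
}

// Count distinct group-by field names across all rules.
function countDistinctGroupByFields(ruleList: RuleSummary[]): number {
  const seen: { [field: string]: boolean } = {};
  let distinct = 0;
  for (const rule of ruleList) {
    for (const field of rule.groupBy) {
      if (!Object.prototype.hasOwnProperty.call(seen, field)) {
        seen[field] = true;
        distinct++;
      }
    }
  }
  return distinct;
}

// Made-up rules that reuse the same three fields in different combinations.
const sampleRules: RuleSummary[] = [
  { ruleName: 'cpu', groupBy: ['host.name'] },
  { ruleName: 'memory', groupBy: ['host.name', 'container.id'] },
  { ruleName: 'latency', groupBy: ['service.name', 'host.name'] },
];

const distinctGroupByFields = countDistinctGroupByFields(sampleRules); // 3
```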

@dgieselaar
Member

dgieselaar commented Nov 7, 2024

@ymao1 @pmuellr if we run into the field limit we have #168497. That issue has been open for a year, and I'd like to use any objections around this as a forcing function to actually get that ticket done - the value of getting that over the finish line seems immense and immediately makes this discussion much simpler. I remember I spent a lot of time during RAC arguing about why statically mapping ECS fields in all AAD indices is a bad idea. The conclusion back then was that only Security AAD indices would do this so I'm not sure why this has now been applied to all AAD indices, including the Observability ones. I am pushing on this because it does not make sense to me to have a discussion about adding a few dozen fields (tops) when we have the opportunity to cut back the number of mapped fields by two orders of magnitude.

@pmuellr your questions make sense, however, they are problems that exist today. I don't expect this feature to materially change the number of mapped fields. For reference, the cardinality of kibana.alert.group.field in the overview cluster is 19, for 1 million alerts.

@ymao1
Contributor

ymao1 commented Nov 7, 2024

@dgieselaar I'll move that issue back into triage and we'll see if we can prioritize it.

@andrewvc
Contributor

andrewvc commented Nov 8, 2024

Apologies for missing the previous pings. I'm +1 on reducing the current field usage along the lines @dgieselaar proposes and also on the dynamic mappings. I think it's the best balance of flexibility and performance.

@pmuellr @ymao1 do we have any rough estimates in terms of effort / time to deliver here?

@maryam-saeidi
Member

maryam-saeidi commented Nov 8, 2024

Hi everyone,

I created a PoC to test dynamic mapping and did some tests and here are my findings:

  1. In case of hitting the limit, if we don't have index.mapping.total_fields.ignore_dynamic_beyond_limit enabled (PR), we will get the following error, and that alert will not be reported (the rule execution is successful):
    Error writing alerts for observability.rules.custom_threshold:53301b6a-5bb6-42ee-a8ba-ac3d352dbaf7 'Custom threshold'
    
    If we enable index.mapping.total_fields.ignore_dynamic_beyond_limit, then the extra fields will be saved, but they will not be mapped, and we will see these fields listed in the _ignored field.
  2. We can enable dynamic mapping for the string type, but apparently, other types, such as long/double, are cast to a string, if possible. From the UI, we don't allow selecting those fields, but if someone provides them via API, we will generate alerts based on those, and a dynamic mapping will be saved for them.
  3. When we hit the limit, we can increase the mapping limit, but we need to help the user investigate the issue and possibly get rid of the unused mappings if that is the case.
    • If we increase the mapping limit, then the existing alert documents will be updated in the next execution and the _ignored fields will be removed.
    • If we roll over the alert index, then all the dynamic mappings will be removed, and the new index will start with a fresh mapping. However, when using the alert search bar, we will see mappings from all the indices behind the alert index alias, so to remove those mappings, we need to somehow get rid of the old index, not sure how. (Maybe something to consider in the alert Archiving / Deletion strategy) (Thanks, @P1llus, for pointing this out.)
      I haven't fully tested this approach yet.
  4. During a discussion with @P1llus, we also discussed how we can surface _ignored issues to the user and possibly help them with some instructions on how to debug this. But if we solve the ECS static mapping, maybe we wouldn't need to focus on this as the chance of users hitting the limit seems small.
  5. It is worth mentioning that in the case of ECS group by fields, these fields will be mapped twice (once at the root level and once in the grouping field), but given our expectation of having only a few group by fields, mapping them twice should be fine.
  6. I don't know how likely this issue is, but if a mapping is available in an old index (alert-index-1) and not available in a new index (alert-index-2), then in the alert search box it is possible to search based on that field, but the new alerts will not be filtered as expected.

In general, I see the main issue with hitting the limit is not being able to search the new fields, but the alerts will still be generated, and we can see the data in the alert flyout (if we enable index.mapping.total_fields.ignore_dynamic_beyond_limit), so maybe we can consider moving forward with dynamic groups even before having dynamic ECS fields mapping, especially considering the sample data provided above:

  • In a sample mentioned here, we still have 377 (2500 - 2123) field mappings available
  • Cardinality of kibana.alert.group.field is 19 in the overview cluster (comment)
  • I also checked the overview cluster, and here is the number of fields for the different alert indices (I used /_field_caps?fields=* for this purpose)
    | Index | Number of field mappings | Number of available mappings |
    | --- | --- | --- |
    | .alerts-observability.metrics* | 2121 | 379 |
    | .alerts-observability.threshold* | 2091 | 409 |
    | .alerts-observability.logs* | 2121 | 379 |
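
For reference, the counting above can be sketched roughly like this. It is only a minimal sketch: it assumes the response of /_field_caps?fields=* has been fetched already, that every key under `fields` counts as one mapping, and that the limit is the 2500 used in the numbers above. `remainingMappings` and `ASSUMED_FIELD_LIMIT` are names made up for illustration, and the count may not match exactly what Elasticsearch counts against the limit.

```typescript
// Relevant part of a _field_caps response: one key per field name.
interface FieldCapsResponse {
  fields: Record<string, Record<string, unknown>>;
}

// Assumption: 2500 is the limit used in the numbers above, not a value read from the index.
const ASSUMED_FIELD_LIMIT = 2500;

// Hypothetical helper: counts field names and reports how many mappings are left.
function remainingMappings(
  resp: FieldCapsResponse,
  limit: number = ASSUMED_FIELD_LIMIT
): { used: number; available: number } {
  const used = Object.keys(resp.fields).length;
  return { used, available: Math.max(limit - used, 0) };
}
```

With 2121 field names and a limit of 2500, this gives the 379 available mappings shown for the metrics index.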

@pmuellr
Member

pmuellr commented Nov 11, 2024

maybe we can consider moving forward with dynamic groups even before having dynamic ECS fields mapping

We discussed this in a ResponseOps call last week, and concur. They are similar but different, and I think we will get a bit more experience with the dynamic fields by starting with just the dynamic groups.

Seems like we will want to add something to the framework alert writer, to catch the _ignored fields (don't think we do today), and surface them somehow.
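
A rough sketch of what catching these could look like, assuming we have the written alert documents back as search hits and that each hit carries an `_ignored` array when Elasticsearch ignored some of its fields. `AlertHit` and `collectIgnoredFields` are hypothetical names, not existing framework code, and how the result gets surfaced is left open.

```typescript
// Minimal shape of a search hit; only the parts this sketch needs.
interface AlertHit {
  _id: string;
  _ignored?: string[];
}

// Hypothetical helper: groups alert document ids by the field name that was ignored.
function collectIgnoredFields(hits: AlertHit[]): Map<string, string[]> {
  const byField = new Map<string, string[]>();
  for (const hit of hits) {
    for (const field of hit._ignored ?? []) {
      const ids = byField.get(field) ?? [];
      ids.push(hit._id);
      byField.set(field, ids);
    }
  }
  return byField;
}
```

An empty map would mean nothing was ignored; otherwise the keys are candidates for a warning on the rule or a log message.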

@mikecote
Contributor

I'm catching up with the issue, and one question I don't see asked is why we are not using the alert ECS mappings at the root level to accomplish this story? I'm sure there's a reason for it, but it's not clear to me after reading the use cases on the GitHub issue.

The root fields seem to work for this use case, and I'm assuming the values are already populated at the root level today?

  • Auto-suggestion on KQL bar should suggest this field

Maybe it is this requirement that needs something special. Is there a case where we only want autocomplete on the group-by fields?


We structured the alert documents in a way that this data can be surfaced at the root and then leverage this structure for maintenance windows and conditional actions. It would feel inconsistent if we provided multiple sources to accomplish the same thing.

@maryam-saeidi
Member

why we are not using the alert ECS mappings at the root level to accomplish this story?
The root fields seem to work for this use case, and I'm assuming the values are already populated at the root level today?

Yes, that is correct; we already save the group-by keyword ECS fields at the root level. The issue is related to handling groups that are not ECS fields. For example, in OTel data, we have k8s.cluster.name instead of orchestrator.cluster.name.
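
To illustrate the gap, here is a minimal sketch, not the actual rule code. It assumes a small hard-coded set of ECS keyword fields (`ASSUMED_ECS_KEYWORD_FIELDS` is a stand-in for the real ECS field map) and a hypothetical `buildAlertGroupFields` helper; the kibana.alert.group shape is the { field, value } array described in the issue.

```typescript
// Assumption: stand-in for the real list of ECS keyword fields.
const ASSUMED_ECS_KEYWORD_FIELDS = new Set<string>(['host.name', 'orchestrator.cluster.name']);

// Hypothetical helper: every group goes into kibana.alert.group,
// but only ECS keyword fields are also copied to the root of the alert document.
function buildAlertGroupFields(grouping: Record<string, string>): Record<string, unknown> {
  const entries = Object.entries(grouping);
  const doc: Record<string, unknown> = {
    'kibana.alert.group': entries.map(([field, value]) => ({ field, value })),
  };
  for (const [field, value] of entries) {
    if (ASSUMED_ECS_KEYWORD_FIELDS.has(field)) {
      doc[field] = value;
    }
  }
  return doc;
}
```

For a rule grouped by host.name and k8s.cluster.name, host.name ends up at the root while k8s.cluster.name only appears inside kibana.alert.group, which is the case the root-level ECS mappings do not cover.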

@mikecote
Contributor

Thanks @maryam-saeidi. I discussed with the team, and we feel comfortable if we find a way to implement this change for kibana.alert.group within the following constraints:

  • Have a guardrail to prevent the number of fields and mappings stored within kibana.alert.group from going beyond, say, 25.
  • Have all fields mapped as keyword to prevent type mismatches
  • Don't fail rule execution when the guardrail kicks in; perhaps put the rule in a warning state instead
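
As a starting point for a prototype, the three constraints could be sketched on the Kibana side like this. `applyGroupGuardrail`, `GuardrailResult`, and `MAX_GROUP_FIELDS` are hypothetical names, and the cap here is applied per alert document only; it does not by itself limit how many distinct fields get mapped across all rules writing to the same index.

```typescript
interface Group {
  field: string;
  value: string;
}

interface GuardrailResult {
  groups: Group[];
  // Set when fields were dropped, so the caller can put the rule in a warning state.
  warning?: string;
}

// Assumption: 25 is the example limit mentioned above.
const MAX_GROUP_FIELDS = 25;

// Hypothetical helper: keeps at most `max` group fields, coerces every value to a
// string (so it can be mapped as keyword), and never throws.
function applyGroupGuardrail(
  raw: Record<string, unknown>,
  max: number = MAX_GROUP_FIELDS
): GuardrailResult {
  const entries = Object.entries(raw);
  const groups = entries
    .slice(0, max)
    .map(([field, value]) => ({ field, value: String(value) }));
  const dropped = entries.length - groups.length;
  return dropped > 0
    ? { groups, warning: `Dropped ${dropped} group field(s) beyond the limit of ${max}` }
    : { groups };
}
```

The rule executor would write `groups` to the alert document as usual and report `warning`, if present, instead of failing the execution.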

If you're good within those constraints, we'd be happy to have you or someone else prototype this for the team to review.

@andrewvc
Contributor

I'm +1 on what @mikecote proposes. I'm curious about how we'd enforce the guardrail: if we have a place where we can easily do that validation, that's great, I'm just curious where it'd go.

@dgieselaar
Member

@mikecote FWIW, Mary already put up a POC here: #199298. Is your ask to include some kind of guardrails in that POC?

FWIW, in SLOs the fields are called slo.groupings.*. I like singular better than plural (which is how we commonly name fields) but maybe it's more important to be consistent with the SLO field names in this case?

@maryam-saeidi
Member

Echoing what Dario mentioned above, I did a PoC, and I tried to find an option to limit the number of dynamic mappings for a field but didn't see such an option. Any proposal on how we can achieve that?

FWIW, in SLOs the fields are called slo.groupings.*. I like singular better than plural (which is how we commonly name fields) but maybe it's more important to be consistent with the SLO field names in this case?

Good point! If we have a similar definition of saving group information in slo.groupings, using the same name might not be a bad idea.

@mikecote
Contributor

I tried to find an option to limit the number of dynamic mappings for a field but didn't see such an option. Any proposal on how we can achieve that?

I don't have an idea at this time. I think that would be the last piece left for the PoC: finding a guardrail that guarantees only a limited number of fields get mapped. Is that something you could take time to research? I'd be curious to see what options exist for doing this on the Kibana side or the Elasticsearch side.

@maryam-saeidi
Member

Regarding adding a guardrail, I checked with the ES team and didn't find any option. I created a ticket to request adding this feature: elastic/elasticsearch#118223
@dgieselaar @andrewvc How can we discuss this feature request with the ES team for prioritization?

@dgieselaar
Member

@maryam-saeidi there's a similar issue: elastic/elasticsearch#113275. I would suggest talking to Felix/Joe about the state of that one; it seems to be the same issue, perhaps with a different suggested solution. To clarify, what you're asking for is a limit of dynamically mapped fields per object, e.g. kibana.alert.groups should not create more than n dynamically mapped fields?

@maryam-saeidi
Member

@dgieselaar Yes, exactly. I will check the ticket you shared and talk to Felix/Joe about it.

@dgieselaar
Member

@maryam-saeidi We should also consider the anomaly detection rule type. I'm not sure if that uses grouping, but the partition and by fields should be stored in kibana.alert.group as well so we can also match anomaly detection alerts to streams/entities.

@dgieselaar
Member

@maryam-saeidi another question: why do we not show the index threshold rule type? it also supports grouping keys no? (I will admit that I don't really understand the difference between the ES Query rule type and the Index Threshold rule type)

@maryam-saeidi
Member

We should also consider the anomaly detection rule type. I'm not sure if that uses grouping, but the partition and by fields should be stored in kibana.alert.group as well so we can also match anomaly detection alerts to streams/entities.

I think the main point of this ticket is to come up with a solution, which we can then incrementally add to all the observability rules that we see fit. So I would focus on finding the solution first; adding it to a rule later, if we find it necessary, should not be an issue.

why do we not show the index threshold rule type? it also supports grouping keys no? (I will admit that I don't really understand the difference between the ES Query rule type and the Index Threshold rule type)

That's another big question we can discuss separately, and Jason and I are aware of this challenge. (The topic of whether the distinction between stack and observability alerting rules makes sense).

@dgieselaar
Member

@maryam-saeidi Let's just add them to the list here, or do you see a good reason not to? The reason I brought that up is because it initially listed only Observability rule types and it led to confusion w/ the Response Ops team.

@maryam-saeidi
Member

Let's just add them to the list here, or do you see a good reason not to?

Adding to the list is not an issue, but we need to check which fields we want to bring and what the implications are (in terms of having them under the grouping fields when a rule has a different field for it). To be clear, I am not opposed to it; I am just saying it needs a bit more discussion, and I would rather have a dedicated ticket for it.

@maryam-saeidi
Member

maryam-saeidi commented Jan 28, 2025

Summary of the discussion:

We decided to move forward with this approach considering the following:

  • Set the index.mapping.total_fields.ignore_dynamic_beyond_limit to true.

  • Consider an approach for handling the edge case of reaching the mapping limit:

    • Auto-increase mapping
      This seems to be a low-effort option for the ResponseOps team. @mikecote will create a ticket, and they will get back to us within a month with the result of their investigation into whether this option is feasible. If not, we will still have time to consider other options for the next version.

    Other alternatives

    • Rolling over the alerting index
    • Ask for user manual intervention

We’d also like to see telemetry collected on this to better understand the cardinality we’re dealing with, which will help us determine if/when we should implement a circuit breaker. (We can possibly implement this logic sooner if this ES feature request gets prioritized.)
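
To make the auto-increase option a bit more concrete, here is a minimal sketch of the decision only, assuming we already know the current limit and the number of mapped fields. The two setting names are the Elasticsearch index settings discussed above; `nextFieldLimitSettings`, `headroom`, and `step` are hypothetical, and the values 100 and 500 are placeholders rather than a proposal.

```typescript
interface FieldLimitSettings {
  'index.mapping.total_fields.limit': number;
  'index.mapping.total_fields.ignore_dynamic_beyond_limit': boolean;
}

// Hypothetical helper: raises the limit by `step` once fewer than `headroom`
// mappings are left, and always keeps ignore_dynamic_beyond_limit enabled.
function nextFieldLimitSettings(
  usedFields: number,
  currentLimit: number,
  headroom: number = 100, // assumption: placeholder value
  step: number = 500 // assumption: placeholder value
): FieldLimitSettings {
  const limit = currentLimit - usedFields < headroom ? currentLimit + step : currentLimit;
  return {
    'index.mapping.total_fields.limit': limit,
    'index.mapping.total_fields.ignore_dynamic_beyond_limit': true,
  };
}
```

With the numbers from the overview cluster (2121 used, limit 2500) nothing changes; the limit would only be raised once usage gets within the headroom.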
