Add cache limits for resources and attributes #509

Merged
srikanthccv merged 10 commits into main from cache-limits on Jan 18, 2025
Conversation

srikanthccv
Member

The main objective of this change is to prevent rogue datasets from adversely affecting others. The shortcoming of a single set of resource + attribute fingerprints is that all resources are treated equally, when in practice some resources are more cardinal than others. So instead of maintaining a single set for the combination of resource fingerprint + attribute fingerprint, we now maintain one key per resource fingerprint, which holds that resource's set of attribute fingerprints. To prevent key explosion, a separate set tracks the number of unique resource fingerprints (configured with max_resources) for each data source. (Some users add the timestamp of the log records as resource attributes; we don't want to accept such data as part of this.)
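
A minimal Go sketch of this two-level key layout; the key format mirrors the %s:metadata:%s:%d:resources shape quoted later in this thread, and the placeholder meanings (tenant, signal, window) are assumptions, not confirmed by this PR:

package metadata

import "fmt"

// One set per data source tracking unique resource fingerprints,
// bounded by max_resources.
func resourcesKey(tenant, signal string, window int64) string {
	return fmt.Sprintf("%s:metadata:%s:%d:resources", tenant, signal, window)
}

// One set per resource fingerprint holding that resource's attribute
// fingerprints, bounded by max_cardinality_per_resource.
func resourceAttrsKey(tenant, signal string, window int64, resourceFp uint64) string {
	return fmt.Sprintf("%s:metadata:%s:%d:resources:%d", tenant, signal, window, resourceFp)
}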

The (configurable) limits are that there can be at most 8192 resources per data source for the current window, and each resource can have at most 2048 unique attribute fingerprints. Since any data can go into attributes, we want to limit the attribute fingerprints as well. We have several layers of filtering intended to filter out high-cardinality values before they reach fingerprint creation:

  1. We filter out the attributes that have more than X distinct values seen.
  2. We run a goroutine that fetches the distinct count of attribute values for each attribute (the DB has the complete data) and pre-filters them.

This greatly reduces the number of unique fingerprints. However, even when no single attribute has a high distinct count, a handful of attributes with 10-20 values each can combine into a large number of unique attribute fingerprints, so we limit those as well; this is max_cardinality_per_resource. And even when each per-resource count stays below that limit, the number of resources × the number of attribute fingerprints can reach ~17 million (8192 × 2048), which we don't want to allow, so there is also a total maximum cardinality per data source for the current window, configured with max_total_cardinality. All of these settings have defaults, based on our observations from monitoring our own system; we may tweak some of them as we learn more. (A config sketch follows.)
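
As a rough illustration, these knobs might look like the following Go config struct; the field names are modeled on the option names in this PR, but the struct itself and the max_total_cardinality default are assumptions:

// CacheLimits bounds the metadata cache per data source (signal)
// for the current window. Illustrative only.
type CacheLimits struct {
	MaxResources              int // unique resource fingerprints per data source
	MaxCardinalityPerResource int // unique attribute fingerprints per resource
	MaxTotalCardinality       int // resource x attribute combinations per data source
}

func defaultCacheLimits() CacheLimits {
	return CacheLimits{
		MaxResources:              8192, // from the PR description
		MaxCardinalityPerResource: 2048, // from the PR description
		// 8192 * 2048 ≈ 16.7M is the theoretical ceiling; the actual
		// default is not stated in this thread. 3M matches the worst-case
		// figure used in the memory estimates later in this conversation.
		MaxTotalCardinality: 3_000_000, // assumption
	}
}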

@srikanthccv srikanthccv marked this pull request as ready for review January 16, 2025 12:16
@grandwizard28
Collaborator

grandwizard28 commented Jan 16, 2025

Here's what's happening:

We are using one set of the form %s:metadata:%s:%d:resources per signal, per tenant to store unique fingerprints of resources. If its size exceeds max_resources (8192), we don't proceed.

We are using sets of the form %s:metadata:%s:%d:resources:<fingerprint> per resource, per signal, per tenant, of which there can be a total of max_resources (8192). The max number of values one such set can hold is max_cardinality_per_resource (2048).

So, for one tenant and one signal, the max number of sets will be 8192 + 1.
For one tenant across 3 signals, that is 8193 * 3 = 24,579 sets.

For each payload we do the following (in pseudocode):

SCARD(%s:metadata:%s:%d:resources)                      -- max_resources check

WHILE cursor DO
  cursor := SCAN(MATCH %s:metadata:%s:%d:resources:*)   -- expensive keyspace scan
  SCARD(cursor)                                         -- per-resource set sizes
DONE

FOR resource, attributes DO
  attributes_diff = SMISMEMBER(%s:metadata:%s:%d:resources:<fingerprint>, attributes)
DONE

FOR resource, attributes_diff DO
  SCARD(%s:metadata:%s:%d:resources)                    -- re-check max_resources
  SADD(%s:metadata:%s:%d:resources, resource)
  SCARD(%s:metadata:%s:%d:resources:<fingerprint>)      -- max_cardinality_per_resource check
  SADD(%s:metadata:%s:%d:resources:<fingerprint>, attributes_diff)
DONE
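
For concreteness, here is roughly what that per-payload flow looks like with go-redis v9; the key formats, the payload shape, and the error handling are assumptions based on the pseudocode above, not the actual implementation:

package metadata

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// checkAndRecord sketches the current per-payload flow. prefix is the
// "%s:metadata:%s:%d" portion of the keys; payload maps a resource
// fingerprint to its attribute fingerprints. Both are assumptions.
func checkAndRecord(ctx context.Context, rdb *redis.Client, prefix string,
	maxResources, maxPerResource int64, payload map[uint64][]uint64) error {

	resourcesKey := prefix + ":resources"

	// SCARD on the tracking set enforces max_resources up front.
	if n, err := rdb.SCard(ctx, resourcesKey).Result(); err != nil {
		return err
	} else if n >= maxResources {
		return fmt.Errorf("max_resources (%d) reached", maxResources)
	}

	for resourceFp, attrFps := range payload {
		setKey := fmt.Sprintf("%s:%d", resourcesKey, resourceFp)

		// SMISMEMBER reports which attribute fingerprints are new.
		members := make([]interface{}, len(attrFps))
		for i, fp := range attrFps {
			members[i] = fp
		}
		seen, err := rdb.SMIsMember(ctx, setKey, members...).Result()
		if err != nil {
			return err
		}
		var diff []interface{}
		for i, ok := range seen {
			if !ok {
				diff = append(diff, attrFps[i])
			}
		}
		if len(diff) == 0 {
			continue
		}

		// SCARD on the per-resource set enforces max_cardinality_per_resource.
		if n, err := rdb.SCard(ctx, setKey).Result(); err != nil {
			return err
		} else if n+int64(len(diff)) > maxPerResource {
			return fmt.Errorf("max_cardinality_per_resource (%d) reached", maxPerResource)
		}

		// Record the resource and its newly seen attribute fingerprints.
		if err := rdb.SAdd(ctx, resourcesKey, resourceFp).Err(); err != nil {
			return err
		}
		if err := rdb.SAdd(ctx, setKey, diff...).Err(); err != nil {
			return err
		}
	}
	return nil
}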

I propose the following structure:

  1. Use one Redis hash called %s:metadata:%s:%d:resources to store <resource-fingerprint>:<num-attributes> pairs.
  2. Use one single set called %s:metadata:%s:%d:resources:attributes per signal, with <resource-fingerprint>:<attribute-fingerprint> as the values.

So for each payload, we have

FOR resource, attributes DO
   fingerprints = <resource-fingerprint>:<attribute-fingerprint>
DONE

HLEN(%s:metadata:%s:%d:resources) --> max_resources_check

HMGET(%s:metadata:%s:%d:resources) --> per_attribute_check

diff = SADD(%s:metadata:%s:%d:resources:attributes, fingerprints)

[Result of SADD and HMGET will give the updated values]

HMSET(%s:metadata:%s:%d:resources, diff)

Benefits:

  1. Fewer calls to Redis: 4 calls per batch.
  2. Fewer keys created: 2 per tenant per signal.
  3. Redis hashes are fast and memory-efficient.

Bonus points if this can be a Lua script; a sketch of that variant follows.
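
Taking up the Lua suggestion, a minimal sketch of the proposed hash + set shape as an atomic script, run through go-redis; the keys follow the proposal above, while the argument layout and the simplified max_resources guard are assumptions:

package metadata

import "github.com/redis/go-redis/v9"

// KEYS[1]: the %s:metadata:%s:%d:resources hash (resourceFp -> attr count)
// KEYS[2]: the %s:metadata:%s:%d:resources:attributes set
// ARGV[1]: max_resources; ARGV[2..]: alternating resourceFp, attrFp
var addFingerprints = redis.NewScript(`
if redis.call('HLEN', KEYS[1]) >= tonumber(ARGV[1]) then
  return -1  -- max_resources reached; caller rejects the batch
end
local added = 0
for i = 2, #ARGV, 2 do
  -- SADD returns 1 only for a previously unseen resourceFp:attrFp pair
  if redis.call('SADD', KEYS[2], ARGV[i] .. ':' .. ARGV[i+1]) == 1 then
    redis.call('HINCRBY', KEYS[1], ARGV[i], 1)
    added = added + 1
  end
end
return added
`)

// Usage sketch:
//   n, err := addFingerprints.Run(ctx, rdb,
//       []string{resourcesHashKey, attributesSetKey},
//       maxResources, rfp1, afp1, rfp2, afp2).Int()

This keeps the whole check-and-record step to one round trip; the per-resource max_cardinality_per_resource check could be added with an HGET on KEYS[1] before each SADD.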

@srikanthccv
Member Author

What is happening with the metrics? Several reasons: 1. We have pre-filtering implemented for logs and traces with the help of a separate tag_attributes_v2 table, which is not yet populated for metrics (coming with the metrics explorer), so all attributes are ingested even when they are not useful. 2. Prometheus-scraped metrics add net.host.name as a resource attribute, so series that have net.host.name and series that don't are treated as two different resources, etc.

I spent some time reading about the memory overhead associated with different choices in Redis. Redis maintains a dictionary of keys, and each key entry adds an overhead of 50-70 bytes beyond its key string length. When the requirement is a bunch of k1:v1 mappings, hashes shine if the keys can be sharded to fit within hash-max-ziplist-entries. We are better off using a HyperLogLog per signal for total cardinality and total resources than maintaining counters ourselves. That leaves membership checks and per-resource limits, for which we have two options at hand: 1. one global set of resourceFp:attributeFp per signal, and 2. a per-resource set of attributeFps.

Comparison between one global set and per-resource sets

One set of resourceFp:attributeFp per signal:

  • One key (e.g. "...:resources:attributes")
  • Up to 3 million entries total

Memory for each entry

  • Each entry is a string combining two 64-bit integers in decimal plus a colon
  • Length up to 41 characters (20 digits + colon + 20 digits) → ~40 bytes of actual string data
  • Plus typically ~50–70 bytes per entry for the internal hashtable
  • So each entry could be ~90–110 bytes total (string + overhead)

For 3 million entries, at ~100 bytes each = 300 MB

Separate Set per Resource:

  • In the worst case there are 8192 sets, one for each resource
  • Up to 3 million entries total

Memory for each entry

  • Each set member is an attributeFp (a 64-bit integer), ~8–16 bytes
  • Plus typically ~50–70 bytes per entry for the internal hashtable
  • So each entry could be ~60–90 bytes total (member + overhead)

Overhead of Multiple Keys

  • Up to 8192 keys
  • Typically ~50–70 bytes of dictionary overhead per key: 8192 * 60 ≈ 0.5 MB
  • Key string lengths: 8192 * 100 ≈ 0.8 MB
  • Total overhead from keys for the signal is 0.5 + 0.8 ≈ 1.3 MB

Total memory usage: ~210 MB (3 million entries at an average of ~70 bytes each) + ~1.3 MB of key overhead ≈ 211 MB

In conclusion, the per-resource sets use less memory because we don't repeat the resource fingerprint for each member of the set. I believe we should go with the per-resource sets; see the sketch below. I don't expect any current user to come close to any of these limits. (We will probably revisit this with a metrics name/type-based scheme to handle histograms better.)
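
To make that concrete, a hedged go-redis sketch combining the chosen per-resource sets with HyperLogLogs for the total-resource and total-cardinality counts mentioned earlier; the key names and function shape are illustrative, not the merged implementation:

package metadata

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// recordFingerprint records one resourceFp/attrFp pair. prefix is the
// per-signal "%s:metadata:%s:%d" key prefix (an assumption).
func recordFingerprint(ctx context.Context, rdb *redis.Client, prefix string,
	resourceFp, attrFp uint64) error {

	// HLLs give approximate totals at ~12 KB per key, so we don't
	// maintain exact counters ourselves.
	if err := rdb.PFAdd(ctx, prefix+":hll:resources", resourceFp).Err(); err != nil {
		return err
	}
	pair := fmt.Sprintf("%d:%d", resourceFp, attrFp)
	if err := rdb.PFAdd(ctx, prefix+":hll:total", pair).Err(); err != nil {
		return err
	}

	// Membership lives in the per-resource set (the chosen design):
	// the resource fingerprint is in the key, not repeated per member.
	setKey := fmt.Sprintf("%s:resources:%d", prefix, resourceFp)
	return rdb.SAdd(ctx, setKey, attrFp).Err()
}

// Limit checks read the approximate totals, e.g.:
//   total, _ := rdb.PFCount(ctx, prefix+":hll:total").Result()
//   if total > maxTotalCardinality { /* reject */ }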

@srikanthccv srikanthccv merged commit 6ef8aad into main Jan 18, 2025
4 checks passed
@srikanthccv srikanthccv deleted the cache-limits branch January 18, 2025 17:48