Prometheus stats "stuck" on last value seen for transforms using aggregations (vector_utilization) #20216
Thanks for filing this @johnhtodd! We believe this is because the utilization metric is only updated when events are actually flowing through the component.
Perhaps a timer of "last observed event" that gets checked, and if the timer is >1 second (or some other interval) then the value is set to zero. I don't know how the value is calculated, so that may be too primitive, but it seems like a quick fix. That also implies that there is some housekeeping loop that will even be able to check the timer - if the values are set only upon an event passing through the component, then this might be more complex to solve.
Agreed, I think the fix could look something like that. If anyone is interested in taking a shot at this, the fix would likely be somewhere in https://github.com/vectordotdev/vector/blob/7b85728c474abc2ff691624ac253ff1777d450b7/src/utilization.rs
I am not 100% sure, but I think the second is true - I don't think anything happens unless an event is passing, so this might be a bit more complicated than just making a quick fix in src/utilization.rs.
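For illustration, a minimal sketch of that "last observed event" timer idea, using hypothetical names (`UtilizationState`, `on_event`, `idle_after`) rather than Vector's actual types: the published value falls back to zero once no event has been seen for the chosen interval.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-component utilization state (illustrative names,
/// not Vector's actual internals).
struct UtilizationState {
    last_event_seen: Instant,
    utilization: f64,
}

impl UtilizationState {
    /// Called whenever an event passes through the component.
    fn on_event(&mut self, utilization: f64) {
        self.last_event_seen = Instant::now();
        self.utilization = utilization;
    }

    /// Value to publish: if no event has been observed within `idle_after`,
    /// treat the component as idle and report zero instead of repeating
    /// the last value seen during the previous burst.
    fn current(&self, idle_after: Duration) -> f64 {
        if self.last_event_seen.elapsed() > idle_after {
            0.0
        } else {
            self.utilization
        }
    }
}
```

As noted above, something still has to call `current` when no events are flowing, so this check only helps if a periodic task (or the scrape handler itself) reads the value.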
This may be a bad idea, so no problem if it is dismissed: how about writing a timestamp to each metric at every Prometheus scrape? If the metric is updated by an event, that timestamp is erased. If the next scrape comes around and the timestamp still exists, the metric is zeroed out or marked as "stale", because there have been no "natural" updates. This would mean that a utilization metric would only get "stuck" for one Prometheus scrape interval.
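A rough sketch of that scrape-driven variant, again with hypothetical names: the scrape handler sets a staleness mark that any event clears, so a value that survives two consecutive scrapes untouched is reported as zero.

```rust
/// Hypothetical gauge wrapper for the scrape-driven staleness idea
/// (illustrative only, not Vector's actual metric types).
struct ScrapeAwareGauge {
    value: f64,
    stale: bool,
}

impl ScrapeAwareGauge {
    /// Called whenever an event updates the metric: store the new value
    /// and clear the mark left by the previous scrape.
    fn on_event(&mut self, value: f64) {
        self.value = value;
        self.stale = false;
    }

    /// Called on every Prometheus scrape. If the mark set by the previous
    /// scrape is still present, no event has touched the metric since
    /// then, so report zero; otherwise report the stored value. Either
    /// way, set the mark for the next scrape to check.
    fn on_scrape(&mut self) -> f64 {
        let reported = if self.stale { 0.0 } else { self.value };
        self.stale = true;
        reported
    }
}
```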
…e it is regularly published
This adds a separate task that runs periodically to emit utilization metrics and collect messages from components that need their utilization metrics calculated. This ensures that the utilization metric is published even when no events are running through a component.
Fixes: vectordotdev#20216
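As a rough illustration only (not the actual patch), a separate publishing task along those lines might look like the following, assuming Tokio and hypothetical names such as `UtilizationSample` and `utilization_publisher`; the real change lives in `src/utilization.rs` and uses Vector's internal metrics machinery rather than `println!`.

```rust
use std::collections::HashMap;
use std::time::Duration;
use tokio::{sync::mpsc, time};

/// Hypothetical message sent by a component whenever it does work
/// (illustrative only, not Vector's actual internals).
struct UtilizationSample {
    component_id: String,
    busy_fraction: f64,
}

/// A separate task that wakes on a fixed interval, drains whatever samples
/// components have sent, and emits a gauge for every known component,
/// including ones that sent nothing this tick; those would otherwise stay
/// "stuck" on their last value.
async fn utilization_publisher(mut rx: mpsc::Receiver<UtilizationSample>) {
    let mut latest: HashMap<String, f64> = HashMap::new();
    let mut ticker = time::interval(Duration::from_secs(1));

    loop {
        tokio::select! {
            Some(sample) = rx.recv() => {
                latest.insert(sample.component_id, sample.busy_fraction);
            }
            _ = ticker.tick() => {
                for (component_id, value) in latest.iter_mut() {
                    // Stand-in for emitting Vector's utilization gauge.
                    println!("utilization{{component_id=\"{component_id}\"}} = {value}");
                    // Reset so an idle component reads as 0 on the next tick
                    // instead of repeating its last busy value.
                    *value = 0.0;
                }
            }
        }
    }
}
```

The key design point is that the emit path is driven by the ticker rather than by events, so idle components still get a fresh (zero) reading every interval.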
Problem
It appears that using aggregations creates conditions where vector_utilization Prometheus metrics become "stuck" on the last value seen in a burst of events, even though there are no current events in the event queue. The last value shown also seems to be unexpectedly high, causing significant monitoring inaccuracies. All transforms downstream from the aggregation seem to show the same "stuck" problem.
Configuration
Version
vector 0.37.0 (x86_64-unknown-linux-gnu 4279bf0 2024-03-26 04:01:58.800851552)
Debug Output
No response
Example Data
No response
Additional Context
We perform intermittent metric_to_log executions, with many minutes (or sometimes hours) in between flushes of the aggregated metrics that are then transformed into log outputs. Some of these logs are then sent through other transforms before finally being sent to a sink.
We track the usage of all activity with the Prometheus metric of "vector_utilization".
We see that during activity (once every X hours) the vector_utilization values for the various metric_to_log transforms and downstream transforms show what we would expect as "normal" data, but then just as the final data is pushed out to the sink, the utilization number for each of these metric_to_log and remap components rises dramatically (to ~0.67, but that is variable) and then stays there until the next cycle. It is as if there is a spike in the utilization on the very last event processed through the queue, and then that last value is "remembered" until the next execution. These stuck values all seem to share one thing: they involve aggregations, with delays in between the bursts of events that are sent downstream.
This really throws off our understanding of what is happening with Vector, as it creates several "ghost" components with exceptionally high utilization that is not actually occurring. Or is it? The utilization number stays exactly the same until the next execution, which tells me that these are almost certainly artifacts and not real. However, when we try to do things like add up all the vector_utilization figures on a system to understand behavior, the artifacts make that total impossible to interpret.
I enclose a Grafana graph of two lines - the top purple line shows a metric_to_log that is a consumer of an aggregation, and the bottom red line is a remap transform that is a consumer of the metric_to_log. From 15:02:00 to 15:02:30, there was a "flush" of aggregation messages through the pipeline, and we see the dip in the vector_utilization numbers for both components. Then at 15:02:45, the queue was completed, and the last event caused the vector_utilization value for each of the two components to jump back up to around 0.69; those numbers then stay exactly the same until the next flush of aggregation events some hours later.
References
No response