-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
slow response on metrics http endpoint #194
Comments
Hello Tim, The only case where we've heard of those kinds of timings is when having a very large amount of metrics (tens of thousands), using the DirectFileStore. The output you report in the Foreman issue has over 3000 lines, so it does look like you're getting near the point where that starts being an issue. The suggestion Lukas gave you on that ticket makes sense to me, as having fewer buckets for histograms should result in fewer metrics.
This would help us diagnose if there's some quick wins. However, the sad news is that as the library stands, you may not be able to have as large a number of metrics as you currently have, or you may need to make adjustments to allow for really long render times of the /metrics endpoint (such as reducing the scraping frequency from Prometheus). It saddens me to say so, but unfortunately this is as fast as we've been able to make the library at the moment. We would like to make it faster, of course, but we don't currently have any clear plans on how we could achieve a significant difference (we're talking an order of magnitude faster here to make a real difference). It may be necessary, if there are no other choices, to reduce the number of metrics you are reporting on. The easiest way to do this would be to reduce the cardinality of your labels. For example, if you report on the controller but not the action, that could reduce the number somewhat. Or, alternatively, reduce the number of metrics. If there are some of those you are not currently using in your dashboards / alerts / etc, you could remove them. I'm aware this is not a very happy suggestion, but i'm afraid at the moment I don't have a better answer. I'm also aware that you may or may not be able to modify how these metrics are reported. (I'm not familiar with Foreman, sorry). This may require modification to Foreman to make the metrics it reports more customizable, if it's not currently possible. With the information at hand, this is sadly the best answer I have. Maybe if we could get some more information, like a profile of what's happening when the metrics get scraped, we might notice some anomaly based on that. This again may be something that The Foreman project may need to help you with. Let me know the answers to the questions above, and hopefully we can find a way of making this faster for you! |
One more thing I remembered. When you have lots of metrics, you will end up with lots of files on disk. (Lots of metric names, that is, this isn't affected by labels and their cardinality). I don't know for sure, but I think this doesn't help with the export performance, and I want to have an implementation of a store that has a single file per process, rather than one per process/metric. I expect this would perform less well when observing samples, but better when exporting. PR #161 was a first attempt at this, trying to add the functionality to the existing store, which had a lot of problems documented in there. We probably need to make a separate store, which feels a bit less clean, or refactor this one heavily. If anyone is interested in contributing to this effort, it'd be highly appreciated. It's still in our list of things to do, but we haven't had time yet. |
Hi @dmagliola, DEFAULT_BUCKETS = [1, 5, 20, 50, 100, 250, 500, 2000].freeze which resulted in: real 0m46.964s
user 0m0.005s
sys 0m0.015s real 0m48.860s
user 0m0.002s
sys 0m0.016s real 0m53.635s
user 0m0.004s
sys 0m0.019s I again changed the bucket to: DEFAULT_BUCKETS = [100, 500, 2000].freeze and restarted foreman. Now three curls again: real 1m7.207s
user 0m0.002s
sys 0m0.013s real 0m56.515s
user 0m0.003s
sys 0m0.014s real 0m57.622s
user 0m0.004s
sys 0m0.011s during the /metrics call I see that one to three puma threads run at 100% cpu time, but the machine has enough free cores, memory and IO. Sadly I don't have a dedicated test environment so there is always some production load while I test. I use the default prometheus backend, currently with prometheus 2.19.0. But scraping for foreman is currently not enabled, I just tried it with curl. Since the response takes so long I wanted to be sure that scraping it doesn't decrease the overall application performance. I know that puma has some tuning options for processes/threads, but I'm no Ruby/Puma/Rack expert and can't really tell what's the best way to go here. the default foreman installation provides these settings to puma: Environment=FOREMAN_PUMA_THREADS_MIN=0
Environment=FOREMAN_PUMA_THREADS_MAX=16
Environment=FOREMAN_PUMA_WORKERS=2 That didn't work for my workload and I changed it to: Environment=FOREMAN_PUMA_THREADS_MIN=0
Environment=FOREMAN_PUMA_THREADS_MAX=24
Environment=FOREMAN_PUMA_WORKERS=4 I'm happy to change this to whatever you suggest. I know that some of the reported metrics aren't required for my personal usecase and I can live without them. I'm not sure if foreman has an option to disable/ filter them. I will investigate that and report back as soon as I've more information. Please let me know if you've any questions. suggestions or need more data from me. |
Would you be able to look at the text output of My working hypothesis here is that reducing the number of buckets should significantly reduce the number of time series. Given my previous expectation that rendering time is somewhat proportional to number of time series, I would've expected a measurable drop in rendering time. The rendering time didn't drop, which makes me think of 2 obvious possibilities:
I want to make sure it's not #2 happening.
yes, I would expect this. While we're reading from files, I would expect that most of the time is spent mutating the hashes read from the files to aggregate all the processes, and/or turning those resulting hashes into the text output. We've put most of our performance effort in observing samples real fast, not on exporting them, so I don't have actual evidence that that is true, I haven't measured it... So this may be very wrong, but that's my expectation given what the code needs to do.
Prometheus by default uses the Synchronized store, which is in RAM and does not support multi-processing. As it stands, I don't have any obvious recommendations. The next step to move this forward would be to replicate a use case with those same metrics / label counts, with files from 4 processes, run it in development and profile it. And then start optimizing the hot paths. We are unlikely to have time to work on this "soon", although I would really like to. It would be amazing if someone has time to contribute to this effort. One final thought... You could (emphasis on could, not necessarily should) post-facto remove the metrics you don't want from the registry. The metrics would still exist, they'd still get incremented, but they wouldn't be exported. Now, this goes against the design of the system, and it could break something (even though I'm assuming it probably wouldn't). Tread carefully if doing that, and I don't recommend it as a permanent solution... But it could be a quick test of "does this get significantly faster if I remove the metrics I don't want". If the answer is yes, maybe this is something that could be changed in Foreman, letting you pick and choose. I'm not sure whether this is at all interesting to the Foreman team, and i'm aware I'm bouncing you back to them, when it's this piece of code that is slower than it should be. But sadly I don't have any quick wins. We would need to find a way to significantly speed up the Prometheus Client. I don't have any ideas currently of how we could do that without a good amount of time investment into it, and i'm not even sure the kind of order-of-magnitude improvement you'd need is possible given the constraints of the multi-process environment. |
Hello guys,
I can confirm that. What I am going to do is two changes:
Our telemetry approach was rather simple - I thought just throw all metrics at the user and let them to dig interesting data from it. Looks like we need to cherry pick a decent set. There is actually a way out - in Foreman you can also use statsd, then use prometheus statsd exporter which will load the data over UDP and export them over HTTP. It's written in Go, it should be pretty fast and also UDP transport actually solves the multi-process issue nicely. The packets stack up in kernel buffer until they are processed. |
thanks again for the detailed response! the amount of metrics indeed reduces with different bucket sizes. With: DEFAULT_BUCKETS = [1, 5, 20, 50, 100, 250, 500, 2000].freeze I get 3333 lines of output. With: DEFAULT_BUCKETS = [100, 500, 2000].freeze I get 2522 lines of output. I got a response in https://projects.theforeman.org/issues/30029#change-137579 about disabling metrics which seems to be less ugly/bad than your mentioned way. I will try that as the next step and see if I can reduce the amount of metrics |
This sounds great.
That sounds like the right approach. This is what i'd normally do initially... Until some limit is hit.
Ok. Thanks for confirming that. One more thought, which would almost certainly explain bad performance... Are you running this in an ephemeral filesystem (for example, in a container?), or are files persistent between runs of Foreman in production? The reason i'm asking is these metrics are being kept in files in A bit more information about this and how to fix it here: Check the "Resetting your metrics on each run" sub-heading. |
Good point, we have actually implemented metric label allow-list and we only allow the key labels now. Also, during startup we delete all |
For the record, after chat with @dmagliola it's clear what is the problem - multi-process servers (e.g. Puma/Passenger and others) do recycle worker processes quite often and although I thought it should be possible to simply delete those temporary files when process exit since I had an impression that the implementation holds the data in memory. However @dmagliola explained to me that this would ultimately lead to loss of information as data is not squashed or aggregated anywhere. Indeed, this would break counters which are expected to be monotonic. Unfortenately, any kind of data "squashing" will be challenging to implement. I suggest to keep this issue opened until this is resolved. I am going to link this issue from our docs warning users about this problem - we do have a second option how to export data via statsd (UDP packets) which can be used with |
Is this still relevant? I'm starting to implement Prometheus-based metrics at our site and my first tries didn't show this problem. Also, the PR is already merged. So, @bastelfreak is this still happening with the latest Foreman? (We're on 3.3.0) |
Hey, |
Oh, okay. Then I'll just have to try. Thanks. 😄 I will send you an update if everything goes smooth so you can close the issue. |
@dploeger any follow up on this? We're looking to spin up some Foreman metrics as well and would be grateful if I don't have to test this out 😅 |
@fbegyn Ha. Thanks for asking me that. I just realized, that prometheus wasn't enabled anymore in our environment. I'll enable it again and will let you know. |
I've re-enabled the check last week and can't see any problems. Will enable it on our prod system today. |
Hi people,
I use The Foreman in version 2.0.0. That version bundles the prometheus-client gem in version 1.0.0. A simple call to the /metrics endpoint takes 60 to 75 seconds. I raised in issue for that already at the foreman project: https://projects.theforeman.org/issues/30029
They advised me to create an issue here as well. I'm not really sure how I should debug this. The whole stack runs on a dedicated server with CentOS 7. I don't see any CPU/IO peaks.
I'm aware of the fact that those information aren't really useful to debug the issue. Can you please tell me which infos you need / how I should debug this? I'm happy to debug this for you or discuss it in any chat.
Cheers, Tim
The text was updated successfully, but these errors were encountered: