Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

One file per process instead of per metric/process for DirectFileStore [DOESNT WORK] #161

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

dmagliola
Copy link
Collaborator

This doesn't work!!

I'm just leaving it here as documentation in case someone tries to do this to solve #143
Read the commit comments for more.

NOTE: I still think this is the right thing to do (one file per process, rather than per metric/process)
It'll have more contention in multi-threaded scenarios, but it'll drastically reduce the number of files

NOTE2: We also need to add a Mutex around that shared Dict, if we're having the shared Dict! The @lock in in_process_sync needs to be shared by all MetricStores

Anyway, i'm back to thinking we should either make a second DirectFileStore that does this (and have two separate files, instead of a param), or considerably refactor the DirectFileStore. My quick stab definitely doesn't work

Daniel Magliola added 2 commits October 14, 2019 12:33
This was a quick experiment on having all metrics for each process on
the same file, because it seemed like it would be easy. (just do these
things in this commit)

However, even though tests pass, this doesn't actually work. See next commit on why.
If the change were just this, i'd say we should do this.

However, as you'll see in the next commit, it's more involved than that,
and i'm not sure it's worth doing, at least not with this approach...

I'm just pushing this up so it doesn't get lost.
So, again, I don't propose we actually do this, this is just illustrative
of what we need to do, and didn't want it to get lost.

The problem with the previous commit is that for each metric, we open a
new FileMappedDict. All of them sharing the same file is "fine", actually.
The problem comes from the fact that each instance of FileMappedDict has
their own version of `@used`. When one metric adds a new entry to the file,
it moves its own `@used` pointed, but the other instances don't notice this.

As soon as a second instance that had already been loaded adds a new labelset,
it corrupts the file.

This only shows up when running the benchmarks, with histograms. Counters
(in our benchmark) don't reproduce it, because each metric in our benchmark
only has one labelset. They all get added initially to the file, and an
already loded Dict doesn't add new ones. The reason Histograms reproduce it
is that new observations may report a bucket that wasn't seen before, which
is effectively a new labelset.

The solution is for all MetricStore instances to share one single FileMappedDict.
This is not particularly hard, but the current MetricStore is not written
in a way that allows that without doing this crap.

We should find a better way. I'm just leaving this here as documentation
of this problem, for the next brave soul attempting this
@coveralls
Copy link

Coverage Status

Coverage decreased (-16.0%) to 84.042% when pulling 149fc3c on all_metrics_one_file into af29066 on master.

2 similar comments
@coveralls
Copy link

Coverage Status

Coverage decreased (-16.0%) to 84.042% when pulling 149fc3c on all_metrics_one_file into af29066 on master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-16.0%) to 84.042% when pulling 149fc3c on all_metrics_one_file into af29066 on master.

@saviogl
Copy link

saviogl commented Mar 24, 2021

@dmagliola Do you know if there's a working model of this proposed approach somewhere, I'd like to give it a try if possible? We are currently running into this issue due to the use of unicorn_worker_killer and the number of files that end up getting created whenever a new worker process gets spun up. Over time the /metrics endpoint becomes very slow and memory usage increases unboundedly.

@dmagliola
Copy link
Collaborator Author

Hi @saviogl, there isn't sadly one.
This is next on my "things to fix" list, and i'm hoping to get to it "in the next couple of weeks", but I don't want to under-deliver on that promise.

The way to try to solve this is to declare your own class that exposes the FileStore interface, and you'd point to that when declaring metrics.

The key thing to keep in mind here is that, while you'll have a lot of MetricStores, they all need to point to the same, shared instance of FileMappedDict, which you'd define at the FileStore level, not in the MetricStores as we do in DirectFileStore.

It's a non-trivial change, but that's basically what needs to happen.

The challenge for me is I wanted to offer the option: you can have file per process, or file per process-metric.
Unfortunately, you can't do that with a flag as this PR proposes (or not without making the code a mess), so we'll probably have to either choose one or the other to offer, or offer two separate DataStores to pick from, which have a lot of code duplication.

I'm sure there's a better way i'm missing, but anyway, that's my thinking on this.

@saviogl
Copy link

saviogl commented Mar 24, 2021

@dmagliola Just trying to understand better whether or not this will actually solve our issue. While the amount of files is a problem to be had (maybe soon), the biggest issue right now is around memory consumption due to the fact we are "constantly" reaping unicorn worker processes for graceful memory management and therefore creating new "metrics" per processes. Moving to "One file per process instead of per metric/process for DirectFileStore" would tame the number of files created but the amount of data (metrics per process) would remain the same "correct"? And therefore we would still have an unbounded growth in regards to metrics -> memory for exporting those?

Is my assumption correct?

If I'm understanding the current model correctly, the solution for the memory unbounded growth would be to have processes to write the same file/metric with some process concurrent control system in place right?

@Sinjo
Copy link
Member

Sinjo commented Mar 25, 2021

While the amount of files is a problem to be had (maybe soon), the biggest issue right now is around memory consumption due to the fact we are "constantly" reaping unicorn worker processes for graceful memory management and therefore creating new "metrics" per processes.

If I've understood you right this comment in #214 is the situation running into, and you're generating a tonne of metric series by churning through unicorn workers. Whatever we come up with in there won't help you when it comes to scraping metrics becoming slow because of a large number of files, but might help you reduce the number of metrics (and ultimately timeseries in your Prometheus instance) produced.

@wizardmatas
Copy link

FYI: on GitLab similar implementation is used: https://gitlab.com/gitlab-org/prometheus-client-mmap

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants