Prometheus: fix offline worker metrics #1135
base: master
Conversation
… transient prometheus labels. This is just a sketch. I verified that a label is removed and the graph eventually disappears from Prometheus (once the old data is no longer in range). Requires moving common logic into a separate EventsService.
…hboard. Moved the code that generates the current-state dashboard into the EventsState object. Started using static typing in these files.
Moved decision-making and filtering entirely into the EventsState object. Now a single call gets the current online-worker data.
The amount of code for the metrics, and its placement inside the EventsState object, was wrong and made testing very hard. Moved all metrics-related code out of events.py into prometheus_metrics.py. Added methods to extract the label sets that belong to offline workers and remove them from the metric. No matter how hard I tried not to access any private attribute of the metric itself, I ended up needing one somewhere. For now I am just using metric._labelnames, as it is highly unlikely to change.
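For context, here is a minimal sketch of the extraction step described above, assuming plain prometheus_client metrics; the helper name and the hard-coded 'worker' label are illustrative, not necessarily the PR's exact code:

```python
from typing import Set, Tuple

from prometheus_client.metrics import MetricWrapperBase


def label_sets_for_offline_workers(
    metric: MetricWrapperBase, offline_workers: Set[str]
) -> Set[Tuple[str, ...]]:
    """Collect label value tuples of samples that belong to offline workers."""
    label_sets = set()
    for sampled_metric in metric.collect():
        for sample in sampled_metric.samples:
            if sample.labels.get('worker') in offline_workers:
                # Values must follow the metric's own label-name order,
                # or a later metric.remove(*values) raises KeyError.
                label_sets.add(
                    tuple(sample.labels[name] for name in metric._labelnames)
                )
    return label_sets
```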
If we have offline workers, we remove their metrics.
Metrics-related code now lives in prometheus_metrics.py. The EventsState object is responsible for gathering offline/online worker data and filtering it for the dashboard view; it also delegates the removal of metrics for offline workers to the PrometheusMetrics object, roughly as sketched below.
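A sketch only of that split of responsibilities, with the delegating method name assumed rather than taken from the diff:

```python
from typing import Set


class EventsState:
    """Gathers worker state; delegates metric cleanup to PrometheusMetrics."""

    def __init__(self, metrics):
        self.metrics = metrics  # a PrometheusMetrics instance

    def get_offline_workers(self) -> Set[str]:
        # Stub: the real method derives this from worker heartbeat events.
        return set()

    def on_worker_event(self) -> None:
        offline_workers = self.get_offline_workers()
        if offline_workers:
            # The removal logic itself lives in prometheus_metrics.py.
            self.metrics.remove_metrics_for_offline_workers(offline_workers)
```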
```python
if worker is None or worker not in offline_workers:
    continue

label_sets_for_offline_workers.add(
    tuple(labels[label_name] for label_name in metric._labelnames)
)
```
It is important to generate the label values in the same order as in ._labelnames; otherwise we will not remove the set from the metric (a KeyError is raised, which we catch and pass); a tiny demonstration follows. That's why I access this private attribute here. It is unlikely they will change the implementation and remove this attribute... I think it is a small enough smell to let it exist for now.
If I find the time, I would like to turn this into a read-only property of a metric in the prometheus_client project itself; then I will amend this here, or better yet add a method to prometheus_client that returns the current label sets as well...
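A small, self-contained demonstration of the ordering point, using a throwaway metric (names are made up):

```python
from prometheus_client import Counter

c = Counter('demo_events_total', 'Demo counter', ['worker', 'type'])
c.labels(worker='w1', type='online').inc()

try:
    c.remove('online', 'w1')  # values in the wrong order
except KeyError:
    pass                      # raised, exactly as described above

c.remove('w1', 'online')      # values in _labelnames order: removal succeeds
```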
```python
    offline_workers=offline_workers
)

def get_online_workers(self) -> Dict[str, Any]:
```
You probably noticed I started introducing typing into the project.
The benefits are:
- any modern IDE gives you nice autocompletion
- inputs and outputs are easier to read and understand, especially across multiple methods/objects
```python
try:
    metric.remove(*label_set)
    logger.debug('Removed label set: %s for metric %s', label_set, metric)
except KeyError:
    pass  # the label set may already have been removed (see comment below)
```
Since the metric could already have been removed in another thread, it's best to just pass.
@mher I pushed more code - please have a quick look and let me know if this solution is acceptable, then I will add all the missing tests.
```python
for sampled_metric in sampled_metrics:
    for sample in sampled_metric.samples:
        labels = sample.labels
        worker = labels.get(LabelNames.WORKER)
```
I prefer to use enums instead of strings to avoid typo problems, hence the LabelNames enum in the metric definitions; I use it here to make sure we access the correct label (a sketch follows). If you think it's too much... we can always go back to strings.
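A minimal sketch of such an enum, assuming it mixes in str so that members can be used directly as dictionary keys; the exact members are illustrative:

```python
from enum import Enum


class LabelNames(str, Enum):
    # The str mixin makes members hash and compare like plain strings.
    WORKER = 'worker'
    TASK = 'task'


labels = {'worker': 'celery@host-1', 'task': 'tasks.add'}
assert labels.get(LabelNames.WORKER) == 'celery@host-1'
```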
Externally, we are really only interested in getting the online workers or a list of the offline ones. Moved methods around to allow easier reading (tried to keep them in call order).
…for_offline_workers.
```python
)

@property
def transient_metrics(self) -> List[MetricWrapperBase]:
```
If we think this should be configurable, maybe we could make a second PR and add it as an additional feature. For now I think it makes total sense to simply wipe any existing records from metrics for label sets containing offline workers; otherwise the graphs stay polluted with readings from a few days ago... A sketch of the property follows.
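As a sketch of what such a property might look like; the metric names and attributes here are placeholders, not the PR's actual ones:

```python
from typing import List

from prometheus_client import Counter, Gauge
from prometheus_client.metrics import MetricWrapperBase


class PrometheusMetrics:
    def __init__(self) -> None:
        self.events = Counter('flower_events_total', 'Events', ['worker', 'type'])
        self.runtime = Gauge('flower_task_runtime_seconds', 'Runtime', ['worker', 'task'])

    @property
    def transient_metrics(self) -> List[MetricWrapperBase]:
        # Metrics whose per-worker label sets get wiped once a worker goes offline.
        return [self.events, self.runtime]
```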
```python
def _get_label_sets_for_offline_workers(
    metric: MetricWrapperBase, offline_workers: Set[str]
) -> Set[Tuple[str, ...]]:
    sampled_metrics = metric.collect()
```
.collect() returns a Metric object with samples. Each sample has a labels dictionary mapping label names to label values. We are interested in those values, and in checking whether the one under the key 'worker' is in our offline_workers set. Not the prettiest way of getting this data, but collect() is the only common, public method of all the metric types we use that provides this kind of information... I hope I can at some point bake it into the metrics themselves.
Mention that Prometheus metrics are also removed based on this setting.
@mher Can you please provide details on when this will be added?
@Tomasz-Kluczkowski can you update your branch? Then at least it will be possible for people to run (and test) your fork until this finally gets merged. This not being fixed yet is a real pity; would love to see this move to the main branch and get properly released.
@mher Fully working code (tested with Prometheus/Grafana) is now pushed. All metrics for offline workers are removed, and Grafana eventually runs out of data for its graphs and drops the plots for non-existent workers.
Tests added.
I added comments to guide the reviewer.
Please let me know if this works.
It would be nice if someone actually ran this and observed the graphs in Grafana...
Log when running this after a Celery worker is stopped: