Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus: fix offline worker metrics #1135

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

Tomasz-Kluczkowski
Copy link
Contributor

@Tomasz-Kluczkowski Tomasz-Kluczkowski commented Aug 9, 2021

@mher A fully working (tested with Prometheus/Grafana) code now pushed. All metrics for offline workers are removed and grafana eventually runs out of data for its graphs and removes all the plots for non existent workers.

Tests added.
My own comments added for guidance for the reviewer.

Please let me know if this works.
Would be nice if someone actually run this and observe graphs in grafana....

Log when running this after celery worker is stopped:

[I 210815 23:22:43 web:2239] 200 GET /metrics (127.0.0.1) 3.92ms
[I 210815 23:22:58 web:2239] 200 GET /metrics (127.0.0.1) 3.41ms
[I 210815 23:23:12 web:2239] 200 GET /dashboard?json=1&_=1629066192482 (::1) 0.53ms
[I 210815 23:23:13 web:2239] 200 GET /metrics (127.0.0.1) 3.85ms
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560', 'task-started', 'tasks.add') for metric counter:flower_events
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560', 'task-received', 'tasks.add') for metric counter:flower_events
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560', 'task-succeeded', 'tasks.add') for metric counter:flower_events
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560', 'task-failed', 'tasks.add') for metric counter:flower_events
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560', 'tasks.add') for metric histogram:flower_task_runtime_seconds
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560', 'tasks.add') for metric gauge:flower_task_prefetch_time_seconds
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560', 'tasks.add') for metric gauge:flower_worker_prefetched_tasks
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560',) for metric gauge:flower_worker_online
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560',) for metric gauge:flower_worker_number_of_currently_executing_tasks
[I 210815 23:23:28 web:2239] 200 GET /metrics (127.0.0.1) 6.30ms

… transient prometheus labels.

This is just a sketch.
Tested positive that a label is removed and graph eventually disappears from prometheus (after old data is no longer in range).

Requires moving commond logic into a separate EventsService.
…hboard.

Moved code to generate current state dashboard into EventsState object.
Start using static typing in the files.
Moved decision making and filtering completely to EventsState object.
Now we need one call to get current, online workers data.
The amount of code for the metrics and its place inside of EventsState object was wrong making testing very hard.
Moved all code related to the metrics outside of events.py file into prometheus_metrics.py.

Add methods to extract label sets that belong to offline workers and remove them from metric.
No matter how hard I tried not to access any private attribute of the metric itself I ended up needing it somewhere.
For now I am just using metric._labelnames as it is highly unlikely that it will change.
If we have offline workers we will remove metrics for them.
Now metrics related code is in prometheus_metrics.py.
The EventState object is responsible for gathering offline/online workers data and filtering it for the dashboard view and also
delegates the removal of metrics for offline workers to the PrometheusMetrics object.
Mock path needed changing after code responsible for filtering based on options was moved into the EventsState object.
if worker is None or worker not in offline_workers:
continue

label_sets_for_offline_workers.add(tuple([labels[label_name] for label_name in metric._labelnames]))
Copy link
Contributor Author

@Tomasz-Kluczkowski Tomasz-Kluczkowski Aug 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is important to generate label values in correct order as in the ._labelnames, otherwise we will not remove it from the metric (KeyError will be raised which we catch and pass). That's why I access this private attribute here. It is unlikely they will change implementation and remove this attribute... I think it is small enough smell to let it exist for now.

If I find the time, I would like to make it into a read-only property of a metric in prometheus_client project itself - then I will amend this here or even better add a method in prometheus_client to get current label sets as well...

offline_workers=offline_workers
)

def get_online_workers(self) -> Dict[str, Any]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably notice I started introducing typing into the project.
The benefits are:

  • any modern IDE gives you nice autocompletion
  • easier reading/understanding inputs and outputs especially between multiple methods/objects

try:
metric.remove(*label_set)
logger.debug('Removed label set: %s for metric %s', label_set, metric)
except KeyError:
Copy link
Contributor Author

@Tomasz-Kluczkowski Tomasz-Kluczkowski Aug 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the metric could already be removed in another thread, it's best to just pass

@Tomasz-Kluczkowski
Copy link
Contributor Author

@mher I pushed more code - please have a quick look and let me know if this solution is acceptable, then I will add all the missing tests.
All was tested manually multiple times with actual Prometheus/Grafana/Celery/Flower running.

for sampled_metric in sampled_metrics:
for sample in sampled_metric.samples:
labels = sample.labels
worker = labels.get(LabelNames.WORKER)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to use enums instead of strings to avoid typo problems hence I used LabelNames enum in metric definitions and use it here to make sure we access the correct one. If you think it's too much....we can always go back to strings.

)

@property
def transient_metrics(self) -> List[MetricWrapperBase]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we think that this should be configurable, maybe we could make a second PR and add it as an additional feature. For now I think it makes total sense just to wipe any existing records from metrics for label sets containing offline workers. Otherwise the graphs are polluted with a reading from like a few days ago...

@Tomasz-Kluczkowski Tomasz-Kluczkowski changed the title WIP: Prometheus fix offline worker metrics Prometheus: fix offline worker metrics Aug 15, 2021
def _get_label_sets_for_offline_workers(
metric: MetricWrapperBase, offline_workers: Set[str]
) -> Set[Tuple[str, ...]]:
sampled_metrics = metric.collect()
Copy link
Contributor Author

@Tomasz-Kluczkowski Tomasz-Kluczkowski Aug 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.collect() gets a Metric object with samples. The samples have a labels dictionary representing the label name to label value. We are interested in getting those values and checking if the one under key 'worker' is in our offline_workers set.

Not the prettiest way of getting this data but collect is the only common, public method of all the metric types we use that provides any info like this...

I hope I can at some point bake it into the metrics themselves.

Mention that prometheus metrics are also removed based on this setting.
@danyi1212
Copy link

@mher Can please provide details when this will be added?

@drummerwolli
Copy link

@Tomasz-Kluczkowski can you update your branch? then at least it will be possible for people to run (and test) your fork until this finally gets merged.

this not being fixed yet is a real pity. would love to see this move to the main branch and get properly released.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants