Race condition with worker_last_seen #376

@dakotak

I have been seeing KeyError exceptions in our logs:

  File "/app/src/http_server.py", line 32, in metrics
    current_app.config["metrics_puller"]()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/app/src/exporter.py", line 155, in scrape
    self.track_timed_out_workers()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/app/src/exporter.py", line 217, in track_timed_out_workers
    self.purge_worker_metrics(hostname)
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
  File "/app/src/exporter.py", line 197, in purge_worker_metrics
    del self.worker_last_seen[hostname]
        ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
KeyError: 'some_hostname'

I was considering opening a PR to change the del statement to self.worker_last_seen.pop(hostname, None), but I am not sure that covers everything. Is there a larger concurrency issue here?
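
To make the proposal concrete, here is a minimal sketch of the tolerant removal. The exporter's actual code is not shown in this issue, so this is an assumption-heavy stand-in: the WorkerTracker class, the mark_seen method, and the lock are hypothetical, and only the names worker_last_seen, purge_worker_metrics, and track_timed_out_workers come from the traceback. The lock is there only on the assumption that two code paths might touch the dict at the same time.

```python
import threading


class WorkerTracker:
    """Hypothetical stand-in for the exporter's worker bookkeeping."""

    def __init__(self):
        self.worker_last_seen = {}
        # Assumption: a lock is not known to exist in the real exporter.
        self._lock = threading.Lock()

    def mark_seen(self, hostname, ts):
        # Hypothetical helper: record the last time a worker was seen.
        with self._lock:
            self.worker_last_seen[hostname] = ts

    def purge_worker_metrics(self, hostname):
        # pop() with a default does nothing if the key is already gone,
        # whereas `del d[hostname]` raises KeyError in that case.
        with self._lock:
            return self.worker_last_seen.pop(hostname, None) is not None

    def track_timed_out_workers(self, now, timeout):
        # Take a snapshot of stale hostnames under the lock, then purge
        # outside of it so the non-reentrant lock is never taken twice.
        with self._lock:
            stale = [
                hostname
                for hostname, ts in self.worker_last_seen.items()
                if now - ts > timeout
            ]
        for hostname in stale:
            self.purge_worker_metrics(hostname)
        return stale
```

With this shape, purging a hostname that another path has already removed returns False instead of raising. It does not answer whether a second purge is expected behavior or a symptom of a deeper ordering problem.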

We are also using the --generic-hostname-task-sent-metric option.
