Operator does not notice an interrupted backup when pod is evicted #2281

@horihel

Report

We've seen this a few times now on our clusters.

It looks like psmdb-operator does not verify with PBM whether a backup it considers running is still in progress. If PBM gets interrupted (in our case due to a node drain), psmdb-operator may continue to believe the backup is running.

Unfortunately this prevents any following backup jobs from running until the stale resource is deleted manually.

More about the problem

This only seems to happen if the whole pod gets removed. We have had several occasions where backups failed for other reasons (for example, mongod getting OOMKilled during the backup), and as long as the pbm container stays alive, the operator gets notified.

We're also not sure whether this is connected to psmdb-operator itself being moved to another node during the operation, which might mean it can't "see" the transition from "running" to "interrupted/killed/error".

PBM itself is fully aware that no backup is running and that the operation was interrupted. The operator just doesn't notice, and doesn't seem to actively synchronize the backup status with PBM.
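To illustrate the kind of synchronization we are missing, here is a minimal sketch of a status check, not the operator's actual code. `PbmView`, `reconcile_state`, and all state strings ("running", "ready", "error", "done", "canceled") are hypothetical names chosen for illustration; the real operator and PBM may use different types and values.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed mapping from a terminal PBM status to the state the backup
# resource should end up in. These strings are illustrative only.
TERMINAL_PBM = {"done": "ready", "error": "error", "canceled": "error"}


@dataclass
class PbmView:
    """What PBM reports right now (hypothetical shape, not PBM's real API)."""
    running_backup: Optional[str]   # backup PBM is currently executing, if any
    recorded_status: Optional[str]  # status PBM stored for our backup, if any


def reconcile_state(cr_state: str, backup_name: str, pbm: PbmView) -> str:
    """Return the state the backup resource should have after asking PBM."""
    # Only resources the operator believes are running need this check.
    if cr_state != "running":
        return cr_state
    # PBM confirms it is still working on this backup.
    if pbm.running_backup == backup_name:
        return "running"
    # PBM recorded a terminal status that the operator missed.
    if pbm.recorded_status in TERMINAL_PBM:
        return TERMINAL_PBM[pbm.recorded_status]
    # PBM is not running the backup and has no terminal status for it:
    # treat it as interrupted instead of leaving it "running" forever.
    return "error"
```

With a check like this on every reconcile, a backup whose pod was evicted would move to "error" the next time the operator looks at it, even if the operator never observed the transition itself.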

Steps to reproduce

  1. Have a long-running (1h+) backup in progress - we've tested physical backups, but I don't think the type matters
  2. AKS planned maintenance kicks in and starts draining and replacing nodes
  3. Eventually the pod doing the backup gets killed
  4. PBM gets interrupted and cleans up properly (no leftover locks), but the operator does not notice
  5. The interrupted backup stays as "running" in the pbm-backup resource
  6. No further backup schedules are respected. New resources are created, but they stay untouched until the stale "running" resource is deleted manually.
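Until this is fixed, the stale resource from steps 5 and 6 has to be found by hand. A rough sketch of how that detection could be automated is below. It assumes the backup resources have already been fetched and flattened into plain dicts; the keys `name`, `state`, and `started_at` are hypothetical and do not reflect the real resource schema.

```python
from datetime import datetime, timedelta, timezone


def find_stale_running(backups, now, max_age):
    """Return names of backups that have been "running" longer than max_age.

    backups: list of dicts with hypothetical keys "name", "state" and
             "started_at" (ISO 8601 timestamp with a UTC offset).
    now:     timezone-aware datetime to compare against.
    max_age: timedelta; should exceed the longest legitimate backup.
    """
    stale = []
    for b in backups:
        if b["state"] != "running":
            continue  # only "running" resources can block the queue
        started = datetime.fromisoformat(b["started_at"])
        if now - started > max_age:
            stale.append(b["name"])
    return stale


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        {"name": "backup-a", "state": "running",
         "started_at": "2025-01-01T02:00:00+00:00"},
        {"name": "backup-b", "state": "waiting",
         "started_at": "2025-01-01T03:00:00+00:00"},
    ]
    now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(find_stale_running(sample, now, timedelta(hours=6)))
```

An age threshold alone cannot tell an interrupted backup from a very slow one, so anything this flags should still be confirmed against PBM before the resource is deleted.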

Versions

  1. Kubernetes: AKS 1.33.6
  2. Operator: 1.21.2
  3. Database: 8.0.17-6

Anything else?

If needed, I can try to gather more information once this happens again.
