Long running tasks are terminated by k8s #3735

uzabanov · 2025-01-20T14:33:44Z

How to reproduce

Push an app (dorifi)
Start a long-running task: cf run-task dorifi -c "sleep 10000"
Wait ~30 seconds; the job for the task fails
The task is shown as failed:

❯ cf tasks dorifi
Getting tasks for app dorifi in org org / space space as cf-admin...

id                  name                                   state    start time                      command
20250120143013390   3d6aea4f-d8c0-4545-b2e4-c2c3fa20d91c   FAILED   Mon, 20 Jan 2025 14:30:13 UTC   sleep 10000

Dev notes

Is that expected behaviour?
The job fails with the following condition:

  conditions:
  - lastProbeTime: "2025-01-20T14:30:52Z"
    lastTransitionTime: "2025-01-20T14:30:52Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed

The following k8s events are recorded for the pod:
- ExceededGracePeriod

Name:             3d6aea4f-d8c0-4545-b2e4-c2c3fa20d91c-hc5ml.181c6d069ffa849a
Namespace:        db3185c1-4d36-4391-b663-c1e96d39f84e
Labels:           <none>
Annotations:      <none>
API Version:      v1
Count:            1
Event Time:       <nil>
First Timestamp:  2025-01-20T14:30:31Z
Involved Object:
  API Version:       v1
  Kind:              Pod
  Name:              3d6aea4f-d8c0-4545-b2e4-c2c3fa20d91c-hc5ml
  Namespace:         db3185c1-4d36-4391-b663-c1e96d39f84e
  Resource Version:  2892
  UID:               60af529d-30bd-46cb-ba8c-b2f17fa14368
Kind:                Event
Last Timestamp:      2025-01-20T14:30:31Z
Message:             Container runtime did not kill the pod within specified grace period.
Metadata:
  Creation Timestamp:  2025-01-20T14:30:31Z
  Resource Version:    2996
  UID:                 7f0aeb1e-0bb5-4b98-b0a9-c39fc57b9762
Reason:                ExceededGracePeriod
Reporting Component:   kubelet
Reporting Instance:    e2e-control-plane
Source:
  Component:  kubelet
  Host:       e2e-control-plane
Type:         Warning
Events:       <none>

- Killing

Name:             3d6aea4f-d8c0-4545-b2e4-c2c3fa20d91c-hc5ml.181c6d044be52a59
Namespace:        db3185c1-4d36-4391-b663-c1e96d39f84e
Labels:           <none>
Annotations:      <none>
API Version:      v1
Count:            1
Event Time:       <nil>
First Timestamp:  2025-01-20T14:30:21Z
Involved Object:
  API Version:       v1
  Field Path:        spec.containers{workload}
  Kind:              Pod
  Name:              3d6aea4f-d8c0-4545-b2e4-c2c3fa20d91c-hc5ml
  Namespace:         db3185c1-4d36-4391-b663-c1e96d39f84e
  Resource Version:  2892
  UID:               60af529d-30bd-46cb-ba8c-b2f17fa14368
Kind:                Event
Last Timestamp:      2025-01-20T14:30:21Z
Message:             Stopping container workload
Metadata:
  Creation Timestamp:  2025-01-20T14:30:21Z
  Resource Version:    2945
  UID:                 057790a3-ca33-4bdf-98a8-f246c3513676
Reason:                Killing
Reporting Component:   kubelet
Reporting Instance:    e2e-control-plane
Source:
  Component:  kubelet
  Host:       e2e-control-plane
Type:         Normal
Events:       <none>
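
The timestamps quoted above can be lined up against the task's start time to see how quickly the failure unfolds. The following is a small shell sketch that does only that arithmetic. The helper names `to_secs` and `elapsed` are made up for illustration, and the sketch assumes all timestamps fall on the same day, which is true for the ones in this report.

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight (hypothetical helper).
to_secs() {
  h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
  # Strip one leading zero so values such as "08" are not read as octal.
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Seconds between two same-day timestamps (hypothetical helper).
elapsed() {
  echo $(( $(to_secs "$2") - $(to_secs "$1") ))
}

START=14:30:13      # "start time" column from cf tasks
KILLING=14:30:21    # Killing event ("Stopping container workload")
EXCEEDED=14:30:31   # ExceededGracePeriod event
FAILED=14:30:52     # job condition of type Failed

echo "start   -> Killing:             $(elapsed "$START" "$KILLING")s"     # 8s
echo "Killing -> ExceededGracePeriod: $(elapsed "$KILLING" "$EXCEEDED")s"  # 10s
echo "start   -> job Failed:          $(elapsed "$START" "$FAILED")s"      # 39s
```

By these timestamps the container is asked to stop 8 seconds after the task starts, and the job is marked as failed 39 seconds after the start.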

We suspect the pod is killed because the default terminationGracePeriodSeconds for the job's pod is 30 seconds. Note that we do not set that field in the job runner.
To keep the job around, the following helm values are adjusted in deploy-on-kind:

--set=controllers.taskTTL="30m"
--set=jobTaskRunner.jobTTL="30m"
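
To check the hypothesis about terminationGracePeriodSeconds, it may help to read the value the pod actually ends up with, together with the job's condition reasons. The sketch below wraps two kubectl queries in shell functions. The function names are hypothetical. It also assumes that the Job is named after the task GUID (the pod name above suggests this, but it is not confirmed here) and that its pods carry the `job-name` label set by the Kubernetes Job controller.

```shell
#!/bin/sh
# Print terminationGracePeriodSeconds for every pod that belongs to a job.
# Assumption: pods are labelled job-name=<job> by the Job controller.
pod_grace_period() {
  ns=$1; job=$2
  kubectl get pods -n "$ns" -l "job-name=$job" \
    -o jsonpath='{.items[*].spec.terminationGracePeriodSeconds}'
}

# Print the reason of each status condition on the job,
# for example BackoffLimitExceeded.
job_condition_reasons() {
  ns=$1; job=$2
  kubectl get job "$job" -n "$ns" \
    -o jsonpath='{.status.conditions[*].reason}'
}
```

With the identifiers from this report, the calls would be `pod_grace_period db3185c1-4d36-4391-b663-c1e96d39f84e 3d6aea4f-d8c0-4545-b2e4-c2c3fa20d91c` and the same arguments for `job_condition_reasons`. Kubernetes defaults the field to 30 when the pod template leaves it unset, so a printed value of 30 would confirm that the default applies, though not by itself that it is the cause of the kill.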

uzabanov added the bug label Jan 20, 2025

korifi-bot added this to Korifi - Backlog Jan 20, 2025

github-project-automation bot moved this to 🧊 Icebox in Korifi - Backlog Jan 20, 2025