Long running tasks are terminated by k8s #3735

Open
uzabanov opened this issue Jan 20, 2025 · 0 comments
Labels
bug Something isn't working


uzabanov commented Jan 20, 2025

How to reproduce

  • Push an app (dorifi)
  • Start a long running task - cf run-task dorifi -c "sleep 10000"
  • Wait ~30 seconds - the job for the task fails
  • The task is shown as failed:
❯ cf tasks dorifi
Getting tasks for app dorifi in org org / space space as cf-admin...

id                  name                                   state    start time                      command
20250120143013390   3d6aea4f-d8c0-4545-b2e4-c2c3fa20d91c   FAILED   Mon, 20 Jan 2025 14:30:13 UTC   sleep 10000

Dev notes

  • Is that expected behaviour?
  • The job fails with the following condition:
  conditions:
  - lastProbeTime: "2025-01-20T14:30:52Z"
    lastTransitionTime: "2025-01-20T14:30:52Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  • The following k8s events are recorded:
    • ExceededGracePeriod
Name:             3d6aea4f-d8c0-4545-b2e4-c2c3fa20d91c-hc5ml.181c6d069ffa849a
Namespace:        db3185c1-4d36-4391-b663-c1e96d39f84e
Labels:           <none>
Annotations:      <none>
API Version:      v1
Count:            1
Event Time:       <nil>
First Timestamp:  2025-01-20T14:30:31Z
Involved Object:
  API Version:       v1
  Kind:              Pod
  Name:              3d6aea4f-d8c0-4545-b2e4-c2c3fa20d91c-hc5ml
  Namespace:         db3185c1-4d36-4391-b663-c1e96d39f84e
  Resource Version:  2892
  UID:               60af529d-30bd-46cb-ba8c-b2f17fa14368
Kind:                Event
Last Timestamp:      2025-01-20T14:30:31Z
Message:             Container runtime did not kill the pod within specified grace period.
Metadata:
  Creation Timestamp:  2025-01-20T14:30:31Z
  Resource Version:    2996
  UID:                 7f0aeb1e-0bb5-4b98-b0a9-c39fc57b9762
Reason:                ExceededGracePeriod
Reporting Component:   kubelet
Reporting Instance:    e2e-control-plane
Source:
  Component:  kubelet
  Host:       e2e-control-plane
Type:         Warning
Events:       <none>
    • Kill event
Name:             3d6aea4f-d8c0-4545-b2e4-c2c3fa20d91c-hc5ml.181c6d044be52a59
Namespace:        db3185c1-4d36-4391-b663-c1e96d39f84e
Labels:           <none>
Annotations:      <none>
API Version:      v1
Count:            1
Event Time:       <nil>
First Timestamp:  2025-01-20T14:30:21Z
Involved Object:
  API Version:       v1
  Field Path:        spec.containers{workload}
  Kind:              Pod
  Name:              3d6aea4f-d8c0-4545-b2e4-c2c3fa20d91c-hc5ml
  Namespace:         db3185c1-4d36-4391-b663-c1e96d39f84e
  Resource Version:  2892
  UID:               60af529d-30bd-46cb-ba8c-b2f17fa14368
Kind:                Event
Last Timestamp:      2025-01-20T14:30:21Z
Message:             Stopping container workload
Metadata:
  Creation Timestamp:  2025-01-20T14:30:21Z
  Resource Version:    2945
  UID:                 057790a3-ca33-4bdf-98a8-f246c3513676
Reason:                Killing
Reporting Component:   kubelet
Reporting Instance:    e2e-control-plane
Source:
  Component:  kubelet
  Host:       e2e-control-plane
Type:         Normal
Events:       <none>
  • We suspect the pod is being killed because the default terminationGracePeriodSeconds for the job is 30 seconds. Note that we do not set this value in the job runner.
  • Note that in order to keep the job around, the following helm values are adjusted in deploy-on-kind:
--set=controllers.taskTTL="30m"
--set=jobTaskRunner.jobTTL="30m"
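
To line up the timestamps quoted above, here is a minimal Python sketch that computes the gaps between the task start (from the cf tasks output), the two kubelet events, and the Failed job condition. The event labels and the helper names parse and timeline are ours, for illustration only; the sketch does nothing beyond arithmetic on the timestamps already in this issue.

```python
from datetime import datetime, timezone

# Timestamps copied from this issue; the labels are illustrative, not k8s names.
EVENTS = [
    ("task start (cf tasks)", "2025-01-20T14:30:13Z"),
    ("Killing event", "2025-01-20T14:30:21Z"),
    ("ExceededGracePeriod event", "2025-01-20T14:30:31Z"),
    ("Job condition Failed", "2025-01-20T14:30:52Z"),
]


def parse(ts: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp of the form used in the k8s output."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def timeline(events):
    """Return (label, seconds since first event, seconds since previous event)."""
    start = parse(events[0][1])
    prev = start
    rows = []
    for label, ts in events:
        t = parse(ts)
        rows.append(
            (label, int((t - start).total_seconds()), int((t - prev).total_seconds()))
        )
        prev = t
    return rows


if __name__ == "__main__":
    for label, since_start, since_prev in timeline(EVENTS):
        print(f"{label:28s} +{since_start:3d}s  (delta {since_prev:3d}s)")
```

With these inputs, the Killing event comes 8 seconds after the task start, ExceededGracePeriod follows 10 seconds later, and the job is marked Failed another 21 seconds after that, 39 seconds after the start. The stop request therefore arrives well before 30 seconds have elapsed, which may be worth keeping in mind when evaluating the grace period hypothesis.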

@uzabanov uzabanov added the bug Something isn't working label Jan 20, 2025
@github-project-automation github-project-automation bot moved this to 🧊 Icebox in Korifi - Backlog Jan 20, 2025