duplicate timestamps in kube_job_status_failed metrics for retried Jobs #2565

shlomitubul opened this issue Nov 30, 2024 · 3 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.), needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)

Comments

shlomitubul commented Nov 30, 2024

What happened:
When a Job fails and a later run succeeds, and failedJobsHistoryLimit is greater than 0, kube_job_status_failed produces duplicate samples, because the metric has no unique label (like job_pod) to tell the retained runs apart.

What you expected to happen:
kube_job_* metrics, or at least kube_job_status_failed, should have some unique label (it could be job_pod or retry_index) so Prometheus doesn't reject the scrape.
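For illustration, roughly what I mean (the job name below is just a placeholder, and neither of these labels exists today):

# today: nothing on the series distinguishes the retained failed run
kube_job_status_failed{namespace="default",job_name="my-cron-28900000"} 1
# suggested: a hypothetical retry_index (or job_pod) label would keep each attempt unique
kube_job_status_failed{namespace="default",job_name="my-cron-28900000",retry_index="0"} 1
kube_job_status_failed{namespace="default",job_name="my-cron-28900000",retry_index="1"} 0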

How to reproduce it (as minimally and precisely as possible):
Create a CronJob with failedJobsHistoryLimit: 1 and backoffLimit: 2, trigger the job and make the pod exit with an error, then trigger it again and let it pass.
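A rough sketch of the commands (the CronJob name, namespace, and the kube-state-metrics service/namespace/port are placeholders for whatever you run):

# trigger a run manually instead of waiting for the schedule
# (make this first run fail, e.g. by having the container exit non-zero)
kubectl create job manual-1 --from=cronjob/my-cron -n default
# once it has failed, trigger another run and let it succeed
kubectl create job manual-2 --from=cronjob/my-cron -n default

# then inspect what kube-state-metrics exposes for the retained Jobs
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &
curl -s http://localhost:8080/metrics | grep kube_job_status_failed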

Anything else we need to know?:

Environment:

  • kube-state-metrics version: 2.14.0
  • Kubernetes version (use kubectl version): v1.30.5-gke.1014003
  • Cloud provider or hardware configuration: GKE
  • Other info:

shlomitubul added the kind/bug label Nov 30, 2024
k8s-ci-robot (Contributor) commented:

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added the needs-triage label Nov 30, 2024

zoglam commented Dec 12, 2024

@shlomitubul
btw, I can't reproduce this scenario. Is there any additional information or an example CronJob manifest? What about the collectors in your system and their settings? Could you tell us more about the number of pods and CronJobs in the cluster?

My attempt is below. I also tried failedJobsHistoryLimit with a higher value.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: sleep-cronjob
  namespace: default
spec:
  schedule: '*/1 * * * *'
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          containers:
            - name: sleep-container
              image: alpine:3.12
              command:
                - sh
                - '-c'
                - |
                  sleep 30
                  if [ ! -f /tmp/1.txt ]; then
                    touch /tmp/1.txt && exit 1
                  fi
                  exit 0
              volumeMounts: [{ name: files-volume, mountPath: /tmp }]
          volumes: [{ name: files-volume, emptyDir: {} }]
          restartPolicy: OnFailure


shlomitubul commented Dec 13, 2024

@zoglam it reproduces every time. This is the CronJob I used:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: renovate
  namespace: renovate
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - env:
            - name: RENOVATE_CONFIG_FILE
              value: /usr/src/app/config.json
            - name: LOG_LEVEL
              value: INFO
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://opentelemetry-collector.open-telemetry:4318
            image: ghcr.io/renovatebot/renovate:39.8.0
            imagePullPolicy: IfNotPresent
            name: renovate
            resources:
              limits:
                memory: 3Gi
              requests:
                cpu: "1"
                memory: 3Gi
          dnsPolicy: ClusterFirst
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext:
            fsGroup: 1000
          serviceAccount: renovate-sa
          serviceAccountName: renovate-sa
          terminationGracePeriodSeconds: 30
  schedule: '@hourly'
  successfulJobsHistoryLimit: 3
  suspend: false

We run Prometheus in each cluster and it sends metrics to Mimir. The number of pods in each CronJob/Job is 1, and we don't allow/need concurrency. So once we trigger a job and the main process fails, for example, and then let the next run pass, we get this error.
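For reference, roughly how we list what ends up in Prometheus for that namespace (the Prometheus address here is a placeholder):

# list every kube_job_status_failed series Prometheus currently has for the namespace
curl -sG 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=kube_job_status_failed{namespace="renovate"}'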
