duplicate timestamps in kube_job_status_failed metrics for retried Jobs #2565

shlomitubul opened this issue Nov 30, 2024 · 3 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.), needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)

Comments

shlomitubul commented Nov 30, 2024

What happened:
When a Job fails and a later run succeeds, and failedJobsHistoryLimit is greater than 0, kube_job_status_failed produces duplicate samples, because the metric has no unique label (like job_pod) to tell the retained runs apart.

What you expected to happen:
kube_job_* metrics, or at least kube_job_status_failed, should have some unique label (it could be job_pod or retry_index) so Prometheus doesn't reject the scrape.
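For illustration, roughly what I mean (the job name below is just a placeholder, and neither of these labels exists today):

# today: nothing on the series distinguishes the retained failed run
kube_job_status_failed{namespace="default",job_name="my-cron-28900000"} 1
# suggested: a hypothetical retry_index (or job_pod) label would keep each attempt unique
kube_job_status_failed{namespace="default",job_name="my-cron-28900000",retry_index="0"} 1
kube_job_status_failed{namespace="default",job_name="my-cron-28900000",retry_index="1"} 0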

How to reproduce it (as minimally and precisely as possible):
Create a CronJob with failedJobsHistoryLimit: 1 and backoffLimit: 2, trigger the job and make the pod exit with an error, then trigger it again and let it pass.
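A rough sketch of the commands (the CronJob name, namespace, and the kube-state-metrics service/namespace/port are placeholders for whatever you run):

# trigger a run manually instead of waiting for the schedule
# (make this first run fail, e.g. by having the container exit non-zero)
kubectl create job manual-1 --from=cronjob/my-cron -n default
# once it has failed, trigger another run and let it succeed
kubectl create job manual-2 --from=cronjob/my-cron -n default

# then inspect what kube-state-metrics exposes for the retained Jobs
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &
curl -s http://localhost:8080/metrics | grep kube_job_status_failed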

Anything else we need to know?:

Environment:

  • kube-state-metrics version: 2.14.0
  • Kubernetes version (use kubectl version): v1.30.5-gke.1014003
  • Cloud provider or hardware configuration: GKE
  • Other info:

shlomitubul added the kind/bug label Nov 30, 2024
k8s-ci-robot (Contributor) commented:

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added the needs-triage label Nov 30, 2024

zoglam commented Dec 12, 2024

@shlomitubul
btw, I can't reproduce this scenario. Is there any additional information or an example CronJob manifest? What about the collectors in your system and their settings? Could you tell us more about the number of pods and CronJobs in the cluster?

My attempt is below. I also tried failedJobsHistoryLimit with a higher value.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: sleep-cronjob
  namespace: default
spec:
  schedule: '*/1 * * * *'
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          containers:
            - name: sleep-container
              image: alpine:3.12
              command:
                - sh
                - '-c'
                - |
                  sleep 30
                  if [ ! -f /tmp/1.txt ]; then
                    touch /tmp/1.txt && exit 1
                  fi
                  exit 0
              volumeMounts: [{ name: files-volume, mountPath: /tmp }]
          volumes: [{ name: files-volume, emptyDir: {} }]
          restartPolicy: OnFailure


shlomitubul commented Dec 13, 2024

@zoglam it reproduces every time. This is the CronJob I used:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: renovate
  namespace: renovate
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - env:
            - name: RENOVATE_CONFIG_FILE
              value: /usr/src/app/config.json
            - name: LOG_LEVEL
              value: INFO
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://opentelemetry-collector.open-telemetry:4318
            image: ghcr.io/renovatebot/renovate:39.8.0
            imagePullPolicy: IfNotPresent
            name: renovate
            resources:
              limits:
                memory: 3Gi
              requests:
                cpu: "1"
                memory: 3Gi
          dnsPolicy: ClusterFirst
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext:
            fsGroup: 1000
          serviceAccount: renovate-sa
          serviceAccountName: renovate-sa
          terminationGracePeriodSeconds: 30
  schedule: '@hourly'
  successfulJobsHistoryLimit: 3
  suspend: false

We run Prometheus in each cluster and it sends metrics to Mimir. The number of pods in each CronJob/Job is 1, and we don't allow/need concurrency. So once we trigger a job and the main process fails, for example, and then let the next run pass, we get this error.
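For reference, roughly how we list what ends up in Prometheus for that namespace (the Prometheus address here is a placeholder):

# list every kube_job_status_failed series Prometheus currently has for the namespace
curl -sG 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=kube_job_status_failed{namespace="renovate"}'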
