DRA: add canary jobs and sync mechanism #33993

Open
wants to merge 7 commits into
base: master
2 changes: 2 additions & 0 deletions config/jobs/kubernetes/sig-node/.dra-sync-settings
@@ -0,0 +1,2 @@
# Last commit which was synced into dynamic-resource-allocation.yaml.
last_sync=056f34d0fffe6816d4f4ef7acd2e14811ff5c67a
62 changes: 62 additions & 0 deletions config/jobs/kubernetes/sig-node/dra-sync.sh
@@ -0,0 +1,62 @@
#!/usr/bin/env bash
# Copyright 2024 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Running this script will automatically take changes made to
# dynamic-resource-allocation-canary.yaml since the last sync
# (tracked in .dra-sync-settings) and create a commit which
# applies those changes to dynamic-resource-allocation.yaml.

set -o errexit
set -o nounset
set -o pipefail

REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../../../.." && pwd -P)"
cd "${REPO_ROOT}"

# get "last_sync"
source "config/jobs/kubernetes/sig-node/.dra-sync-settings"

if [ -n "$(git diff --cached 2>&1)" ]; then
    echo >&2 "ERROR: The git staging area must be clean."
    exit 1
fi

new_sync=$(git rev-parse HEAD)

diff=$(git diff ${last_sync}..${new_sync} config/jobs/kubernetes/sig-node/dynamic-resource-allocation-canary.yaml | sed -e 's/-canary//g')

if [ -z "${diff}" ]; then
    echo "No changes since last sync, nothing to do."
    exit 0
fi

# Generate a patch in "git format-patch" style and apply it.
git am <<EOF
From ${new_sync}
From: dra-sync.sh helper script <[email protected]>
Date: $(date --rfc-email)
Subject: [PATCH 1/1] dra: apply changes from canary jobs

---
${diff}
$(diff -u config/jobs/kubernetes/sig-node/.dra-sync-settings <(sed -e "s/last_sync=.*/last_sync=${new_sync}/" config/jobs/kubernetes/sig-node/.dra-sync-settings) | sed -e 's;^--- .*;--- a/config/jobs/kubernetes/sig-node/.dra-sync-settings;' -e 's;+++ .*;+++ b/config/jobs/kubernetes/sig-node/.dra-sync-settings;')
EOF

git log -p -n1

cat <<EOF

Commit created successfully (see above). Review and submit as a pull request.
EOF
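The two text transformations the script relies on are easy to check in isolation. A standalone sketch (hypothetical short hashes stand in for real commits):

```shell
#!/usr/bin/env bash
set -o errexit

# 1) Stripping the -canary suffix retargets the captured diff from the
#    canary file to dynamic-resource-allocation.yaml:
echo '--- a/config/jobs/kubernetes/sig-node/dynamic-resource-allocation-canary.yaml' \
    | sed -e 's/-canary//g'
# prints: --- a/config/jobs/kubernetes/sig-node/dynamic-resource-allocation.yaml

# 2) The last_sync bump is itself emitted as a unified diff so that it can
#    ride along in the same "git am" patch ("diff" exits 1 when the files
#    differ, hence the || true):
settings=$(mktemp)
echo 'last_sync=aaaaaaa' > "${settings}"
diff -u "${settings}" <(sed -e 's/last_sync=.*/last_sync=bbbbbbb/' "${settings}") || true
rm -f "${settings}"
```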
@@ -0,0 +1,320 @@
# This file contains canary periodic and presubmit jobs which run tests covering
# Dynamic Resource Allocation (DRA).
#
# The intent is to make all changes to DRA jobs first for these canary jobs,
# then copy the changes into dynamic-resource-allocation.yaml. Unless some
# experimental changes are being tested, a diff between the two files should
# be limited to:
# - this comment
# - job names
# - interval for the periodic jobs
#
# This command can be used to check this:
# diff dynamic-resource-allocation.yaml <(sed -e 's/-canary//' dynamic-resource-allocation-canary.yaml)


# `templates` has no special meaning. It just holds YAML anchors (= re-usable content)
# that get referenced below via YAML aliases: `<<: *job` includes the content
# and then allows adding or overwriting fields. Merging only happens at the top
# level of the mapping: if a field contains lists or objects, that content gets
# replaced instead of merged.
#
# Lists cannot be extended the same way (https://github.com/yaml/yaml/issues/35).
#
# If unsure what the expanded jobs look like or to test parsing, run `go test -v .`
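A tiny, hypothetical illustration of that merge behavior (the keys here are invented for the example, not part of the job config): `<<: *anchor` pulls in the anchored mapping's fields, but a list-valued field set next to it replaces the anchored list wholesale rather than appending to it:

```yaml
templates:
- &base
  decorate: true
  extra_refs:
  - org: kubernetes

- &derived
  <<: *base     # inherits decorate: true
  extra_refs:   # REPLACES the whole list from *base, no appending
  - org: kubernetes
  - org: test-infra
```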
Contributor Author

@kannon92 @bart0sh: I went ahead and rewrote this file so that it avoids duplication between the different jobs. From 3942639:

dra: use YAML anchors and aliases to avoid duplication

These different jobs were all similar, with small variations.
Keeping them in sync was an ongoing challenge, and we were losing it:
- at some point, features were enabled differently in E2E node
  presubmits and periodics (didn't make a difference, but
  they weren't the same anymore)
- resource settings for containerd vs CRI-O were different
  and it is unclear whether that was intentional (no comment
  about it)
- clusters were different

Defining common elements via YAML anchors once and reusing them via YAML
aliases avoids this.

`go test -v -run=TestYaml/dynamic-resource-allocation-canary.yaml .`
can be used to see the full job definitions.

What do you think?

I know @bart0sh pointed out the difficulty of keeping periodics and presubmits in sync. This approach solves that, but is it readable and usable enough?

Contributor Author

There are some test failures that would need further work. But let's first hear whether this is worth it at all.

Perhaps there is some other templating mechanism that can be used instead to generate the YAML files?

Contributor

This looks better than what we currently have.
However, I'd prefer to use the existing solution suggested by @dims on Slack.

Contributor

@pohly It would be even better to avoid having a lot of configuration details in the generation script (https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/kops/build_jobs.py). Instead, we could have a values.yaml with a set of values, templates for the various job types, and a generic build_jobs script that generates the result by rendering the templates based on values.yaml. I can play with this if you don't have time.
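A minimal sketch of that idea, using only the standard library; all file names, field names, and values below are hypothetical, and a real generator would use a YAML library and richer templates:

```python
# Hypothetical values-driven job generator: render periodic job entries
# from a list of per-job values, keeping all config details out of the script.
from string import Template

JOB_TEMPLATE = Template("""\
- name: $name
  interval: $interval
  spec:
    containers:
    - image: $image
""")

# In the proposed setup this list would come from a values.yaml file.
values = [
    {"name": "ci-kind-dra", "interval": "6h", "image": "kubekins-e2e:latest"},
    {"name": "ci-kind-dra-all", "interval": "6h", "image": "kubekins-e2e:latest"},
]

def render(values_list):
    """Render the full periodics section by substituting each value set."""
    return "periodics:\n" + "".join(JOB_TEMPLATE.substitute(v) for v in values_list)

print(render(values))
```

The point of the design is that adding a job becomes a one-line change to the values file instead of another copy of a job block.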

Contributor Author

I agree, this PR is going down the wrong track. Instead of two files which we need to keep in sync, we should have a single dynamic-resource-allocation.yaml which is fully generated. That file should also include canary presubmits where we can try out experimental changes before changing the generator to apply them also to the normal jobs.

We don't need canary periodic jobs, for two reasons:

  • If the difference between presubmit and periodic is guaranteed to be limited to the interval vs. branch settings, then testing the job spec with a canary presubmit is sufficient.
  • It is less critical if a periodic job breaks because it doesn't affect other developers' PRs.

I can play with this if you don't have time.

Thanks, please do.

Contributor

@pohly this is what I came up with so far, PTAL: #34010

templates:
- &job
  cluster: eks-prow-build-cluster
  annotations: &annotations
    testgrid-dashboards: sig-node-dynamic-resource-allocation
    # Alerting is enabled also for PRs. If someone repeatedly tests a broken
    # PR where the change introduced by the PR breaks tests, an alert will
    # eventually be triggered. This is a good thing because then we can
    # show the PR author how to do local testing...
    #
    # Disabled for canary jobs, enabled for real jobs.
    testgrid-alert-email: # [email protected],[email protected] # #wg-device-management on Slack
    fork-per-release: "false" # Only for canary jobs, must be true for real jobs.
  decorate: true

- &kubernetes-master
  org: kubernetes
  repo: kubernetes
  base_ref: master
  path_alias: k8s.io/kubernetes

- &test-infra-master
  org: kubernetes
  repo: test-infra
  base_ref: master
  path_alias: k8s.io/test-infra

- &periodic-job
  interval: 1000000h # Run only once on creation and when manually triggered.
  extra_refs:
  - *kubernetes-master

- &periodic-node-job
  <<: *periodic-job
  extra_refs:
  - *kubernetes-master
  - *test-infra-master # For test-infra/jobs/e2e_node files.

- &presubmit-job
  skip_branches:
  - release-\d+\.\d+ # per-release image
  always_run: false
  optional: true
  skip_report: false
  # run_if_changed is set judiciously on some jobs.

- &e2e-kind-job
  <<: *job
  labels:
    preset-service-account: "true"
    preset-dind-enabled: "true"
    preset-kind-volume-mounts: "true"
  decoration_config:
    timeout: 3h
  spec:
    containers:
    - &e2e-kind-container
      image: gcr.io/k8s-staging-test-infra/kubekins-e2e:v20241128-8df65c072f-master
      command:
      - runner.sh
      - /bin/sh
      - -xce
      - |
        make WHAT="github.com/onsi/ginkgo/v2/ginkgo k8s.io/kubernetes/test/e2e/e2e.test"
        curl -sSL https://kind.sigs.k8s.io/dl/latest/linux-amd64.tgz | tar xvfz - -C "${PATH%%:*}/" kind
        kind build node-image --image=dra/node:latest .
        trap 'kind export logs "${ARTIFACTS}/kind"; kind delete cluster' EXIT
        if ${with_all_features:-false}; then
          # Which DRA features exist can change over time.
          features=( $(grep '"DRA' pkg/features/kube_features.go | sed 's/.*"\(.*\)"/\1/') )
          echo "Enabling DRA feature(s): ${features[*]}."
          # Those additional features are not in kind.yaml, but they can be added at the end.
          kind create cluster --retain --config <(cat test/e2e/dra/kind.yaml; for feature in "${features[@]}"; do echo "  ${feature}: true"; done) --image dra/node:latest
          KUBERNETES_PROVIDER=local KUBECONFIG=${HOME}/.kube/config GINKGO_PARALLEL_NODES=8 E2E_REPORT_DIR=${ARTIFACTS} GINKGO_TIMEOUT=1h hack/ginkgo-e2e.sh -ginkgo.label-filter="Feature: containsAny DynamicResourceAllocation && Feature: isSubsetOf { Alpha, Beta, DynamicResourceAllocation$(for feature in "${features[@]}"; do echo ", ${feature}"; done)} && !Flaky && !Slow"
        else
          kind create cluster --retain --config test/e2e/dra/kind.yaml --image dra/node:latest
          KUBERNETES_PROVIDER=local KUBECONFIG=${HOME}/.kube/config GINKGO_PARALLEL_NODES=8 E2E_REPORT_DIR=${ARTIFACTS} GINKGO_TIMEOUT=2h30m hack/ginkgo-e2e.sh -ginkgo.label-filter='Feature: containsAny DynamicResourceAllocation && Feature: isSubsetOf { Beta, DynamicResourceAllocation } && !Flaky'
        fi
      # docker-in-docker needs privileged mode
      securityContext:
        privileged: true
      resources:
        limits:
          cpu: 2
          memory: 9Gi
        requests:
          cpu: 2
          memory: 9Gi

- &e2e-node-job
  <<: *job
  labels:
    preset-service-account: "true"
    preset-k8s-ssh: "true"
  decoration_config:
    timeout: 90m
  spec:
    containers:
    - &e2e-node-container
      image: gcr.io/k8s-staging-test-infra/kubekins-e2e:v20241128-8df65c072f-master
      command:
      - /bin/bash
      - -ce
      - |
        export IGNITION_INJECT_GCE_SSH_PUBLIC_KEY_FILE=1
        export GOPATH=/go

        # Compose the command depending on the `runtime` variable set for the job.
        cmd=(
          runner.sh
          /workspace/scenarios/kubernetes_e2e.py
          --deployment=node
          --env=KUBE_SSH_USER=core
          --gcp-zone=us-west1-b
          --node-tests=true
          --provider=gce
          '--test_args=--timeout=1h --label-filter="Feature: containsAny DynamicResourceAllocation && Feature: isSubsetOf { Beta, DynamicResourceAllocation } && !Flaky"'
          --timeout=65m
        )
        case ${runtime:-containerd} in
          containerd)
            cmd+=(
              '--node-test-args=--feature-gates=DynamicResourceAllocation=true --service-feature-gates=DynamicResourceAllocation=true --runtime-config=api/beta=true --container-runtime-endpoint=unix:///run/containerd/containerd.sock --container-runtime-process-name=/usr/bin/containerd --container-runtime-pid-file= --kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/containerd.service" --extra-log="{\"name\": \"containerd.log\", \"journalctl\": [\"-u\", \"containerd\"]}"'
              --node-args=--image-config-file=/home/prow/go/src/k8s.io/test-infra/jobs/e2e_node/dra/image-config-containerd-1.7.yaml
            )
            ;;
          crio-cgroupv1)
            cmd+=(
              '--node-test-args=--feature-gates=DynamicResourceAllocation=true --service-feature-gates=DynamicResourceAllocation=true --runtime-config=api/beta=true --container-runtime-endpoint=unix:///var/run/crio/crio.sock --container-runtime-process-name=/usr/local/bin/crio --container-runtime-pid-file= --kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}"'
              --node-args=--image-config-file=/home/prow/go/src/k8s.io/test-infra/jobs/e2e_node/crio/latest/image-config-cgroupv1-serial.yaml
            )
            ;;
          crio-cgroupv2)
            cmd+=(
              '--node-test-args=--feature-gates=DynamicResourceAllocation=true --service-feature-gates=DynamicResourceAllocation=true --runtime-config=api/beta=true --container-runtime-endpoint=unix:///var/run/crio/crio.sock --container-runtime-process-name=/usr/local/bin/crio --container-runtime-pid-file= --kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}"'
              --node-args=--image-config-file=/home/prow/go/src/k8s.io/test-infra/jobs/e2e_node/crio/latest/image-config-cgroupv2-serial.yaml
            )
            ;;
        esac

        # Run it.
        set -x
        "${cmd[@]}"
      resources:
        limits:
          cpu: 2
          memory: 9Gi
        requests:
          cpu: 2
          memory: 9Gi

periodics:

- <<: *e2e-kind-job
  <<: *periodic-job
  name: ci-kind-dra-canary
  annotations:
    <<: *annotations
    testgrid-tab-name: ci-kind-dra-canary
    description: Runs E2E tests for Dynamic Resource Allocation beta features against a Kubernetes master cluster created with sigs.k8s.io/kind

- <<: *e2e-kind-job
  <<: *periodic-job
  name: ci-kind-dra-all-canary
  annotations:
    <<: *annotations
    testgrid-tab-name: ci-kind-dra-all-canary
    description: Runs E2E tests for Dynamic Resource Allocation alpha and beta features against a Kubernetes master cluster created with sigs.k8s.io/kind
  spec:
    containers:
    - <<: *e2e-kind-container
      env:
      - name: with_all_features
        value: "true"

- <<: *e2e-node-job
  <<: *periodic-node-job
  name: ci-node-e2e-cgrpv1-crio-dra-canary
  annotations:
    <<: *annotations
    testgrid-dashboards: sig-node-cri-o, sig-node-dynamic-resource-allocation
    testgrid-tab-name: ci-node-e2e-cgrpv1-crio-dra-canary
    description: Runs E2E node tests for Dynamic Resource Allocation beta features with CRI-O using cgroup v1
  spec:
    containers:
    - <<: *e2e-node-container
      env:
      - name: runtime
        value: crio-cgroupv1

- <<: *e2e-node-job
  <<: *periodic-node-job
  name: ci-node-e2e-cgrpv2-crio-dra-canary
  annotations:
    <<: *annotations
    testgrid-dashboards: sig-node-cri-o, sig-node-dynamic-resource-allocation
    description: Runs E2E node tests for Dynamic Resource Allocation beta features with CRI-O using cgroup v2
    testgrid-tab-name: ci-node-e2e-cgrpv2-crio-dra-canary
  spec:
    containers:
    - <<: *e2e-node-container
      env:
      - name: runtime
        value: crio-cgroupv2

- <<: *e2e-node-job
  <<: *periodic-node-job
  name: ci-node-e2e-containerd-dra-canary
  annotations:
    <<: *annotations
    description: Runs E2E node tests for Dynamic Resource Allocation beta features with containerd
    testgrid-tab-name: ci-node-e2e-containerd-dra-canary
  spec:
    containers:
    - <<: *e2e-node-container
      env:
      - name: runtime
        value: containerd

presubmits:
  kubernetes/kubernetes:

  - <<: *e2e-kind-job
    <<: *presubmit-job
    name: pull-kubernetes-kind-dra-canary
    annotations:
      <<: *annotations
      testgrid-tab-name: pull-kubernetes-kind-dra-canary
      description: Runs E2E tests for Dynamic Resource Allocation beta features against a Kubernetes master cluster created with sigs.k8s.io/kind
    run_if_changed: &e2e-run-if-changed-e2e /(dra|dynamicresources|resourceclaim|deviceclass|resourceslice|resourceclaimtemplate|dynamic-resource-allocation|pkg/apis/resource|api/resource)/.*\.go

  - <<: *e2e-kind-job
    <<: *presubmit-job
    name: pull-kubernetes-kind-dra-all-canary
    annotations:
      <<: *annotations
      testgrid-tab-name: pull-kubernetes-kind-dra-all-canary
      description: Runs E2E tests for Dynamic Resource Allocation alpha and beta features against a Kubernetes master cluster created with sigs.k8s.io/kind
    run_if_changed: *e2e-run-if-changed-e2e
    spec:
      containers:
      - <<: *e2e-kind-container
        env:
        - name: with_all_features
          value: "true"

  - <<: *e2e-node-job
    <<: *presubmit-job
    name: pull-kubernetes-node-e2e-cgrpv1-crio-dra-canary
    annotations:
      <<: *annotations
      testgrid-dashboards: sig-node-cri-o, sig-node-dynamic-resource-allocation
      testgrid-tab-name: pull-kubernetes-node-e2e-cgrpv1-crio-dra-canary
      description: Runs E2E node tests for Dynamic Resource Allocation beta features with CRI-O using cgroup v1
    # Automatically testing with one container runtime in one configuration is sufficient to detect basic problems in kubelet early.
    # CRI-O was picked because it was solid for testing so far.
    # Periodic variant: ci-node-e2e-crio-cgrpv1-dra
    run_if_changed: (/dra/|/dynamicresources/|/resourceclaim/|/deviceclass/|/resourceslice/|/resourceclaimtemplate/|/dynamic-resource-allocation/|/pkg/apis/resource/|/api/resource/|/test/e2e_node/dra_).*\.(go|yaml)
    spec:
      containers:
      - <<: *e2e-node-container
        env:
        - name: runtime
          value: crio-cgroupv1

  - <<: *e2e-node-job
    <<: *presubmit-job
    name: pull-kubernetes-node-e2e-cgrpv2-crio-dra-canary
    annotations:
      <<: *annotations
      testgrid-dashboards: sig-node-cri-o, sig-node-dynamic-resource-allocation
      description: Runs E2E node tests for Dynamic Resource Allocation beta features with CRI-O using cgroup v2
      testgrid-tab-name: pull-kubernetes-node-e2e-cgrpv2-crio-dra-canary
    spec:
      containers:
      - <<: *e2e-node-container
        env:
        - name: runtime
          value: crio-cgroupv2

  - <<: *e2e-node-job
    <<: *presubmit-job
    name: pull-kubernetes-node-e2e-containerd-dra-canary
    annotations:
      <<: *annotations
      description: Runs E2E node tests for Dynamic Resource Allocation beta features with containerd
      testgrid-tab-name: pull-kubernetes-node-e2e-containerd-dra-canary
    spec:
      containers:
      - <<: *e2e-node-container
        env:
        - name: runtime
          value: containerd