Skipping MCAD CPU Preemption Test #696

Open: wants to merge 29 commits into base: main.

Commits (29):
701e8e9 - Adding skip to flaky tests (Fiona-Waters, Dec 7, 2023)
4db97fb - use only 2 cpus (asm582, Dec 15, 2023)
f5f6743 - add dockerd cmd (asm582, Dec 15, 2023)
816733a - remove kind resource config (asm582, Dec 15, 2023)
41961a9 - add docker res config (asm582, Dec 15, 2023)
f8a91da - debug docker res config-1 (asm582, Dec 15, 2023)
2d8401c - debug docker res config-2 (asm582, Dec 15, 2023)
fa08868 - debug docker res config-3 (asm582, Dec 15, 2023)
b7917d6 - debug docker res config-4 (asm582, Dec 15, 2023)
265214a - debug docker res config-5 (asm582, Dec 15, 2023)
2ff4f37 - debug docker res config-6 (asm582, Dec 15, 2023)
25e6ec1 - debug docker res config-7 (asm582, Dec 15, 2023)
fba9c92 - debug docker res config-8 (asm582, Dec 15, 2023)
1182c22 - debug docker res config-9 (asm582, Dec 15, 2023)
b45b7af - debug docker res config-10 (asm582, Dec 15, 2023)
58aa461 - debug docker res config-11 (asm582, Dec 15, 2023)
65f9a8e - debug docker res config-12 (asm582, Dec 15, 2023)
105f3ff - debug docker res config-13 (asm582, Dec 15, 2023)
8f536c0 - debug docker res config-14 (asm582, Dec 15, 2023)
45345be - debug docker res config-15 (asm582, Dec 15, 2023)
10bb865 - debug docker res config-16 (asm582, Dec 15, 2023)
fddcc58 - debug docker res config-17 (asm582, Dec 15, 2023)
40648bb - debug docker res config-18 (asm582, Dec 15, 2023)
73f374e - debug docker res config-19 (asm582, Dec 15, 2023)
dd1c862 - debug docker res config-20 (asm582, Dec 15, 2023)
a991ddc - debug docker res config-21 (asm582, Dec 15, 2023)
883e04a - debug docker res config-22 (asm582, Dec 16, 2023)
ff376f5 - fix failing test (asm582, Dec 16, 2023)
8eeaaf8 - fix test 2 (asm582, Dec 17, 2023)
19 changes: 18 additions & 1 deletion .github/workflows/mcad-CI.yml
@@ -9,8 +9,25 @@ on:
jobs:
MCAD-CI:
runs-on: ubuntu-latest

steps:
- name: run docker resource config
run: |
# sudo touch /etc/systemd/system/docker_limit.slice
# cat <<EOF > /etc/systemd/system/docker_limit.slice
# [Unit]
# Description=Slice that limits docker resources
# Before=slices.target
# [Slice]
# CPUAccounting=true
# CPUQuota=50%
# EOF
# sudo systemctl start /etc/systemd/system/docker_limit.slice
# new_content='{ "exec-opts": ["native.cgroupdriver=cgroupfs"], "cgroup-parent": "/docker_limit.slice" }'
# sudo sed -i 's|{ "exec-opts": \["native.cgroupdriver=cgroupfs"\], "cgroup-parent": "/actions_job" }|'"$new_content"'|' /etc/docker/daemon.json
# cat /etc/docker/daemon.json
# sudo systemctl restart docker
# sleep 10
docker info | grep CPU
- name: checkout code
uses: actions/checkout@v3
with:
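Note: the commented-out lines above preserve the approach explored in the twenty-two "debug docker res config" commits: capping the runner's Docker CPU through a systemd slice (docker_limit.slice with CPUQuota=50%) wired into /etc/docker/daemon.json via cgroup-parent. That approach was abandoned; the step now only prints the CPU configuration visible to Docker (docker info | grep CPU), and the e2e tests instead size their CPU demands as a fraction of the measured cluster capacity (see cpuDemand and getClusterCapacitycontext in test/e2e/util.go below).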
55 changes: 42 additions & 13 deletions test/e2e/queue.go
@@ -95,6 +95,9 @@ var _ = Describe("AppWrapper E2E Test", func() {
})

It("MCAD CPU Preemption Test", func() {

Skip("Skipping MCAD CPU Preemption Test - [Bug] Failing intermittently on opened PRs")

fmt.Fprintf(os.Stdout, "[e2e] MCAD CPU Preemption Test - Started.\n")

context := initTestContext()
@@ -126,6 +129,9 @@ var _ = Describe("AppWrapper E2E Test", func() {
})

It("MCAD CPU Requeuing - Completion After Enough Requeuing Times Test", func() {

Skip("Skipping MCAD CPU Requeuing - Completion After Enough Requeuing Times Test - [Bug] Failing intermittently on opened PRs")

fmt.Fprintf(os.Stdout, "[e2e] Completion After Enough Requeuing Times Test - Started.\n")

context := initTestContext()
@@ -146,6 +152,9 @@ var _ = Describe("AppWrapper E2E Test", func() {
})

It("MCAD CPU Requeuing - Deletion After Maximum Requeuing Times Test", func() {

Skip("Skipping MCAD CPU Requeuing - Deletion After Maximum Requeuing Times Test - [Bug] Failing intermittently on opened PRs")

fmt.Fprintf(os.Stdout, "[e2e] MCAD CPU Requeuing - Deletion After Maximum Requeuing Times Test - Started.\n")

context := initTestContext()
@@ -418,15 +427,20 @@ var _ = Describe("AppWrapper E2E Test", func() {
appwrappersPtr := &appwrappers
defer cleanupTestObjectsPtr(context, appwrappersPtr)

// This should fill up the worker node and most of the master node
aw := createDeploymentAWwith550CPU(context, appendRandomString("aw-deployment-2-550cpu"))
// This should fill up the worker node and most of the master node
//aw := createDeploymentAWwith550CPU(context, appendRandomString("aw-deployment-2-550cpu"))
cap := getClusterCapacitycontext(context)
resource := cpuDemand(cap, 0.275).String()
aw := createGenericDeploymentCustomPodResourcesWithCPUAW(
Review comment, @KPostOffice (Contributor), Dec 19, 2023:
What happens if the cluster has many smaller nodes, resulting in a high capacity but an inability to schedule AppWrappers because they do not fit on the individual nodes? Do we care about that at all in this test case?

Reply (Member):
From a test case perspective, the cluster is assumed to have homogeneous nodes, and it requests deployments that fit on a node in the cluster in the CPU dimension.
context, appendRandomString("aw-ff-deployment-55-percent-cpu"), resource, resource, 2, 60)
appwrappers = append(appwrappers, aw)
err := waitAWPodsReady(context, aw)
Expect(err).NotTo(HaveOccurred(), "Expecting pods for app wrapper: aw-deployment-2-550cpu")
Expect(err).NotTo(HaveOccurred(), "Expecting pods for app wrapper: aw-ff-deployment-1-3500-cpu")

// This should not fit on any node but should dispatch because there is enough aggregate capacity.
resource2 := cpuDemand(cap, 0.4).String()
aw2 := createGenericDeploymentCustomPodResourcesWithCPUAW(
context, appendRandomString("aw-ff-deployment-1-850-cpu"), "850m", "850m", 1, 60)
context, appendRandomString("aw-ff-deployment-40-percent-cpu"), resource2, resource2, 1, 60)

appwrappers = append(appwrappers, aw2)

Expand All @@ -439,18 +453,19 @@ var _ = Describe("AppWrapper E2E Test", func() {

// This should fit on the cluster after AW aw-ff-deployment-40-percent-cpu above is automatically preempted on
// scheduling failure
resource3 := cpuDemand(cap, 0.15).String()
aw3 := createGenericDeploymentCustomPodResourcesWithCPUAW(
context, appendRandomString("aw-ff-deployment-2-340-cpu"), "340m", "340m", 2, 60)
context, appendRandomString("aw-ff-deployment-15-percent-cpu"), resource3, resource3, 2, 60)

appwrappers = append(appwrappers, aw3)

// Wait for pods to get created; assumes preemption takes around 10 minutes
err = waitAWPodsExists(context, aw3, 720000*time.Millisecond)
Expect(err).NotTo(HaveOccurred(), "Expecting pods for app wrapper: aw-ff-deployment-2-340-cpu")
Expect(err).NotTo(HaveOccurred(), "Expecting pods for app wrapper: aw-ff-deployment-15-percent-cpu")
fmt.Fprintf(GinkgoWriter, "[e2e] MCAD Scheduling Fail Fast Preemption Test - Pods not found for app wrapper aw-ff-deployment-2-340-cpu\n")

err = waitAWPodsReady(context, aw3)
Expect(err).NotTo(HaveOccurred(), "Expecting no pods for app wrapper: aw-ff-deployment-2-340-cpu")
Expect(err).NotTo(HaveOccurred(), "Expecting no pods for app wrapper: aw-ff-deployment-15-percent-cpu")
fmt.Fprintf(GinkgoWriter, "[e2e] MCAD Scheduling Fail Fast Preemption Test - Ready pods found for app wrapper aw-ff-deployment-2-340-cpu\n")

// Make sure pods from AW aw-ff-deployment-40-percent-cpu have been preempted
@@ -486,15 +501,21 @@ var _ = Describe("AppWrapper E2E Test", func() {
defer cleanupTestObjectsPtr(context, appwrappersPtr)

// This should fill up the worker node and most of the master node
aw := createDeploymentAWwith550CPU(context, appendRandomString("aw-deployment-2-550cpu"))
cap := getClusterCapacitycontext(context)
resource := cpuDemand(cap, 0.275).String()
aw := createGenericDeploymentCustomPodResourcesWithCPUAW(
context, appendRandomString("aw-ff-deployment-55-percent-cpu"), resource, resource, 2, 60)
appwrappers = append(appwrappers, aw)

err := waitAWPodsReady(context, aw)
Expect(err).NotTo(HaveOccurred(), "Expecting pods to be ready for app wrapper: aw-deployment-2-550cpu")

// This should not fit on the cluster, but customPodResources is incorrect, so AW pods are created anyway
// aw2 := createGenericDeploymentCustomPodResourcesWithCPUAW(
// context, appendRandomString("aw-deployment-2-425-vs-426-cpu"), "425m", "426m", 2, 60)
resource2 := cpuDemand(cap, 0.5).String()
aw2 := createGenericDeploymentCustomPodResourcesWithCPUAW(
context, appendRandomString("aw-deployment-2-425-vs-426-cpu"), "425m", "426m", 2, 60)
context, appendRandomString("aw-ff-deployment-40-percent-cpu"), "425m", resource2, 1, 60)

appwrappers = append(appwrappers, aw2)

@@ -506,6 +527,7 @@ var _ = Describe("AppWrapper E2E Test", func() {
})

It("MCAD Bad Custom Pod Resources vs. Deployment Pod Resource Queuing Test 2", func() {
Skip("MCAD Bad Custom Pod Resources vs. Deployment Pod Resource Queuing Test 2 - Deployment controller removed and this test case does not apply")
fmt.Fprintf(os.Stdout, "[e2e] MCAD Bad Custom Pod Resources vs. Deployment Pod Resource Queuing Test 2 - Started.\n")
context := initTestContext()
var appwrappers []*arbv1.AppWrapper
@@ -649,18 +671,25 @@ var _ = Describe("AppWrapper E2E Test", func() {
defer cleanupTestObjectsPtr(context, appwrappersPtr)

// This should fill up the worker node and most of the master node
aw := createDeploymentAWwith550CPU(context, appendRandomString("aw-deployment-2-550cpu"))
//aw := createDeploymentAWwith550CPU(context, appendRandomString("aw-deployment-2-550cpu"))
cap := getClusterCapacitycontext(context)
resource := cpuDemand(cap, 0.275).String()
aw := createGenericDeploymentCustomPodResourcesWithCPUAW(
context, appendRandomString("aw-ff-deployment-55-percent-cpu"), resource, resource, 2, 60)
appwrappers = append(appwrappers, aw)
err := waitAWPodsReady(context, aw)
Expect(err).NotTo(HaveOccurred(), "Waiting for pods to be ready for app wrapper: aw-deployment-2-550cpu")
Expect(err).NotTo(HaveOccurred(), "Waiting for pods to be ready for app wrapper: aw-ff-deployment-55-percent-cpu")

// This should not fit on the cluster.
// There may be a false positive dispatch, which will cause MCAD to requeue the AW.
aw2 := createDeploymentAWwith426CPU(context, appendRandomString("aw-deployment-2-426cpu"))
//aw2 := createDeploymentAWwith426CPU(context, appendRandomString("aw-deployment-2-426cpu"))
resource2 := cpuDemand(cap, 0.5).String()
aw2 := createGenericDeploymentCustomPodResourcesWithCPUAW(
context, appendRandomString("aw-ff-deployment-40-percent-cpu"), resource2, resource2, 1, 60)
appwrappers = append(appwrappers, aw2)

err = waitAWPodsReady(context, aw2)
Expect(err).To(HaveOccurred(), "No pods for app wrapper `aw-deployment-2-426cpu` are expected.")
Expect(err).To(HaveOccurred(), "No pods for app wrapper `aw-ff-deployment-40-percent-cpu` are expected.")
})

It("MCAD Deployment RunningHoldCompletion Test", func() {
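The queue.go changes replace fixed millicore values (550m, 850m, 426m, 340m) with demands computed as a fraction of measured cluster capacity. A minimal sketch of the arithmetic, assuming a hypothetical two-node kind cluster with 2000m allocatable per node (the helper mirrors the cpuDemand added in util.go below; the capacity figure is an assumption for illustration):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// cpuDemand mirrors the helper added in test/e2e/util.go: it converts a
// fraction of the cluster's total milli-CPU into a resource.Quantity.
func cpuDemand(clusterMilliCPU int64, fraction float64) *resource.Quantity {
	return resource.NewMilliQuantity(int64(float64(clusterMilliCPU)*fraction), resource.DecimalSI)
}

func main() {
	const totalMilliCPU = 4000 // assumed: 2 nodes x 2000m allocatable

	// Two replicas at 27.5% each fill 55% of the cluster.
	fmt.Println(cpuDemand(totalMilliCPU, 0.275)) // "1100m"
	// One pod at 40% fits in aggregate but not on a single partially filled
	// node, which is what triggers fail-fast preemption.
	fmt.Println(cpuDemand(totalMilliCPU, 0.4)) // "1600m"
	// Two replicas at 15% each fit once the 40% pod is preempted.
	fmt.Println(cpuDemand(totalMilliCPU, 0.15)) // "600m"
}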
38 changes: 38 additions & 0 deletions test/e2e/util.go
@@ -40,6 +40,8 @@ import (

arbv1 "github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/controller/v1beta1"
versioned "github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/client/clientset/versioned"
"github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/clusterstate/api"
clusterstateapi "github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/clusterstate/api"
)

var ninetySeconds = 90 * time.Second
@@ -793,6 +795,36 @@ func createDeploymentAWwith550CPU(context *context, name string) *arbv1.AppWrapper {
return appwrapper
}

func getClusterCapacitycontext(context *context) *clusterstateapi.Resource {
capacity := clusterstateapi.EmptyResource()
nodes, _ := context.kubeclient.CoreV1().Nodes().List(context.ctx, metav1.ListOptions{})
Review comment (Contributor, Author):
We should handle the error here.
for _, node := range nodes.Items {
// skip unschedulable nodes
if node.Spec.Unschedulable {
continue
}
nodeResource := clusterstateapi.NewResource(node.Status.Allocatable)
capacity.Add(nodeResource)
fieldSelector := fmt.Sprintf("spec.nodeName=%s", node.Name)
podList, err := context.kubeclient.CoreV1().Pods("").List(context.ctx, metav1.ListOptions{FieldSelector: fieldSelector})
// TODO: when no pods are listed, do we report the entire node capacity as available?
// That would cause a false positive dispatch.
if err != nil {
Review comment (Contributor, Author):
Should the error be caught like this instead?

Suggested change:
-	if err != nil {
+	Expect(err).NotTo(HaveOccurred())
fmt.Errorf("[allocatableCapacity] Error listing pods %v", err)
}
for _, pod := range podList.Items {
if _, ok := pod.GetLabels()["appwrappers.mcad.ibm.com"]; !ok && pod.Status.Phase != v1.PodFailed && pod.Status.Phase != v1.PodSucceeded {
for _, container := range pod.Spec.Containers {
usedResource := clusterstateapi.NewResource(container.Resources.Requests)
capacity.Sub(usedResource)
}
}
}
}
return capacity
}

func createDeploymentAWwith350CPU(context *context, name string) *arbv1.AppWrapper {
rb := []byte(`{"apiVersion": "apps/v1",
"kind": "Deployment",
@@ -2705,3 +2737,9 @@ func AppWrapper(context *context, namespace string, name string) func(g gomega.G
func AppWrapperState(aw *arbv1.AppWrapper) arbv1.AppWrapperState {
return aw.Status.State
}

func cpuDemand(cap *clusterstateapi.Resource, fractionOfCluster float64) *resource.Quantity {
	// Convert a fraction of the cluster's total CPU into a milli-CPU Quantity.
milliDemand := int64(float64(cap.MilliCPU) * fractionOfCluster)
return resource.NewMilliQuantity(milliDemand, resource.DecimalSI)
}