
Conversation

@wainersm
Member

This is the first step towards running the attestation-aware tests on AWS. It introduces a new job to run the non-CoCo tests on EKS clusters, using AMD SEV-SNP podvm instances.

The e2e framework already had some code to instantiate EKS, but it was broken and CoCo wasn't getting installed on Amazon Linux. I switched most of the code from the AWS Go SDK to calling the eksctl tool to provision/deprovision clusters, and migrated to Ubuntu 24.04 workers (where CoCo installs just fine). Along the way I had to make some fixes and adjustments here and there.

Although we have a clean-up mechanism for dangling resources running in our AWS account, I've also adapted (and fixed) the script that runs after the job fails.

The new job will run with continue-on-error while it's unstable. Yes, it's still a bit unstable: the EKS cluster provisioning often fails (I need to investigate and fix that).

Two executions that I used to test:

@wainersm wainersm requested a review from a team as a code owner December 10, 2025 13:34
@wainersm wainersm added CI Issues related to CI workflows provider/aws Issues related to AWS CAA provider labels Dec 10, 2025
@wainersm wainersm force-pushed the ci_aws_coco branch 2 times, most recently from fb0c4ff to 35ad928 on December 11, 2025 14:26
@wainersm
Member Author

Updated just to fix golang lint warnings.

Comment on lines 227 to 236
  - crio
cluster_type:
  - onprem
os:
  - ubuntu
provider:
  - generic
arch:
  - amd64
include:
  - container_runtime: containerd
    cluster_type: eks
    os: ubuntu
    provider: generic
    arch: amd64
Member

Could we refactor this into a table for more readability?

Member Author

Yes, we can!

Member Author

Done. Is that what you wanted, @stevenhorsman?

The current support for creating EKS relies on the AWS SDK for Go
libraries. This has made the implementation a bit complex, but it works.
However, upcoming changes will switch to Ubuntu workers, which would
require even more unneeded code when we could be using a tool like
eksctl instead. That's exactly what this commit does: use eksctl to
create the EKS cluster.

Updated to k8s 1.34 as 1.26 is deprecated. We don't need to handle roles,
the CNI plug-in, etc., because eksctl carries all of that out. However,
it still relies on the already-created VPC and subnets to run the podvms.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
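
For reference, a minimal sketch of what the create path looks like when shelling out to eksctl from Go; the function name and flag values are illustrative, not the exact framework code:

```go
package main

import (
	"os"
	"os/exec"
)

// createEKSCluster shells out to eksctl, which takes care of roles, the CNI
// plug-in, and the rest of the cluster bring-up.
func createEKSCluster(name, region string) error {
	cmd := exec.Command("eksctl", "create", "cluster",
		"--name", name,
		"--region", region,
		"--version", "1.34", // 1.26 is deprecated
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}
```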
The kata-remote runtimeclass is taking longer than 60 seconds to show up
in EKS, so this just increases the timeout.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
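
A minimal sketch of the kind of wait this tunes, assuming client-go; the 120s timeout and 5s interval are illustrative:

```go
import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForRuntimeClass polls until the kata-remote runtimeclass shows up,
// tolerating the slower propagation seen on EKS.
func waitForRuntimeClass(ctx context.Context, cs kubernetes.Interface) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 120*time.Second, true,
		func(ctx context.Context) (bool, error) {
			_, err := cs.NodeV1().RuntimeClasses().Get(ctx, "kata-remote", metav1.GetOptions{})
			if apierrors.IsNotFound(err) {
				return false, nil // not there yet, keep polling
			}
			return err == nil, err
		})
}
```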
The system changes made for Kata/CoCo break containerd on Amazon Linux
workers. Switched to Ubuntu 24.04 workers.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
It will use eksctl to delete the EKS cluster. The tool should take care
of deleting the node groups, the CloudFormation resources, and the
cluster itself. A timeout of 15 min should be enough.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
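
A minimal sketch of the delete path (same os/exec imports as the create sketch above); the flags follow eksctl's CLI:

```go
// deleteEKSCluster lets eksctl tear down the node groups, the CloudFormation
// stacks, and the cluster itself.
func deleteEKSCluster(name, region string) error {
	cmd := exec.Command("eksctl", "delete", "cluster",
		"--name", name,
		"--region", region,
		"--wait", // block until the CloudFormation stacks are gone
		"--timeout", "15m",
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}
```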
Set the podvm_aws_instance_type property to use an instance type other
than the default (t2.medium).

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
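
A minimal sketch of how such a property could be consumed; the properties map is illustrative, not the exact framework code:

```go
// podvmInstanceType returns the configured instance type, falling back to
// the default when the property is unset.
func podvmInstanceType(properties map[string]string) string {
	if t := properties["podvm_aws_instance_type"]; t != "" {
		return t
	}
	return "t2.medium" // the default mentioned above
}
```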
For regions other than us-east-1, a location constraint needs to be
specified, otherwise the creation fails.

Assisted-by: Cursor
Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
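
A minimal sketch of the S3 quirk being worked around, using the AWS SDK for Go v2; the function and parameter names are illustrative:

```go
import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/s3/types"
)

// createBucket sets a location constraint everywhere except us-east-1,
// which rejects one.
func createBucket(ctx context.Context, client *s3.Client, bucket, region string) error {
	input := &s3.CreateBucketInput{Bucket: aws.String(bucket)}
	if region != "us-east-1" {
		input.CreateBucketConfiguration = &types.CreateBucketConfiguration{
			LocationConstraint: types.BucketLocationConstraint(region),
		}
	}
	_, err := client.CreateBucket(ctx, input)
	return err
}
```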
For AMD SEV-SNP confidential VMs, the podvm needs to boot in UEFI mode.
It will automatically opt in if the disablecvm property is false (or empty).

Assisted-by: Cursor
Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
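
A minimal sketch of registering a UEFI-boot AMI with the AWS SDK for Go v2; everything except the BootMode field is illustrative:

```go
import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// registerUefiImage registers the podvm AMI with UEFI boot, which AMD
// SEV-SNP requires.
func registerUefiImage(ctx context.Context, client *ec2.Client, name, snapshotID string) (string, error) {
	out, err := client.RegisterImage(ctx, &ec2.RegisterImageInput{
		Name:           aws.String(name),
		Architecture:   types.ArchitectureValuesX8664,
		BootMode:       types.BootModeValuesUefi, // the UEFI opt-in
		RootDeviceName: aws.String("/dev/xvda"),
		BlockDeviceMappings: []types.BlockDeviceMapping{{
			DeviceName: aws.String("/dev/xvda"),
			Ebs:        &types.EbsBlockDevice{SnapshotId: aws.String(snapshotID)},
		}},
	})
	if err != nil {
		return "", err
	}
	return aws.ToString(out.ImageId), nil
}
```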
Just like the other created resources (VPC, subnet, etc.), it needs to
have a unique name to avoid clashes on CI.

Also, if the eks_name property is passed then it won't attempt to create
the cluster; instead it assumes the cluster was already created and
re-uses it.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
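
A minimal sketch of the create-or-reuse decision; the name scheme is illustrative:

```go
import (
	"fmt"
	"time"
)

// clusterName returns the cluster to use and whether it already exists.
func clusterName(properties map[string]string) (name string, reuse bool) {
	if n := properties["eks_name"]; n != "" {
		return n, true // cluster provided by the caller, re-use it
	}
	// Unique name so concurrent CI runs don't clash.
	return fmt.Sprintf("e2e-eks-%d", time.Now().UnixNano()), false
}
```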
For EKS clusters it creates two subnets, but when the provisioner reads
aws_vpc_subnet_id it assumes a single subnet. Overloaded the meaning of
aws_vpc_subnet_id to allow passing two subnets separated by a comma.

Assisted-by: Cursor
Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
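
A minimal sketch of parsing the overloaded property:

```go
import "strings"

// subnetIDs splits aws_vpc_subnet_id, which may now carry one subnet or a
// comma-separated pair.
func subnetIDs(raw string) []string {
	var ids []string
	for _, s := range strings.Split(raw, ",") {
		if s = strings.TrimSpace(s); s != "" {
			ids = append(ids, s)
		}
	}
	return ids
}
```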
We want to launch confidential VMs on EKS.

If cluster_type is eks then it needs to:
* install the eksctl command
* tweak the test properties to launch a confidential VM

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
Use EKS to test confidential VMs on AWS.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
qemu-img is used to convert the podvm disk from qcow2 to raw before
uploading it to AWS S3, so it is a requirement.

In the current onprem (kcli) job it's installed via
src/cloud-api-adaptor/libvirt/config_libvirt.sh as a side effect of
installing qemu-kvm. But the EKS job doesn't run that script, so let's
install the tool in a workflow step.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
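
A minimal sketch of the conversion the job depends on (same os/exec imports as the eksctl sketches); equivalent to running `qemu-img convert -O raw podvm.qcow2 podvm.raw`:

```go
// convertToRaw converts the qcow2 podvm disk into the raw format expected
// by the S3 upload.
func convertToRaw(qcow2Path, rawPath string) error {
	cmd := exec.Command("qemu-img", "convert", "-O", "raw", qcow2Path, rawPath)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}
```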
The hack/ci-e2e-aws-cleanup.sh script is executed to ensure that
resources are cleaned up if the test framework exited before running the
deprovision code. Adapted the script to also delete EKS.

The resources were not necessarily created in the same region as the
credentials, so let's have the workflow export the region in the
AWS_REGION variable and use it in the script.

Also notice that EKS is created with two subnets; the aws_vpc_subnet_id
property is then a comma-separated list of subnets.

Assisted-by: Cursor
Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
To avoid having to install unneeded libvirt packages in the CI runner.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
In "Config aws" the AWS_REGION is exported, that variable is used in the
hack/ci-e2e-aws-cleanup.sh to clean up dangling resources. However, if
it has "Configure aws credentials" running afterwards, AWS_REGION is
re-set to us-east-1. Let's run "Configure aws credentials" earlier to
avoid that problem.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
If the secondary subnet (for EKS) was created then it should be deleted
too, otherwise the VPC cannot be destroyed.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
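
A minimal sketch of the extra deletion, assuming the AWS SDK for Go v2 ec2 packages from the UEFI sketch above:

```go
// deleteSubnet removes a subnet; the secondary (EKS) one must go before
// the VPC can be destroyed.
func deleteSubnet(ctx context.Context, client *ec2.Client, subnetID string) error {
	_, err := client.DeleteSubnet(ctx, &ec2.DeleteSubnetInput{
		SubnetId: aws.String(subnetID),
	})
	return err
}
```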
Keep running the new job on CI but ignore its failures until it's proven
stable.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>