SGX-enabled pods sometimes get created without SGX device mounted

**Describe the issue**
I'm not entirely sure whether this is a bug in IDP, an upstream K8s/kubelet issue, or simply something that needs an explanation of how to properly ensure that sgx.intel.com/epc is already available.

We are seeing a randomly occurring situation where Pods with "requests: sgx.intel.com/epc: 1Gi" defined in the Pod spec get created without the necessary /dev/sgx_enclave volume mount. The workload then fails because the SGX device is missing (ENOENT error in Gramine).

* When the containers are inspected with crictl, the mount points are missing.
* We've seen this happen for Pods managed by Deployments, StatefulSets and Jobs.
* Usually, removing the affected resource (Deploy/STS/Job) and resyncing it (equivalent to 'kubectl replace') causes the Pod to start correctly.
* Delaying a specific container (e.g. by using sleep in an init container) makes little difference, perhaps because the problem is tied to Pod (and not container) creation time.
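
For anyone trying to confirm the same symptom, the crictl check described above can be scripted. The following is a minimal sketch, not part of IDP or crictl: the function name `has_sgx_device` is hypothetical, and the JSON layout it reads (`status.mounts[].containerPath`, `info.runtimeSpec.mounts[].destination` and `info.runtimeSpec.linux.devices[].path`) is an assumption about what `crictl inspect -o json <container-id>` returns on the runtime in use, so it may need adjusting.

```python
SGX_DEVICE = "/dev/sgx_enclave"


def has_sgx_device(inspect: dict, device: str = SGX_DEVICE) -> bool:
    """Return True if the inspected container references `device`.

    `inspect` is the parsed JSON of `crictl inspect -o json <container-id>`.
    The key names below are assumptions about that output and may differ
    between container runtimes and versions.
    """
    status = inspect.get("status") or {}
    spec = (inspect.get("info") or {}).get("runtimeSpec") or {}
    linux = spec.get("linux") or {}

    # Collect every path that could expose the device inside the container.
    paths = [m.get("containerPath") for m in status.get("mounts") or []]
    paths += [m.get("destination") for m in spec.get("mounts") or []]
    paths += [d.get("path") for d in linux.get("devices") or []]
    return device in paths
```

The input would typically be obtained by running `json.loads` on the crictl output; a `False` result for a container that requested sgx.intel.com/epc matches the failure described in this issue.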

_My (intuitive) understanding is that until "sgx.intel.com/epc" is registered on the node, the Pod should not get created (because it is unschedulable due to insufficient resources)?_
If that intuition is wrong, what is the correct way to make sure that Pod creation waits for "sgx.intel.com/epc" to become available? Since this is done as part of an automated flow (ArgoCD), we currently cannot be sure in what order the resources get created.
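
To make the question concrete, this is roughly the check we could run before syncing the workload. It is only a sketch: `advertises_epc` is a hypothetical helper, and it reads the standard `status.allocatable` field of a Node object as returned by `kubectl get node <name> -o json`. It shows what the scheduler sees; it does not by itself explain or prevent the behavior described in this issue.

```python
EPC_RESOURCE = "sgx.intel.com/epc"


def advertises_epc(node: dict, resource: str = EPC_RESOURCE) -> bool:
    """Return True if the Node object reports a non-zero allocatable `resource`.

    `node` is the parsed JSON of `kubectl get node <name> -o json`.
    A missing key or a value of "0" is treated as "not available".
    """
    allocatable = (node.get("status") or {}).get("allocatable") or {}
    value = str(allocatable.get(resource, "0")).strip()
    return value not in ("", "0")
```

An automated flow could poll this for the target node and only create the Deployment, StatefulSet or Job once it returns `True`.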

**To Reproduce**
There is no clear reproduction scenario; it happens randomly. We did not find a correlation with a specific OCP version.

**Expected behavior**
The Pod is created with the necessary /dev/sgx_enclave volume mount every time.

**System:**
 - OS version: Red Hat OpenShift 4.14, 4.15
 - Kernel version: RH kernel versions based on 5.14
 - Device plugins version: v0.28 from OpenShift OperatorHub
 - Hardware info: 4th and 5th Gen Intel Xeon Processors with QAT

**Additional context**
It might be an edge case similar to the issue noticed for NVIDIA GPUs (this specific comment): https://github.com/NVIDIA/k8s-device-plugin/issues/291#issuecomment-1041459716.
As in the linked issue, the Pod somehow gets started before sgx.intel.com/epc can be successfully allocated and mounted. We are very curious why that happens.

SGX-enabled pods sometimes get created without SGX device mounted #1695

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

SGX-enabled pods sometimes get created without SGX device mounted #1695

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions