Add registry allow list for DatadogLibrary CSI volumes#72
Add registry allow list for DatadogLibrary CSI volumes#72
Conversation
Adds a configurable allow list of trusted registries for DatadogLibrary volumes. When non-empty, publish requests specifying an unlisted registry are rejected. Empty list (default) allows all registries, preserving backward compatibility.
|
Helm chart PR: DataDog/helm-charts#2488 |
Fix two bugs in the registry allow list feature: 1. Viper's GetStringSlice treats a comma-separated env var as a single element. DD_REGISTRY_ALLOW_LIST=a,b was parsed as ["a,b"] instead of ["a","b"]. Add a helper that splits entries on commas. 2. NodePublishVolume returned a gRPC error when a publisher failed, which caused pods to be stuck in ContainerCreating indefinitely. Now log the error and return success so pods start without the volume content (graceful degradation).
iamluc
left a comment
There was a problem hiding this comment.
LGTM 👍
I’m just not sure about the change in behavior for errored requests 🤔
pkg/driver/node.go
Outdated
| log.Error("failed to publish volume, pod will start without this volume's content", "error", err) | ||
| return &csi.NodePublishVolumeResponse{}, nil | ||
| } | ||
|
|
||
| if resp == nil { | ||
| volumeCtx := req.GetVolumeContext() | ||
| metrics.RecordVolumeMountAttempt(volumeCtx["type"], req.GetTargetPath(), metrics.StatusUnsupported) | ||
| return nil, fmt.Errorf("unsupported volume type: %q", volumeCtx["type"]) | ||
| log.Error("unsupported volume type, pod will start without this volume's content", "type", volumeCtx["type"]) | ||
| return &csi.NodePublishVolumeResponse{}, nil |
There was a problem hiding this comment.
I think it might be worth not returning an error for “unpublish” requests, but I'm not sure we should do the same for “publish” ones 🤔
There was a problem hiding this comment.
Since we're guarding this in the admission controller, I think we can return an error on publish here.
What I didn't want (and what was the case before we added the admission controller change) was that the admission controller would mutate the pod, then the CSI driver would return an error, and the pod would fail to start.
IMO, it's preferable to lose instrumentation over causing application failures. The admission controller will prevent that by not mutating the pod.
There was a problem hiding this comment.
During a rollout, it might be safer to fail new pods and keep the existing ones running, rather than silently disabling instrumentation, which could lead to unexpected behavior or even break some applications.
WDYT @adel121
There was a problem hiding this comment.
I tend to agree more with @iamluc
When a pod fails to mount the volume, the user will easily notice.
If instrumentation is skipped, it will probably go unnoticed for some time.
But I think we should think of a more reliable solution so that if a pod is created at time T, we should impose the allowlist with the version that existed at time T. I think this is not a straightforward problem, but we can defer it for now while documenting the limitation.
There was a problem hiding this comment.
Ok, I can adjust this to return errors if pods launch with an image from a disallowed registry (both in the admission controller and the CSI Driver)
But I think we should think of a more reliable solution so that if a pod is created at time T, we should impose the allowlist with the version that existed at time T
I guess the concern here is:
Pod gets launched with image (success) -> Cluster gets reconfigured with allow list -> Pod continues to run until reschedule -> On reschedule, pod starts up and fails.
The way around this, I suppose, is to not have the CSI Driver validate this at all, and do it entirely in the admission controller - if the admission controller allowed it, it's allowed forever.
While this definitely is more reliable, I'm not sure I'd want this from a security standpoint. If I, as a customer, am trying to enforce some security requirement, I don't want to be able to accidentally have old pods floating around with disallowed images.
There was a problem hiding this comment.
The way around this, I suppose, is to not have the CSI Driver validate this at all, and do it entirely in the admission controller - if the admission controller allowed it, it's allowed forever.
What about pods mounting a Datadog CSI volume, but not being mutated by the admission controller?
I mean what if a user manually creates a pod that doesn't go through admission mutation, and requests a CSI volume with a registry that is not in the allowlist?
There was a problem hiding this comment.
I think the allowlist should be in the CSI driver because it provides a safety guarantee that the csi driver never pulls from a registry outside the allowlisted ones.
The admission controller has to know about the allowlist because it is a client to the CSI driver.
The way I see it, if I am an SRE managing the CSI driver:
-
When adding a new registry to the allowlist: it is simpler, we just wait for the CSI daemonset to rollout, and then announce that the new registry is now allowlisted.
-
When dropping a registry from the allowlist: we can deal with the transient state by first announcing the removal of the registry from the allowlist (but don't remove it yet), and after some time, we update the allowlist. In case a registry should be urgently blocked (because it is under attack for example), then sudden removal makes sense even if it breaks some pods.
Per discussion with @iamluc and @adel121: publish failures should return errors so that pods fail to start when the registry allow list rejects them. This is preferable to silently skipping instrumentation, which would go unnoticed. Unpublish failures continue to return success — cleanup should be graceful and not block pod deletion.
Summary
--registry-allow-listCLI flag (env:DD_REGISTRY_ALLOW_LIST) to restrict which OCI registries are permitted forDatadogLibraryvolumesTest plan
go test ./pkg/driver/publishers/... ./pkg/librarymanager/...passesgo build ./...passesDD_REGISTRY_ALLOW_LISTset and verify allowed registry succeedsDD_REGISTRY_ALLOW_LISTset and verify disallowed registry is rejectedCloses INPLAT-881