
Add registry allow list for DatadogLibrary CSI volumes#72

Open
knusbaum wants to merge 5 commits into main from knusbaum/registry-allow-list

Conversation

knusbaum commented Mar 17, 2026

Summary

  • Adds --registry-allow-list CLI flag (env: DD_REGISTRY_ALLOW_LIST) to restrict which OCI registries are permitted for DatadogLibrary volumes
  • If the list is non-empty, publish requests specifying an unlisted registry are rejected with an error
  • Empty list (default) allows all registries — fully backward compatible

Test plan

  • go test ./pkg/driver/publishers/... ./pkg/librarymanager/... passes
  • go build ./... passes
  • Deploy via Helm with DD_REGISTRY_ALLOW_LIST set and verify allowed registry succeeds
  • Deploy via Helm with DD_REGISTRY_ALLOW_LIST set and verify disallowed registry is rejected
  • Verify empty list (default) continues to allow all registries

Closes INPLAT-881

Adds a configurable allow list of trusted registries for DatadogLibrary
volumes. When non-empty, publish requests specifying an unlisted registry
are rejected. Empty list (default) allows all registries, preserving
backward compatibility.
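The check described above can be sketched as follows. This is an illustrative sketch, not the actual driver code: the function name `isRegistryAllowed` and the exact-match semantics are assumptions based on the PR description.

```go
package main

import (
	"fmt"
	"strings"
)

// isRegistryAllowed is a hypothetical sketch of the allow-list check:
// an empty allow list permits every registry; otherwise the registry
// portion of the image reference must exactly match an entry.
func isRegistryAllowed(allowList []string, imageRef string) bool {
	if len(allowList) == 0 {
		return true // empty list (default): fully backward compatible
	}
	// The registry is everything before the first "/" in an OCI reference.
	registry := strings.SplitN(imageRef, "/", 2)[0]
	for _, allowed := range allowList {
		if registry == allowed {
			return true
		}
	}
	return false
}

func main() {
	allow := []string{"gcr.io", "docker.io"}
	fmt.Println(isRegistryAllowed(allow, "gcr.io/datadog/lib:v1"))   // allowed
	fmt.Println(isRegistryAllowed(allow, "evil.example/lib:v1"))     // rejected
	fmt.Println(isRegistryAllowed(nil, "evil.example/lib:v1"))       // empty list allows all
}
```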
knusbaum (Author) commented:

Helm chart PR: DataDog/helm-charts#2488

Fix two bugs in the registry allow list feature:

1. Viper's GetStringSlice treats a comma-separated env var as a single
   element. DD_REGISTRY_ALLOW_LIST=a,b was parsed as ["a,b"] instead of
   ["a","b"]. Add a helper that splits entries on commas.

2. NodePublishVolume returned a gRPC error when a publisher failed,
   which caused pods to be stuck in ContainerCreating indefinitely.
   Now log the error and return success so pods start without the
   volume content (graceful degradation).
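The Viper workaround in point 1 could look roughly like the following. The helper name `splitStringSlice` is illustrative, not necessarily what the PR uses.

```go
package main

import (
	"fmt"
	"strings"
)

// splitStringSlice works around Viper's GetStringSlice returning a
// comma-separated env var as a single element: each entry is split on
// commas, whitespace-trimmed, and empty entries are dropped.
func splitStringSlice(entries []string) []string {
	var out []string
	for _, entry := range entries {
		for _, part := range strings.Split(entry, ",") {
			if p := strings.TrimSpace(part); p != "" {
				out = append(out, p)
			}
		}
	}
	return out
}

func main() {
	// DD_REGISTRY_ALLOW_LIST=a,b comes back from GetStringSlice as ["a,b"].
	fmt.Println(splitStringSlice([]string{"a,b"})) // [a b]
}
```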
@knusbaum knusbaum marked this pull request as ready for review March 18, 2026 20:41
@knusbaum knusbaum requested a review from a team as a code owner March 18, 2026 20:41
iamluc previously approved these changes Mar 20, 2026

iamluc (Contributor) left a comment:

LGTM 👍
I’m just not sure about the change in behavior for errored requests 🤔

Comment on lines +37 to +45
```diff
 	log.Error("failed to publish volume, pod will start without this volume's content", "error", err)
 	return &csi.NodePublishVolumeResponse{}, nil
 }

 if resp == nil {
 	volumeCtx := req.GetVolumeContext()
 	metrics.RecordVolumeMountAttempt(volumeCtx["type"], req.GetTargetPath(), metrics.StatusUnsupported)
-	return nil, fmt.Errorf("unsupported volume type: %q", volumeCtx["type"])
+	log.Error("unsupported volume type, pod will start without this volume's content", "type", volumeCtx["type"])
+	return &csi.NodePublishVolumeResponse{}, nil
```
iamluc (Contributor):
I think it might be worth not returning an error for “unpublish” requests, but I'm not sure we should do the same for “publish” ones 🤔

knusbaum (Author):

Since we're guarding this in the admission controller, I think we can return an error on publish here.

What I didn't want (and what was the case before we added the admission controller change) was that the admission controller would mutate the pod, then the CSI driver would return an error, and the pod would fail to start.

IMO, it's preferable to lose instrumentation over causing application failures. The admission controller will prevent that by not mutating the pod.

iamluc (Contributor):

During a rollout, it might be safer to fail new pods and keep the existing ones running, rather than silently disabling instrumentation, which could lead to unexpected behavior or even break some applications.

WDYT @adel121

adel121 (Collaborator):

I tend to agree more with @iamluc

When a pod fails to mount the volume, the user will easily notice.
If instrumentation is skipped, it will probably go unnoticed for some time.

But I think we should think of a more reliable solution so that if a pod is created at time T, we should impose the allowlist with the version that existed at time T. I think this is not a straightforward problem, but we can defer it for now while documenting the limitation.

knusbaum (Author):

OK, I can adjust this to return errors if pods launch with an image from a disallowed registry (both in the admission controller and the CSI driver).

But I think we should think of a more reliable solution so that if a pod is created at time T, we should impose the allowlist with the version that existed at time T

I guess the concern here is:
Pod gets launched with image (success) -> Cluster gets reconfigured with allow list -> Pod continues to run until reschedule -> On reschedule, pod starts up and fails.

The way around this, I suppose, is to not have the CSI Driver validate this at all, and do it entirely in the admission controller - if the admission controller allowed it, it's allowed forever.

While this definitely is more reliable, I'm not sure I'd want this from a security standpoint. If I, as a customer, am trying to enforce some security requirement, I don't want to be able to accidentally have old pods floating around with disallowed images.

adel121 (Collaborator):

The way around this, I suppose, is to not have the CSI Driver validate this at all, and do it entirely in the admission controller - if the admission controller allowed it, it's allowed forever.

What about pods mounting a Datadog CSI volume, but not being mutated by the admission controller?

I mean what if a user manually creates a pod that doesn't go through admission mutation, and requests a CSI volume with a registry that is not in the allowlist?

adel121 (Collaborator):

I think the allowlist should be in the CSI driver because it provides a safety guarantee that the csi driver never pulls from a registry outside the allowlisted ones.

The admission controller has to know about the allowlist because it is a client to the CSI driver.

The way I see it, if I am an SRE managing the CSI driver:

  • When adding a new registry to the allowlist: it is simpler, we just wait for the CSI daemonset to rollout, and then announce that the new registry is now allowlisted.

  • When dropping a registry from the allowlist: we can deal with the transient state by first announcing the removal of the registry from the allowlist (but don't remove it yet), and after some time, we update the allowlist. In case a registry should be urgently blocked (because it is under attack for example), then sudden removal makes sense even if it breaks some pods.

Per discussion with @iamluc and @adel121: publish failures should return
errors so that pods fail to start when the registry allow list rejects
them. This is preferable to silently skipping instrumentation, which
would go unnoticed.

Unpublish failures continue to return success — cleanup should be
graceful and not block pod deletion.