
Add registry allow list for DatadogLibrary CSI volumes#72

Open
knusbaum wants to merge 5 commits into main from knusbaum/registry-allow-list

Conversation

knusbaum commented Mar 17, 2026

Summary

  • Adds --registry-allow-list CLI flag (env: DD_REGISTRY_ALLOW_LIST) to restrict which OCI registries are permitted for DatadogLibrary volumes
  • If the list is non-empty, publish requests specifying an unlisted registry are rejected with an error
  • Empty list (default) allows all registries — fully backward compatible

Test plan

  • go test ./pkg/driver/publishers/... ./pkg/librarymanager/... passes
  • go build ./... passes
  • Deploy via Helm with DD_REGISTRY_ALLOW_LIST set and verify allowed registry succeeds
  • Deploy via Helm with DD_REGISTRY_ALLOW_LIST set and verify disallowed registry is rejected
  • Verify empty list (default) continues to allow all registries

Closes INPLAT-881

Adds a configurable allow list of trusted registries for DatadogLibrary
volumes. When non-empty, publish requests specifying an unlisted registry
are rejected. Empty list (default) allows all registries, preserving
backward compatibility.
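The check described above can be sketched as follows. This is an illustrative sketch, not the actual driver code: the function name `isRegistryAllowed` and the exact-match semantics are assumptions based on the PR description.

```go
package main

import (
	"fmt"
	"strings"
)

// isRegistryAllowed is a hypothetical sketch of the allow-list check:
// an empty allow list permits every registry; otherwise the registry
// portion of the image reference must exactly match an entry.
func isRegistryAllowed(allowList []string, imageRef string) bool {
	if len(allowList) == 0 {
		return true // empty list (default): fully backward compatible
	}
	// The registry is everything before the first "/" in an OCI reference.
	registry := strings.SplitN(imageRef, "/", 2)[0]
	for _, allowed := range allowList {
		if registry == allowed {
			return true
		}
	}
	return false
}

func main() {
	allow := []string{"gcr.io", "docker.io"}
	fmt.Println(isRegistryAllowed(allow, "gcr.io/datadog/lib:v1"))   // allowed
	fmt.Println(isRegistryAllowed(allow, "evil.example/lib:v1"))     // rejected
	fmt.Println(isRegistryAllowed(nil, "evil.example/lib:v1"))       // empty list allows all
}
```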
knusbaum (Author) commented:

Helm chart PR: DataDog/helm-charts#2488

Fix two bugs in the registry allow list feature:

1. Viper's GetStringSlice treats a comma-separated env var as a single
   element. DD_REGISTRY_ALLOW_LIST=a,b was parsed as ["a,b"] instead of
   ["a","b"]. Add a helper that splits entries on commas.

2. NodePublishVolume returned a gRPC error when a publisher failed,
   which caused pods to be stuck in ContainerCreating indefinitely.
   Now log the error and return success so pods start without the
   volume content (graceful degradation).
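The Viper workaround in point 1 could look roughly like the following. The helper name `splitStringSlice` is illustrative, not necessarily what the PR uses.

```go
package main

import (
	"fmt"
	"strings"
)

// splitStringSlice works around Viper's GetStringSlice returning a
// comma-separated env var as a single element: each entry is split on
// commas, whitespace-trimmed, and empty entries are dropped.
func splitStringSlice(entries []string) []string {
	var out []string
	for _, entry := range entries {
		for _, part := range strings.Split(entry, ",") {
			if p := strings.TrimSpace(part); p != "" {
				out = append(out, p)
			}
		}
	}
	return out
}

func main() {
	// DD_REGISTRY_ALLOW_LIST=a,b comes back from GetStringSlice as ["a,b"].
	fmt.Println(splitStringSlice([]string{"a,b"})) // [a b]
}
```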
@knusbaum knusbaum marked this pull request as ready for review March 18, 2026 20:41
@knusbaum knusbaum requested a review from a team as a code owner March 18, 2026 20:41
iamluc previously approved these changes Mar 20, 2026

iamluc (Contributor) left a comment:

LGTM 👍
I’m just not sure about the change in behavior for errored requests 🤔

Comment on lines +37 to +45
```diff
 	log.Error("failed to publish volume, pod will start without this volume's content", "error", err)
 	return &csi.NodePublishVolumeResponse{}, nil
 }

 if resp == nil {
 	volumeCtx := req.GetVolumeContext()
 	metrics.RecordVolumeMountAttempt(volumeCtx["type"], req.GetTargetPath(), metrics.StatusUnsupported)
-	return nil, fmt.Errorf("unsupported volume type: %q", volumeCtx["type"])
+	log.Error("unsupported volume type, pod will start without this volume's content", "type", volumeCtx["type"])
+	return &csi.NodePublishVolumeResponse{}, nil
```
iamluc (Contributor):
I think it might be worth not returning an error for “unpublish” requests, but I'm not sure we should do the same for “publish” ones 🤔

knusbaum (Author):

Since we're guarding this in the admission controller, I think we can return an error on publish here.

What I didn't want (and what was the case before we added the admission controller change) was that the admission controller would mutate the pod, then the CSI driver would return an error, and the pod would fail to start.

IMO, it's preferable to lose instrumentation over causing application failures. The admission controller will prevent that by not mutating the pod.

iamluc (Contributor):

During a rollout, it might be safer to fail new pods and keep the existing ones running, rather than silently disabling instrumentation, which could lead to unexpected behavior or even break some applications.

WDYT @adel121

adel121 (Collaborator):

I tend to agree more with @iamluc

When a pod fails to mount the volume, the user will easily notice.
If instrumentation is skipped, it will probably go unnoticed for some time.

But I think we should think of a more reliable solution so that if a pod is created at time T, we should impose the allowlist with the version that existed at time T. I think this is not a straightforward problem, but we can defer it for now while documenting the limitation.

knusbaum (Author):

OK, I can adjust this to return errors if pods launch with an image from a disallowed registry (both in the admission controller and the CSI driver).

But I think we should think of a more reliable solution so that if a pod is created at time T, we should impose the allowlist with the version that existed at time T

I guess the concern here is:
Pod gets launched with image (success) -> Cluster gets reconfigured with allow list -> Pod continues to run until reschedule -> On reschedule, pod starts up and fails.

The way around this, I suppose, is to not have the CSI Driver validate this at all, and do it entirely in the admission controller - if the admission controller allowed it, it's allowed forever.

While this definitely is more reliable, I'm not sure I'd want this from a security standpoint. If I, as a customer, am trying to enforce some security requirement, I don't want to be able to accidentally have old pods floating around with disallowed images.

adel121 (Collaborator):

The way around this, I suppose, is to not have the CSI Driver validate this at all, and do it entirely in the admission controller - if the admission controller allowed it, it's allowed forever.

What about pods mounting a Datadog CSI volume, but not being mutated by the admission controller?

I mean what if a user manually creates a pod that doesn't go through admission mutation, and requests a CSI volume with a registry that is not in the allowlist?

adel121 (Collaborator):

I think the allowlist should be in the CSI driver because it provides a safety guarantee that the csi driver never pulls from a registry outside the allowlisted ones.

The admission controller has to know about the allowlist because it is a client to the CSI driver.

The way I see it, if I am an SRE managing the CSI driver:

  • When adding a new registry to the allowlist: it is simpler, we just wait for the CSI daemonset to rollout, and then announce that the new registry is now allowlisted.

  • When dropping a registry from the allowlist: we can deal with the transient state by first announcing the removal of the registry from the allowlist (but don't remove it yet), and after some time, we update the allowlist. In case a registry should be urgently blocked (because it is under attack for example), then sudden removal makes sense even if it breaks some pods.

Per discussion with @iamluc and @adel121: publish failures should return
errors so that pods fail to start when the registry allow list rejects
them. This is preferable to silently skipping instrumentation, which
would go unnoticed.

Unpublish failures continue to return success — cleanup should be
graceful and not block pod deletion.