gpu: fix operator deployment instructions #20552

gjulianm · 2025-06-19T11:39:14Z

What does this PR do?

Updates the GPU deployment instructions to fix the deployment with the Datadog operator.

Motivation

Review checklist (to be filled by reviewers)

Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged.

janine-c

Looks great! I made some minor writing suggestions to try to make the directions a bit easier to parse and to follow our style guide a little more closely. If you have any questions, don't hesitate to let me know!

gpu/README.md

janine-c · 2025-06-19T17:43:03Z

gpu/README.md

+```yaml
+spec:
+  features:
+    oomKill: # Only enable this feature if there is nothing else that requires the system-probe container in all agent pods


Suggested change

oomKill: # Only enable this feature if there is nothing else that requires the system-probe container in all agent pods

oomKill: # Only enable this feature if there is nothing else that requires the system-probe container in all Agent pods

val06 · Jun 20, 2025

Why do we need to enable oomKill for system-probe (SP)? Isn't enabling GPU Monitoring (GPUM) enough?

gjulianm · Jun 23, 2025

The problem is that we need the operator to enable system-probe for all pods, because the DatadogAgentProfile (DAP) has no effect on whether it is enabled. So we need to enable a feature that turns on system-probe everywhere, and oomKill seemed like a good candidate.

val06 · Jun 23, 2025

So why not enable GPUM then? It is also a system-probe module.

gjulianm · Jun 24, 2025

These are the instructions for mixed environments, so we can't enable GPUM everywhere.

val06 · Jun 24, 2025

Discussed this on Slack; resolving for now.
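For readers following the thread, a minimal sketch of the DatadogAgent manifest change under discussion could look like the following. It assumes the operator's `v2alpha1` DatadogAgent resource and its `features.oomKill.enabled` field; the resource name `datadog` is a hypothetical placeholder, and the README in this PR remains the authoritative version.

```yaml
# Sketch only: metadata.name is a hypothetical placeholder.
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  features:
    # Enabling oomKill is used here as a way to make the operator add the
    # system-probe container to every Agent pod. Only enable it if nothing
    # else already requires system-probe in all Agent pods.
    oomKill:
      enabled: true
```

The point of the sketch is that system-probe is turned on cluster-wide through the DatadogAgent resource, while GPU monitoring itself stays scoped to GPU nodes through the profile.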

val06

reviewed

val06 · 2025-06-20T08:20:29Z

gpu/README.md

+```yaml
+spec:
+  features:
+    oomKill: # Only enable this feature if there is nothing else that requires the system-probe container in all agent pods


Why do we need to enable oomKill for system-probe (SP)? Isn't enabling GPU Monitoring (GPUM) enough?

gpu/README.md

Review from janine-c is dismissed. Related teams and files:

documentation
- gpu/README.md

Co-authored-by: Janine Chan <[email protected]>

janine-c

Looks great! I have one very minor suggestion for flow, but otherwise am a fan 🙂

janine-c · 2025-06-27T17:18:52Z

gpu/README.md

-For **mixed environments**, use the [DatadogAgentProfiles feature](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md) of the operator, which allows different configurations to be deployed for different nodes. In this case, it is not necessary to modify the DatadogAgent manifest. Instead, create a profile that enables the configuration on GPU nodes only:
+For **mixed environments**, use the [DatadogAgentProfiles (DAP) feature](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md) of the operator, which allows different configurations to be deployed for different nodes. Note that this feature is disabled by default, so it needs to be enabled. For more information, see [Enabling DatadogAgentProfiles](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md#enabling-datadogagentprofiles).
+
+Modifying the DatadogAgent manifest is necessary to enable certain features that are not supported by the DAP yet:


Suggested change

Modifying the DatadogAgent manifest is necessary to enable certain features that are not supported by the DAP yet:

Additionally, modify the DatadogAgent manifest to enable certain features that are not supported by the DAP yet:

☝🏻 Suggested wording to make these sections flow better together. Also, I noticed that we use both "the DatadogAgent manifest" and "the DatadogAgent configuration". Do they refer to the same thing? If so, it probably makes sense to use consistent terminology.
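As context for the mixed-environment instructions being reviewed, a DatadogAgentProfile that targets only GPU nodes might be sketched roughly as follows. This is an illustration under stated assumptions, not the README's exact content: the profile name `gpu-nodes` is hypothetical, the node label `nvidia.com/gpu.present` assumes the NVIDIA GPU operator or feature discovery is labeling nodes, and the environment variable name is an assumption that should be checked against gpu/README.md.

```yaml
# Sketch only: names and the env var below are assumptions, see lead-in.
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgentProfile
metadata:
  name: gpu-nodes            # hypothetical profile name
spec:
  profileAffinity:
    profileNodeAffinity:
      # Apply this profile only to nodes labeled as having a GPU.
      - key: nvidia.com/gpu.present
        operator: In
        values:
          - "true"
  config:
    override:
      nodeAgent:
        containers:
          system-probe:
            env:
              # Assumed variable name for enabling the GPU module.
              - name: DD_GPU_MONITORING_ENABLED
                value: "true"
```

The profile only overrides settings on matching nodes; it does not add the system-probe container by itself, which is why the DatadogAgent manifest also needs a change.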

Update operator instructions

0c94647

gjulianm self-assigned this Jun 19, 2025

temporal-github-worker-1 bot added docs/review-requested ecosystems/review-requested product/review-requested labels Jun 19, 2025

datadog-agent-integrations-bot bot added documentation integration/gpu labels Jun 19, 2025

gjulianm marked this pull request as ready for review June 19, 2025 11:39

gjulianm requested review from a team as code owners June 19, 2025 11:39

datadog-agent-integrations-bot bot added team/documentation team/ebpf-platform labels Jun 19, 2025

gjulianm added the qa/skip-qa label Jun 19, 2025

janine-c previously approved these changes Jun 19, 2025

temporal-github-worker-1 bot added docs/approved and removed docs/review-requested labels Jun 19, 2025

val06 requested changes Jun 20, 2025

Fix read-only paths

c387439

temporal-github-worker-1 bot added docs/review-requested and removed docs/approved labels Jun 23, 2025

gjulianm and others added 3 commits June 23, 2025 15:11

Update gpu/README.md

3feb791

Co-authored-by: Janine Chan <[email protected]>

Update gpu/README.md

5848602

Co-authored-by: Janine Chan <[email protected]>

Update gpu/README.md

b13b272

Co-authored-by: Janine Chan <[email protected]>

val06 approved these changes Jun 24, 2025

janine-c approved these changes Jun 27, 2025

temporal-github-worker-1 bot added docs/approved and removed docs/review-requested labels Jun 27, 2025
