gpu: fix operator deployment instructions #20552

Open: wants to merge 5 commits into master

Conversation

gjulianm (Contributor) commented Jun 19, 2025

What does this PR do?

Updates the GPU deployment instructions to fix the deployment with the Datadog operator.

Motivation

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

janine-c (Contributor) previously approved these changes Jun 19, 2025

Looks great! I made some minor writing suggestions to try to make the directions a bit easier to parse and to follow our style guide a little more closely. If you have any questions, don't hesitate to let me know!

gpu/README.md Outdated
```yaml
spec:
  features:
    oomKill: # Only enable this feature if there is nothing else that requires the system-probe container in all agent pods
```
Contributor

Suggested change
- oomKill: # Only enable this feature if there is nothing else that requires the system-probe container in all agent pods
+ oomKill: # Only enable this feature if there is nothing else that requires the system-probe container in all Agent pods

Contributor

Why do we need to enable oomKill for system-probe (SP)? Isn't enabling GPUM enough?

Contributor Author

The problem is that we need the operator to enable system-probe for all pods, because the DAP has no effect on whether it is enabled. So we need to enable some feature that turns on system-probe everywhere, and oomKill seemed like a good candidate.

Contributor

So why not enable GPUM then? It is also a module in system-probe.

Contributor Author

These are the instructions for the mixed environments, so we can't enable GPUM everywhere.

Contributor

Discussed this on Slack; resolving for now.
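
For readers following this thread, the DatadogAgent change under discussion could look roughly like the sketch below. It assumes the `datadoghq.com/v2alpha1` DatadogAgent resource; the resource name `datadog` is a placeholder, and the `enabled: true` line is an assumption, since only the `oomKill:` key is visible in the diff context above. Per the author's explanation, the only goal here is to make the operator deploy the system-probe container in every Agent pod, so that GPU-specific settings can then be applied per node through a DatadogAgentProfile.

```yaml
# Sketch only: a minimal DatadogAgent manifest for a mixed (GPU and non-GPU) cluster.
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog # placeholder name, not taken from the PR
spec:
  features:
    # Only enable this feature if nothing else already requires the
    # system-probe container in all Agent pods.
    oomKill:
      enabled: true # assumed; the reviewed diff shows only the oomKill key
```

GPU monitoring itself is deliberately not enabled in this manifest: as noted in the thread, in a mixed environment it cannot be turned on for every node.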

val06 (Contributor) left a comment

reviewed

@temporal-github-worker-1 temporal-github-worker-1 bot dismissed janine-c’s stale review June 23, 2025 10:59

Review from janine-c is dismissed. Related teams and files:

  • documentation
    • gpu/README.md
gjulianm and others added 3 commits June 23, 2025 15:11
Co-authored-by: Janine Chan (on each of the 3 commits)
janine-c (Contributor) left a comment

Looks great! I have one very minor suggestion for flow, but otherwise am a fan 🙂

- For **mixed environments**, use the [DatadogAgentProfiles feature](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md) of the operator, which allows different configurations to be deployed for different nodes. In this case, it is not necessary to modify the DatadogAgent manifest. Instead, create a profile that enables the configuration on GPU nodes only:
+ For **mixed environments**, use the [DatadogAgentProfiles (DAP) feature](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md) of the operator, which allows different configurations to be deployed for different nodes. Note that this feature is disabled by default, so it needs to be enabled. For more information, see [Enabling DatadogAgentProfiles](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md#enabling-datadogagentprofiles).
+
+ Modifying the DatadogAgent manifest is necessary to enable certain features that are not supported by the DAP yet:
Contributor

Suggested change
- Modifying the DatadogAgent manifest is necessary to enable certain features that are not supported by the DAP yet:
+ Additionally, modify the DatadogAgent manifest to enable certain features that are not supported by the DAP yet:

☝🏻 Suggested wording to try to make these sections flow better together. Also, I noticed that we use both "the DatadogAgent manifest" and "the DatadogAgent configuration". Do they refer to the same thing? If so, it probably makes sense to use consistent terminology instead.

3 participants