-
Notifications
You must be signed in to change notification settings - Fork 25
feat: add MPS GPU sharing settings model #107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
Add NvidiaMpsSettings struct and Mps variant to NvidiaDeviceSharingStrategy for NVIDIA Multi-Process Service support. Includes validation to prevent MPS and MIG from being enabled simultaneously. Signed-off-by: Matthew Yeazel <[email protected]>
piyush-jena
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM apart from the nitpicky request for an additional unit test. Nice work coding the conflict between MIG and MPS.
| let test_json = r#"{"nvidia":{"pass-device-specs":true,"device-id-strategy":"index","device-list-strategy":"volume-mounts","device-sharing-strategy":"mps","mps":{"replicas":4},"device-partitioning-strategy":"none"}}"#; | ||
|
|
||
| let device_plugins: KubeletDevicePluginsV1 = serde_json::from_str(test_json).unwrap(); | ||
| assert_eq!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add an assert for replicas? and an unit test for the default number of replicas = 2?
| // Define the bounds for the `mps.replicas` field | ||
| const MPS_REPLICAS_MIN: i32 = 2; | ||
| const MPS_REPLICAS_MAX: i32 = i32::MAX; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are actual bounds? It's not possible to set only one replica, and the software is fine with 2 billion replicas?
| fn validate(value: Self, _validated_settings: Option<serde_json::Value>) -> Result<()> { | ||
| // Validate MPS and MIG are not both enabled | ||
| if let Some(ref nvidia) = value.nvidia { | ||
| let is_mps = matches!( | ||
| nvidia.device_sharing_strategy, | ||
| Some(NvidiaDeviceSharingStrategy::Mps) | ||
| ); | ||
| let is_mig = matches!( | ||
| nvidia.device_partitioning_strategy, | ||
| Some(NvidiaDevicePartitioningStrategy::MIG) | ||
| ); | ||
| if is_mps && is_mig { | ||
| return Err(KubeletDevicePluginsError::MpsMigConflict); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this validation logic actually work? AFAIK it's the first time we're attempting this sort of cross-setting validation.
Issue #: bottlerocket-os/bottlerocket#4673
Description of changes:
Add NvidiaMpsSettings struct and Mps variant to NvidiaDeviceSharingStrategy for NVIDIA Multi-Process Service support. Includes validation to prevent MPS and MIG from being enabled simultaneously.
Testing:
The testing was documented in the related PR: bottlerocket-os/bottlerocket-core-kit#789
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.