Improve AutoTuner cluster configuration recommendations for GPU runs #1501
Fixes #1121
Fixes #1334
Summary
Introduces a new strategy-based approach for recommending GPU cluster configurations (number of executors, core count, and, for CSPs only, number of worker nodes).
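As a rough illustration of the strategy-based approach, here is a minimal sketch. Only the three class names come from this PR; `ClusterRecommendation`, `totalCores`, `recommendedCoresPerExecutor`, `recommend`, the constructor parameters, and the sizing formula are assumptions made for the example and may not match the actual implementation.

```scala
// Illustrative sketch only: member names, parameters, and formulas are
// assumptions, not the code added in this PR.
final case class ClusterRecommendation(
    numExecutors: Int,
    coresPerExecutor: Int,
    numWorkerNodes: Option[Int]) // populated only for CSP platforms

trait ClusterConfigurationStrategy {
  // Total CPU cores of the source cluster, as seen by the concrete strategy.
  protected def totalCores: Int

  // Hypothetical default for cores per executor on the GPU cluster.
  protected def recommendedCoresPerExecutor: Int = 16

  // Assumes one executor per GPU, so gpusPerNode executors fit on one worker.
  def recommend(gpusPerNode: Int, isCsp: Boolean): ClusterRecommendation = {
    val cores = recommendedCoresPerExecutor
    val numExecutors = math.max(1, math.ceil(totalCores.toDouble / cores).toInt)
    val numWorkerNodes =
      if (isCsp) Some(math.ceil(numExecutors.toDouble / gpusPerNode).toInt)
      else None
    ClusterRecommendation(numExecutors, cores, numWorkerNodes)
  }
}

// Derives the source cluster size from user-supplied cluster properties.
final class ClusterPropertyBasedStrategy(numWorkers: Int, coresPerWorker: Int)
    extends ClusterConfigurationStrategy {
  protected def totalCores: Int = numWorkers * coresPerWorker
}

// Derives the source cluster size from values observed in the event log.
final class EventLogBasedStrategy(numExecutors: Int, coresPerExecutor: Int)
    extends ClusterConfigurationStrategy {
  protected def totalCores: Int = numExecutors * coresPerExecutor
}
```

Under these assumptions, `new ClusterPropertyBasedStrategy(4, 32).recommend(gpusPerNode = 2, isCsp = true)` yields 8 executors with 16 cores each on 4 worker nodes. The point of the design is that the platform code only depends on the trait, while each strategy decides where the source cluster shape comes from.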
Key Changes
Reasoning
Code Changes
Cluster configuration strategy:
core/src/main/scala/com/nvidia/spark/rapids/tool/ClusterConfigurationStrategy.scala: Added new classes ClusterConfigurationStrategy, ClusterPropertyBasedStrategy, and EventLogBasedStrategy to encapsulate different cluster configuration strategies.
core/src/main/scala/com/nvidia/spark/rapids/tool/Platform.scala: Refactored the Platform class to use the new strategy classes for recommending cluster configurations and removed redundant methods. [1] [2] [3]

Enhancements to instance information handling:
core/src/main/scala/com/nvidia/spark/rapids/tool/Platform.scala: Added getMemoryPerExec method to InstanceInfo and created a companion object for default instance creation.
core/src/main/scala/com/nvidia/spark/rapids/tool/Platform.scala: Updated DatabricksAwsPlatform to use getInstanceByResourcesMap for instance type mapping.

Default recommendations and properties:
core/src/main/scala/com/nvidia/spark/rapids/tool/Platform.scala: Added default recommendations for cores per executor and GPUs per node.
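
For the InstanceInfo change described above, a minimal sketch of what a per-executor memory helper and a default-instance factory could look like. The field names, the `createDefault` factory, the "N/A" placeholder name, and the even split of node memory across GPUs (one executor per GPU) are all assumptions for illustration; the PR's actual definitions may differ.

```scala
// Illustrative sketch only: fields, factory name, and formula are assumptions.
final case class InstanceInfo(cores: Int, memoryMB: Long, name: String, numGpus: Int) {
  // Assumes one executor per GPU, with node memory split evenly among them.
  // Guards against a zero GPU count to avoid division by zero.
  def getMemoryPerExec: Double = memoryMB.toDouble / math.max(1, numGpus)
}

object InstanceInfo {
  // Hypothetical factory for a placeholder instance when no named
  // instance type is known for the platform.
  def createDefault(cores: Int, memoryMB: Long, numGpus: Int): InstanceInfo =
    InstanceInfo(cores, memoryMB, "N/A", numGpus)
}
```

With these assumed definitions, `InstanceInfo.createDefault(16, 65536L, 2).getMemoryPerExec` evaluates to 32768.0 MB per executor.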