Improve AutoTuner cluster configuration recommendations for GPU runs #1501

Merged: 6 commits, Jan 22, 2025

parthosa
Collaborator

Fixes #1121
Fixes #1334

Summary

Introduces a strategy-based approach for recommending GPU cluster configurations: the number of executors, the cores per executor, and (for CSPs only) the number of worker nodes.

Key Changes

  • Standardized executor core count to 16 cores per executor for GPU runs.
  • Shifted from multi-GPU instances to smaller, single-GPU instances
  • Implemented two configuration strategies:
    1. Cluster Property Strategy: Generates recommendations based on user-specified cluster properties
    2. Event Log Strategy: Generates recommendations based on event logs
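
A minimal sketch of how two such strategies could sit behind one interface. This is not the code from this PR (the core module is Scala); every class, field, and function name here is hypothetical. Two assumptions are also mine, not the PR's: total core count is preserved when re-splitting into 16-core executors, and user-specified properties take precedence over event-log data.

```python
from dataclasses import dataclass
from typing import Optional, Protocol

# Value taken from the PR summary: 16 cores per executor for GPU runs.
GPU_CORES_PER_EXECUTOR = 16


@dataclass(frozen=True)
class ClusterRecommendation:
    num_executors: int
    cores_per_executor: int


class ClusterConfigStrategy(Protocol):
    def recommend(self) -> ClusterRecommendation: ...


def _split_into_executors(total_cores: int) -> ClusterRecommendation:
    # Assumption: keep the total core count and re-split it into 16-core executors.
    executors = max(1, total_cores // GPU_CORES_PER_EXECUTOR)
    return ClusterRecommendation(executors, GPU_CORES_PER_EXECUTOR)


@dataclass(frozen=True)
class ClusterPropertyStrategy:
    """Hypothetical: start from cluster properties supplied by the user."""
    total_cores: int

    def recommend(self) -> ClusterRecommendation:
        return _split_into_executors(self.total_cores)


@dataclass(frozen=True)
class EventLogStrategy:
    """Hypothetical: start from the executor layout recorded in the event log."""
    logged_executors: int
    logged_cores_per_executor: int

    def recommend(self) -> ClusterRecommendation:
        return _split_into_executors(
            self.logged_executors * self.logged_cores_per_executor
        )


def pick_strategy(
    total_cores: Optional[int],
    logged_executors: Optional[int],
    logged_cores_per_executor: Optional[int],
) -> Optional[ClusterConfigStrategy]:
    # Assumption: user-specified properties win over event-log data.
    if total_cores is not None:
        return ClusterPropertyStrategy(total_cores)
    if logged_executors is not None and logged_cores_per_executor is not None:
        return EventLogStrategy(logged_executors, logged_cores_per_executor)
    return None  # nothing to base a recommendation on
```

For example, `pick_strategy(None, 4, 64)` selects the event-log strategy, and under the core-preserving assumption the 4 x 64-core layout becomes 16 executors of 16 cores each.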

Reasoning

  • Analysis of NDS performance metrics shows 16 cores/executor provides optimal performance
  • Configurations with 4 or 64 cores/executor showed reduced performance and cost efficiency
  • Smaller instances with distributed GPUs improve:
    • System fault tolerance
    • I/O performance (both disk and network)
    • Resource utilization
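
The fault-tolerance point can be illustrated with simple arithmetic. The sketch below is not from the PR; the function names are hypothetical, and it assumes one executor per GPU. It compares how many worker nodes a given executor count needs, and what share of executors disappears when a single node fails, for single-GPU versus multi-GPU instances. It does not model I/O or utilization.

```python
import math


def nodes_needed(num_executors: int, gpus_per_node: int) -> int:
    # Assumption: one executor per GPU, so each node hosts gpus_per_node executors.
    return math.ceil(num_executors / gpus_per_node)


def share_lost_on_node_failure(num_executors: int, gpus_per_node: int) -> float:
    # Fraction of executors lost if one fully packed node goes down.
    return min(gpus_per_node, num_executors) / num_executors


if __name__ == "__main__":
    for gpus in (1, 4):
        print(
            f"{gpus} GPU(s)/node:",
            nodes_needed(8, gpus), "nodes,",
            f"{share_lost_on_node_failure(8, gpus):.1%} of executors lost per node failure",
        )
```

With 8 executors, single-GPU nodes spread the work across 8 machines, so one failure removes one eighth of the executors; 4-GPU nodes need only 2 machines, but one failure removes half.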

Code Changes

Cluster configuration strategy:

Enhancements to instance information handling:

Default recommendations and properties:

Signed-off-by: Partho Sarthi <[email protected]>
@parthosa added the labels bug (Something isn't working) and core_tools (Scope the core module (scala)) Jan 13, 2025
@parthosa requested a review from tgravescs January 13, 2025 22:28
@parthosa self-assigned this Jan 13, 2025
parthosa (Collaborator, Author) commented Jan 13, 2025

WIP: AutoTuner tests need to be updated with the new logic.

@parthosa marked this pull request as ready for review January 15, 2025 01:01
amahussein (Collaborator) left a comment

LGTME
Thanks @parthosa

amahussein (Collaborator) left a comment

Thanks @parthosa
LGTME

@parthosa merged commit 4a32581 into NVIDIA:dev Jan 22, 2025
13 checks passed
@parthosa deleted the spark-rapids-tools-1121 branch January 22, 2025 17:26