fix updating nodes states #31

Merged 4 commits on May 29, 2024

Conversation

jeffnvidia
Contributor

@jeffnvidia jeffnvidia commented May 22, 2024

Summary

Based on main
This PR fixes a bug: CloudAI wasn't able to correctly read the state of the nodes (IDLE, ALLOCATED, ...).
Fixed the parse_sinfo_output function.
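
For readers unfamiliar with this kind of parsing, here is a minimal sketch of how node states could be read from `sinfo` text output. It is not the code from this PR: the function names `split_top_level`, `expand_nodelist`, and `parse_sinfo_text` are hypothetical, and it assumes the default six-column `sinfo` layout (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST) with bracketed node ranges such as `node[01-03]`.

```python
import re
from typing import Dict, List


def split_top_level(nodelist: str) -> List[str]:
    """Split on commas that are not inside square brackets."""
    parts: List[str] = []
    depth = 0
    current = ""
    for ch in nodelist:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def expand_nodelist(nodelist: str) -> List[str]:
    """Expand e.g. 'node[01-03,07],login1' into individual node names."""
    names: List[str] = []
    for group in split_top_level(nodelist):
        match = re.match(r"^(.*?)\[(.*)\](.*)$", group)
        if match is None:
            names.append(group)  # plain name, nothing to expand
            continue
        prefix, inner, suffix = match.groups()
        for item in inner.split(","):
            if "-" in item:
                lo, hi = item.split("-", 1)
                width = len(lo)  # keep zero padding, e.g. 01 -> 02
                for i in range(int(lo), int(hi) + 1):
                    names.append(f"{prefix}{str(i).zfill(width)}{suffix}")
            else:
                names.append(f"{prefix}{item}{suffix}")
    return names


def parse_sinfo_text(text: str) -> Dict[str, str]:
    """Map each node name to its state (assumed six-column sinfo layout)."""
    states: Dict[str, str] = {}
    for line in text.splitlines():
        cols = line.split()
        if len(cols) < 6 or cols[0] == "PARTITION":
            continue  # skip the header and malformed lines
        # Drop trailing flag characters such as '*' from the state column.
        state = re.sub(r"[^a-z]+$", "", cols[4].lower())
        for node in expand_nodelist(cols[5]):
            states[node] = state
    return states


SAMPLE = """PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up infinite 2 idle node[01-02]
batch* up infinite 1 alloc node03
batch* up infinite 2 drain* node[07,09]
"""

if __name__ == "__main__":
    for node, state in parse_sinfo_text(SAMPLE).items():
        print(node, state)
```

The sketch keeps the state strings as `sinfo` prints them (for example `alloc`); mapping them to an enum such as IDLE or ALLOCATED would be a separate step.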

Test Plan

  1. Test by @amaslenn
    CI
  2. Test by @jeffnvidia
    2.1 Slurm command generation

$ cloudaix --mode run --system_config_path conf/v0.6/general/system/israel_1.toml --test_scenario_path conf/v0.6/general/test_scenario/llama/llama.toml

Additional Notes

Member

@TaekyungHeo TaekyungHeo left a comment

Could you please update
https://github.com/NVIDIA/cloudai/blob/main/tests/test_slurm_system.py
as well, so that we can understand which cases were not covered and are now covered?

@TaekyungHeo
Member

Please fix failing CI pipelines as well.

@jeffnvidia jeffnvidia force-pushed the fix_node_fetch branch 3 times, most recently from df011de to ffbaea4 Compare May 23, 2024 15:10
@jeffnvidia jeffnvidia force-pushed the fix_node_fetch branch 9 times, most recently from 6b5912d to 744d6c8 Compare May 27, 2024 11:39
@jeffnvidia jeffnvidia force-pushed the fix_node_fetch branch 2 times, most recently from 3495f69 to 8415b33 Compare May 28, 2024 08:15
@TaekyungHeo TaekyungHeo merged commit f2261f1 into NVIDIA:main May 29, 2024
2 checks passed
@jeffnvidia jeffnvidia deleted the fix_node_fetch branch July 30, 2024 14:35
4 participants