Fix Bugs in K8s System Class Introduced During the Adoption of Pydantic #196

Merged: 14 commits, Sep 19, 2024

Conversation

TaekyungHeo (Member) commented Sep 17, 2024

Summary

  • Pydantic models do not call __post_init__ (it is a dataclass hook), so member variables set there were never initialized. Replaced __post_init__ with __init__.
  • Added monitor_interval.
  • Added docstrings for attributes to ensure completeness.
  • Fixed a bug when loading Kubernetes configurations.
  • Updated unit tests accordingly.
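
The first point can be illustrated with a minimal sketch. It assumes the K8s system class derives from pydantic.BaseModel; the class names, the `name` field, and the `_client` attribute below are hypothetical and are not taken from the CloudAI code.

```python
from typing import Any, Optional

from pydantic import BaseModel, PrivateAttr


class BrokenSystem(BaseModel):
    """Hypothetical model that relies on a hook BaseModel never calls."""

    name: str
    _client: Optional[str] = PrivateAttr(default=None)

    def __post_init__(self) -> None:
        # BaseModel does not invoke this method, so _client keeps its default.
        self._client = f"client-for-{self.name}"


class FixedSystem(BaseModel):
    """Hypothetical model that initializes the attribute in __init__ instead."""

    name: str
    _client: Optional[str] = PrivateAttr(default=None)

    def __init__(self, **data: Any) -> None:
        # Run Pydantic validation first, then set the derived attribute.
        super().__init__(**data)
        self._client = f"client-for-{self.name}"


broken = BrokenSystem(name="k8s")
fixed = FixedSystem(name="k8s")
print(broken._client, fixed._client)
```

With this sketch, `broken._client` is left at its default while `fixed._client` is populated. Pydantic v2 also provides a `model_post_init` hook for the same purpose; this PR uses `__init__`.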

Test Plan

  1. CI passes
  2. Tested on K8s
$ python cloudaix.py --mode run --system-config ../cloudai/conf/common/system/kubernetes_cluster.toml --test-templates-dir conf/staging/kubernetes/test_template/ --tests-dir conf/staging/kubernetes/test --test-scenario conf/staging/kubernetes/test_scenario/nccl_test.toml
...
[INFO] Logs for pod 'tests-1-launcher-f2wfl' saved to 'results/nccl-test_2024-09-17_14-13-20/Tests.1/0/tests-1-launcher-f2wfl.txt'
[INFO] Logs for pod 'tests-1-worker-0' saved to 'results/nccl-test_2024-09-17_14-13-20/Tests.1/0/tests-1-worker-0.txt'
[INFO] Logs for pod 'tests-1-worker-1' saved to 'results/nccl-test_2024-09-17_14-13-20/Tests.1/0/tests-1-worker-1.txt'
[INFO] All logs concatenated and saved to 'results/nccl-test_2024-09-17_14-13-20/Tests.1/0/stdout.txt'
[INFO] Job completed: Tests.1
[INFO] All test scenario results stored at: results/nccl-test_2024-09-17_14-13-20

TaekyungHeo added the bug (Something isn't working) and Oct24 (Oct'24 release feature) labels Sep 17, 2024
TaekyungHeo force-pushed the k8s-bug-fix branch 3 times, most recently from 506380d to 30e31b6, September 17, 2024 14:19
TaekyungHeo marked this pull request as ready for review September 17, 2024 14:20
TaekyungHeo marked this pull request as draft September 17, 2024 14:21
TaekyungHeo marked this pull request as ready for review September 17, 2024 14:29
srivatsankrishnan (Contributor) left a comment

LGTM

Co-authored-by: Andrei Maslennikov <[email protected]>
tests/test_kubernetes_system.py (outdated, resolved)
tests/test_kubernetes_system.py (outdated, resolved)
amaslenn previously approved these changes Sep 19, 2024
tests/test_kubernetes_system.py (outdated, resolved)
Co-authored-by: Andrei Maslennikov <[email protected]>
TaekyungHeo merged commit f46d0de into NVIDIA:main Sep 19, 2024
2 checks passed
Labels
bug (Something isn't working), Oct24 (Oct'24 release feature)
3 participants