Add the Flyte agent to provision and manage K8s (data) service for deep learning (GNN) use cases #3004
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3004      +/-   ##
==========================================
- Coverage   83.47%   75.71%   -7.77%
==========================================
  Files         319      202     -117
  Lines       26427    21430    -4997
  Branches     2744     2760      +16
==========================================
- Hits        22060    16225    -5835
- Misses       3591     4392     +801
- Partials      776      813      +37
This is amazing! I left some minor comments.
plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/agent.py
plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/agent.py
plugins/flytekit-k8sdataservice/k8s_ops/k8s-service-agent-rolebinding.yaml
plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/k8s/manager.py
plugins/flytekit-k8sdataservice/tests/k8sdataservice/test_agent.py
plugins/flytekit-k8sdataservice/tests/k8sdataservice/test_agent.py
plugins/flytekit-k8sdataservice/k8s_ops/k8s-service-agent-role.yaml
plugins/flytekit-k8sdataservice/k8s_ops/k8s-service-agent-rolebinding.yaml
plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/__init__.py
Code Review Agent Run #a24be6
Actionable Suggestions - 16
Additional Suggestions - 10
Changelist by Bito
This pull request implements the following key changes.
plugins/flytekit-k8sdataservice/tests/k8sdataservice/test_agent.py
tests/flytekit/integration/remote/workflows/basic/attr_access_sd.py
Signed-off-by: Shuying Liang <[email protected]>
* Fix pydantic default input
  Signed-off-by: Future-Outlier <[email protected]>
* add pydantic integration test
  Signed-off-by: Future-Outlier <[email protected]>
* Use duck typing by Thomas's advice
  Signed-off-by: Future-Outlier <[email protected]>
  Co-authored-by: Thomas J. Fan <[email protected]>
* lint
  Signed-off-by: Future-Outlier <[email protected]>
---------
Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
* fix: Open FlyteFile from remote path
  Signed-off-by: JiaWei Jiang <[email protected]>
* Add integration test
  Signed-off-by: JiaWei Jiang <[email protected]>
* refactor: Use ctx as param instead of recreation
  Signed-off-by: JiaWei Jiang <[email protected]>
* refactor: Clean test logic
  1. Remove redundant prints
  2. Use `mock.patch.dict` to setup `os.environ` for the current test fn
     * Avoid contaminating other tests running in the same process
  Signed-off-by: JiaWei Jiang <[email protected]>
* refactor: Setup local path and downloader in constructor
  Signed-off-by: JiaWei Jiang <[email protected]>
* refactor: Move SimpleFileTransfer to an utility file
  Signed-off-by: JiaWei Jiang <[email protected]>
* Remove redundant env var setup
  Please refer to flyteorg#3001
  Signed-off-by: JiaWei Jiang <[email protected]>
* test: Add another ff use case
  Create ff in one task pod and read it in another task pod.
  Signed-off-by: JiaWei Jiang <[email protected]>
---------
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
* test: Add integration test for attr access of sd
  Signed-off-by: JiaWei Jiang <[email protected]>
* Correct file path
  Signed-off-by: JiaWei Jiang <[email protected]>
* test: Support interaction with minio s3 bucket
  1. Upload a local parquet file to minio s3 bucket
  2. Access StructuredDataset attr from a dataclass
  3. Open StructuredDataset from a remote path
  Signed-off-by: JiaWei Jiang <[email protected]>
* Delete an unmerged integration test
  Signed-off-by: JiaWei Jiang <[email protected]>
* Try imagespec with commit sha of corresponding fix
  Signed-off-by: JiaWei Jiang <[email protected]>
* Remove redundant test
  Signed-off-by: JiaWei Jiang <[email protected]>
* Remove default_factory and create sd dc from input uri
  Signed-off-by: JiaWei Jiang <[email protected]>
* refactor: Clean test logic
  1. Remove redundant prints
  2. Use `mock.patch.dict` to setup `os.environ` for the current test fn
     * Avoid contaminating other tests running in the same process
  Signed-off-by: JiaWei Jiang <[email protected]>
* Remove redundant minio env var setup and add test comments
  Signed-off-by: JiaWei Jiang <[email protected]>
* Support uploading tmp pqt file
  Signed-off-by: JiaWei Jiang <[email protected]>
* Update deprecated module
  Signed-off-by: JiaWei Jiang <[email protected]>
* Remove redundant and unused imports
  Signed-off-by: JiaWei Jiang <[email protected]>
---------
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
…rg#3043) Signed-off-by: Yee Hing Tong <[email protected]> Signed-off-by: Shuying Liang <[email protected]>
* make _downloader function in FlyteFile/Directory pickleable
  Signed-off-by: Niels Bantilan <[email protected]>
* make FlyteFile and Directory pickleable
  Signed-off-by: Niels Bantilan <[email protected]>
* remove unnecessary helper functions
  Signed-off-by: Niels Bantilan <[email protected]>
* fix lint
  Signed-off-by: Niels Bantilan <[email protected]>
* use partials instead of lambda
  Signed-off-by: Niels Bantilan <[email protected]>
* fix lint
  Signed-off-by: Niels Bantilan <[email protected]>
* remove unneeded helper function
  Signed-off-by: Niels Bantilan <[email protected]>
* update FlyteFilePathTransformer.downloader method
  Signed-off-by: Niels Bantilan <[email protected]>
* remove downloader staticmethod
  Signed-off-by: Niels Bantilan <[email protected]>
* fix lint
  Signed-off-by: Niels Bantilan <[email protected]>
---------
Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
Code Review Agent Run #62dbf3
Actionable Suggestions - 5
Additional Suggestions - 2
plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/k8s/manager.py
Code Review Agent Run #839c30
Actionable Suggestions - 0
Why are the changes needed?
Graph Neural Networks are critical for understanding complex relationships across LinkedIn's professional networks. However, training these models at scale involves intricate data loading, sampling, and processing across multiple nodes and GPUs. The missing piece is the infrastructure to support how and where to run these Kubernetes data services, making them scalable and reliable along with the training or inference processes.
To simplify this complex orchestration pipeline, we decided to leverage the Flyte agent framework to provision and manage the data services for GNN use cases.
What changes were proposed in this pull request?
This PR adds a Flyte agent that creates, updates, and deletes the Kubernetes StatefulSet and Service backing the data service.
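As a rough illustration of the create/update/delete lifecycle described above, the following sketch builds a headless Service and a StatefulSet as plain Python dicts and applies them to an in-memory stand-in for a cluster. It is not the plugin's code: `build_manifests`, `FakeCluster`, the resource name, the image, and the port are all hypothetical, and a real agent would talk to the Kubernetes API instead of a dict.

```python
from typing import Any, Dict, List, Tuple


def build_manifests(name: str, image: str, replicas: int, port: int) -> List[Dict[str, Any]]:
    """Hypothetical helper: return a headless Service and a StatefulSet manifest."""
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "clusterIP": "None",  # headless Service, gives each pod a stable DNS name
            "selector": {"app": name},
            "ports": [{"name": "data", "port": port}],
        },
    }
    stateful_set = {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "serviceName": name,  # must reference the governing Service
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [
                        {"name": name, "image": image, "ports": [{"containerPort": port}]}
                    ]
                },
            },
        },
    }
    return [service, stateful_set]


class FakeCluster:
    """In-memory stand-in for a cluster (assumption, for illustration only)."""

    def __init__(self) -> None:
        self.objects: Dict[Tuple[str, str], Dict[str, Any]] = {}

    def apply(self, manifest: Dict[str, Any]) -> str:
        """Create the object if it is new, otherwise replace it."""
        key = (manifest["kind"], manifest["metadata"]["name"])
        action = "updated" if key in self.objects else "created"
        self.objects[key] = manifest
        return action

    def delete(self, kind: str, name: str) -> bool:
        """Remove the object; return False if it did not exist."""
        return self.objects.pop((kind, name), None) is not None
```

Applying the same manifests twice exercises the update path, and deleting both kinds by name mirrors the cleanup step.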
How was this patch tested?
MPIJobs (for deep learning GNN training) or TFJob (for offline inference)
Setup process
pip install flytekitplugins-k8sdataservice
Screenshots
Check all the applicable boxes
Docs link
Blog from Flyte community sync
Summary by Bito
This PR introduces a new Flyte plugin for Kubernetes data services, focusing on GNN training workloads with DataServiceAgent, K8sManager, and CleanupSensor components. The implementation includes test suite improvements for Python version compatibility, replacing asyncio.TaskGroup with asyncio.wait_for in sensor tests, and cleanup of unused mock decorators while maintaining functionality.
Unit tests added: True
Estimated effort to review (1-5, lower is better): 5
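The summary mentions replacing asyncio.TaskGroup with asyncio.wait_for for Python version compatibility. asyncio.TaskGroup only exists in Python 3.11 and later, while asyncio.wait_for is available on older versions. The sketch below shows the general pattern of bounding a polling coroutine with a timeout; it is not the PR's sensor test, and `poll_until_deleted`, `run_with_timeout`, and the `exists` callable are hypothetical names.

```python
import asyncio
from typing import Callable


async def poll_until_deleted(exists: Callable[[], bool], interval: float = 0.01) -> bool:
    """Poll a hypothetical `exists()` check until the resource is reported gone."""
    while exists():
        await asyncio.sleep(interval)
    return True


async def run_with_timeout(exists: Callable[[], bool], timeout: float) -> bool:
    """Bound the polling loop with asyncio.wait_for instead of a TaskGroup."""
    try:
        return await asyncio.wait_for(poll_until_deleted(exists), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the inner coroutine before raising on timeout.
        return False
```

Catching `asyncio.TimeoutError` works on both older and newer interpreters, since on Python 3.11 and later it is an alias of the built-in `TimeoutError`.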