Add the Flyte agent to provision and manage K8s (data) service for deep learning (GNN) use cases #3004

Merged
merged 20 commits into flyteorg:master from the shuliang/k8sdataservice branch
Jan 23, 2025

Conversation

shuyingliang
Contributor

@shuyingliang shuyingliang commented Dec 14, 2024

Why are the changes needed?

Graph Neural Networks are critical for understanding complex relationships across LinkedIn's professional networks. However, training these models at scale involves intricate data loading, sampling, and processing across multiple nodes and GPUs. The missing piece is the infrastructure to support how and where to run these Kubernetes data services, making them scalable and reliable along with the training or inference processes.

To simplify this complex orchestration pipeline, we decided to leverage the Flyte agent framework to provision and manage the data services for GNN use cases.

What changes were proposed in this pull request?

This PR adds a Flyte agent that creates, updates, and deletes the K8s StatefulSet and Service.
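For readers unfamiliar with the two Kubernetes resources involved, here is a minimal sketch of how a StatefulSet and its governing Service relate. It is not the plugin's code: `build_manifests`, its parameters, and the choice of a headless Service (`clusterIP: None`) are assumptions made for illustration, and the manifests are plain dictionaries rather than Kubernetes client objects.

```python
from typing import Any, Dict


def build_manifests(name: str, image: str, replicas: int, port: int) -> Dict[str, Dict[str, Any]]:
    """Return plain-dict manifests for a StatefulSet and a Service that fronts it.

    Hypothetical helper for illustration only; not part of the plugin.
    """
    labels = {"app": name}
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "clusterIP": "None",  # assumption: headless Service, one DNS record per pod
            "selector": labels,
            "ports": [{"port": port, "targetPort": port}],
        },
    }
    stateful_set = {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "serviceName": name,  # must match the Service name above
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": name, "image": image, "ports": [{"containerPort": port}]}
                    ]
                },
            },
        },
    }
    return {"service": service, "stateful_set": stateful_set}
```

The point of the sketch is the coupling: the StatefulSet's `serviceName` and pod labels have to line up with the Service's name and selector, which is why it is convenient for one component to create, update, and delete both together.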

How was this patch tested?

  • The same code (with company-internal environments and setup removed) has been running in production alongside MPIJob training jobs (for deep learning GNN training) and TFJob (for offline inference).
  • It was also tested in a local sandbox.

Setup process

pip install flytekitplugins-k8sdataservice

Screenshots

Screenshot 2024-11-11 at 3 48 18 PM

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Docs link

Blog from Flyte community sync

Summary by Bito

This PR introduces a new Flyte plugin for Kubernetes data services, focusing on GNN training workloads with DataServiceAgent, K8sManager, and CleanupSensor components. The implementation includes test suite improvements for Python version compatibility, replacing asyncio.TaskGroup with asyncio.wait_for in sensor tests, and cleanup of unused mock decorators while maintaining functionality.

Unit tests added: True

Estimated effort to review (1-5, lower is better): 5
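The summary above mentions replacing `asyncio.TaskGroup` with `asyncio.wait_for` for Python version compatibility; `asyncio.TaskGroup` only exists from Python 3.11 onward, while `asyncio.wait_for` is available on older versions. The sketch below shows one way a polling coroutine can be bounded with `asyncio.wait_for`. The names `poll_until_deleted` and `run_with_timeout` are hypothetical and are not taken from the plugin's sensor or its tests.

```python
import asyncio
from typing import Callable


async def poll_until_deleted(is_deleted: Callable[[], bool], interval: float = 0.01) -> str:
    """Poll a (hypothetical) status check until it reports the resource is gone."""
    while not is_deleted():
        await asyncio.sleep(interval)
    return "deleted"


async def run_with_timeout(is_deleted: Callable[[], bool], timeout: float) -> str:
    """Bound the polling loop with asyncio.wait_for instead of a TaskGroup."""
    try:
        return await asyncio.wait_for(poll_until_deleted(is_deleted), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the inner coroutine before raising.
        return "timed out"
```

A test can then call `asyncio.run(run_with_timeout(...))` with a stubbed status check and assert on the returned string, without depending on Python 3.11 features.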


codecov bot commented Dec 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.71%. Comparing base (dfa8f04) to head (a848dc5).
Report is 12 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (dfa8f04) and HEAD (a848dc5).

HEAD has 49 fewer uploads than BASE.

Flag | BASE (dfa8f04) | HEAD (a848dc5)
(no flag) | 53 | 4
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3004      +/-   ##
==========================================
- Coverage   83.47%   75.71%   -7.77%     
==========================================
  Files         319      202     -117     
  Lines       26427    21430    -4997     
  Branches     2744     2760      +16     
==========================================
- Hits        22060    16225    -5835     
- Misses       3591     4392     +801     
- Partials      776      813      +37     

☔ View full report in Codecov by Sentry.

Member

@pingsutw pingsutw left a comment

This is amazing!!! I left some minor comments.

@shuyingliang shuyingliang force-pushed the shuliang/k8sdataservice branch 6 times, most recently from a0c5d8e to ec6d4c1 Compare December 20, 2024 05:10
@shuyingliang shuyingliang force-pushed the shuliang/k8sdataservice branch from a2e628f to 43e2733 Compare January 11, 2025 03:44
@flyte-bot
Contributor

flyte-bot commented Jan 11, 2025

Code Review Agent Run #a24be6

Actionable Suggestions - 16
  • plugins/flytekit-k8sdataservice/tests/k8sdataservice/test_agent.py - 2
  • flytekit/image_spec/default_builder.py - 2
    • Consider adding uv.lock validation checks · Line 179-182
    • Consider validating pyproject.toml file existence · Line 216-217
  • flytekit/core/environment.py - 1
    • Consider splitting complex call method · Line 67-86
  • tests/flytekit/unit/core/test_environment.py - 1
    • Consider adding assertions to test case · Line 74-78
  • tests/flytekit/integration/remote/workflows/basic/attr_access_sd.py - 1
    • Consider adding error handling for StructuredDataset · Line 34-34
  • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/agent.py - 1
    • Unused parameters and missing annotations · Line 28-30
  • plugins/flytekit-k8sdataservice/utils/infra.py - 1
    • Insecure hash function usage · Line 7-7
  • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/sensor.py - 1
    • Consider moving k8s client initialization · Line 27-35
  • plugins/flytekit-k8sdataservice/setup.py - 1
    • Consider pinning exact dependency versions · Line 7-7
  • flytekit/core/local_cache.py - 1
    • Consider adding error handling for serialization · Line 116-116
  • plugins/flytekit-k8sdataservice/utils/resources.py - 1
    • Consider exact key matching for mem · Line 23-23
  • plugins/flytekit-k8sdataservice/tests/k8sdataservice/k8s/test_manager.py - 1
    • Improve error handling test coverage · Line 41-46
  • flytekit/types/directory/types.py - 1
    • Consider using relative paths for portability · Line 371-373
  • flytekit/core/array_node.py - 1
    • Consider execution mode initialization removal impact · Line 64-64
Additional Suggestions - 10
  • flytekit/core/environment.py - 1
    • Consider refactoring duplicated task logic · Line 95-115
  • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/k8s/manager.py - 1
    • Improve stateful set error handling · Line 47-48
  • flytekit/image_spec/default_builder.py - 1
  • flytekit/core/local_cache.py - 1
    • Consider updating test case for consistency · Line 28-28
  • plugins/flytekit-k8sdataservice/tests/k8sdataservice/k8s/test_manager.py - 1
    • Consider adding edge cases to test · Line 91-96
  • tests/flytekit/unit/types/directory/test_listdir.py - 1
    • Consider using context manager for tempdir · Line 9-10
  • tests/flytekit/unit/core/image_spec/test_default_builder.py - 3
  • plugins/flytekit-k8sdataservice/utils/resources.py - 1
    • Consider extracting zero check utility function · Line 7-14
Review Details
  • Files reviewed - 63 · Commit Range: 158469e..43e2733
    • .pre-commit-config.yaml
    • Dockerfile.agent
    • dev-requirements.txt
    • docs/source/plugins/k8sstatefuldataservice.rst
    • flytekit/__init__.py
    • flytekit/clis/sdk_in_container/run.py
    • flytekit/core/array_node.py
    • flytekit/core/array_node_map_task.py
    • flytekit/core/environment.py
    • flytekit/core/local_cache.py
    • flytekit/core/workflow.py
    • flytekit/extend/backend/agent_service.py
    • flytekit/extend/backend/base_agent.py
    • flytekit/image_spec/default_builder.py
    • flytekit/models/core/workflow.py
    • flytekit/models/literals.py
    • flytekit/remote/remote.py
    • flytekit/tools/translator.py
    • flytekit/types/directory/types.py
    • flytekit/types/file/file.py
    • plugins/flytekit-airflow/setup.py
    • plugins/flytekit-inference/flytekitplugins/inference/__init__.py
    • plugins/flytekit-inference/flytekitplugins/inference/vllm/serve.py
    • plugins/flytekit-inference/setup.py
    • plugins/flytekit-inference/tests/test_vllm.py
    • plugins/flytekit-k8sdataservice/dev-requirements.txt
    • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/__init__.py
    • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/agent.py
    • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/k8s/kube_config.py
    • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/k8s/manager.py
    • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/sensor.py
    • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/task.py
    • plugins/flytekit-k8sdataservice/setup.py
    • plugins/flytekit-k8sdataservice/tests/k8sdataservice/k8s/test_kube_config.py
    • plugins/flytekit-k8sdataservice/tests/k8sdataservice/k8s/test_manager.py
    • plugins/flytekit-k8sdataservice/tests/k8sdataservice/test_agent.py
    • plugins/flytekit-k8sdataservice/tests/k8sdataservice/test_sensor.py
    • plugins/flytekit-k8sdataservice/tests/k8sdataservice/test_task.py
    • plugins/flytekit-k8sdataservice/tests/k8sdataservice/utils/test_resources.py
    • plugins/flytekit-k8sdataservice/utils/infra.py
    • plugins/flytekit-k8sdataservice/utils/resources.py
    • plugins/flytekit-onnx-pytorch/dev-requirements.txt
    • plugins/flytekit-optuna/flytekitplugins/optuna/__init__.py
    • plugins/flytekit-optuna/flytekitplugins/optuna/optimizer.py
    • plugins/flytekit-optuna/setup.py
    • plugins/flytekit-optuna/tests/test_optimizer.py
    • plugins/flytekit-spark/tests/test_environment.py
    • plugins/setup.py
    • pyproject.toml
    • tests/flytekit/integration/remote/test_remote.py
    • tests/flytekit/integration/remote/utils.py
    • tests/flytekit/integration/remote/workflows/basic/attr_access_sd.py
    • tests/flytekit/integration/remote/workflows/basic/flytefile.py
    • tests/flytekit/integration/remote/workflows/basic/pydantic_wf.py
    • tests/flytekit/unit/core/image_spec/test_default_builder.py
    • tests/flytekit/unit/core/test_environment.py
    • tests/flytekit/unit/core/test_flyte_directory.py
    • tests/flytekit/unit/core/test_flyte_file.py
    • tests/flytekit/unit/core/test_generice_idl_type_engine.py
    • tests/flytekit/unit/core/test_local_cache.py
    • tests/flytekit/unit/core/test_type_engine.py
    • tests/flytekit/unit/core/test_workflows.py
    • tests/flytekit/unit/types/directory/test_listdir.py
  • Files skipped - 4
    • .github/workflows/pythonbuild.yml - Reason: Filter setting
    • plugins/flytekit-inference/README.md - Reason: Filter setting
    • plugins/flytekit-k8sdataservice/README.md - Reason: Filter setting
    • plugins/flytekit-optuna/README.md - Reason: Filter setting
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by Bito

@flyte-bot
Contributor

flyte-bot commented Jan 11, 2025

Changelist by Bito

This pull request implements the following key changes.

Key Change Files Impacted
New Feature - K8s Data Service Plugin Implementation

agent.py - Implements DataServiceAgent for managing K8s data service lifecycle

manager.py - Implements K8sManager for handling K8s resource operations

task.py - Defines DataServiceTask and configuration structures

sensor.py - Implements CleanupSensor for resource cleanup operations

Testing - Unit Tests for K8s Data Service

test_manager.py - Comprehensive tests for K8sManager functionality

test_kube_config.py - Tests for Kubernetes configuration handling

Documentation - Plugin Documentation and Setup

k8sstatefuldataservice.rst - API documentation for K8s StatefulSet Data Service

setup.py - Plugin setup and dependency configuration

Other Improvements - Configuration Updates

.pre-commit-config.yaml - Updates pre-commit hooks and configurations

Dockerfile.agent - Adds K8s data service plugin to agent image

Testing - K8s Data Service Plugin Tests

test_agent.py - Adds comprehensive tests for DataServiceAgent functionality including creation, status checks, and cleanup

test_sensor.py - Implements tests for CleanupSensor with focus on resource cleanup operations

test_task.py - Tests DataServiceTask configuration and serialization

test_resources.py - Tests resource management utilities including cleanup and field conversion

New Feature - Infrastructure Utilities

infra.py - Adds infrastructure name generation utility

resources.py - Implements resource management utilities for K8s resources

Other Improvements - Configuration and Type Updates

setup.py - Adds k8sdataservice plugin to setup configuration

test_remote.py - Adds test for pydantic default input with map task

test_generice_idl_type_engine.py - Updates type annotations

test_type_engine.py - Updates type annotations

shuyingliang and others added 17 commits January 22, 2025 09:33
Signed-off-by: Shuying Liang <[email protected]>
* Fix pydantic default input

Signed-off-by: Future-Outlier <[email protected]>

* add pydantic integration test

Signed-off-by: Future-Outlier <[email protected]>

* Use duck typing per Thomas's advice

Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>

* lint

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
* fix: Open FlyteFile from remote path

Signed-off-by: JiaWei Jiang <[email protected]>

* Add integration test

Signed-off-by: JiaWei Jiang <[email protected]>

* refactor: Use ctx as param instead of recreation

Signed-off-by: JiaWei Jiang <[email protected]>

* refactor: Clean test logic

1. Remove redundant prints
2. Use `mock.patch.dict` to setup `os.environ` for the current test fn
    * Avoid contaminating other tests running in the same process

Signed-off-by: JiaWei Jiang <[email protected]>
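The `mock.patch.dict` approach named in the commit message above can be sketched as follows. This is an illustration under assumptions, not the test from the PR: `EXAMPLE_ENDPOINT`, `read_endpoint`, and `endpoint_seen_inside_patch` are hypothetical names.

```python
import os
from unittest import mock


def read_endpoint() -> str:
    """Code under test: reads a (hypothetical) environment variable."""
    return os.environ.get("EXAMPLE_ENDPOINT", "unset")


def endpoint_seen_inside_patch() -> str:
    """Set the variable only for the duration of the with-block."""
    with mock.patch.dict(os.environ, {"EXAMPLE_ENDPOINT": "http://localhost:9000"}):
        return read_endpoint()
    # On exit, patch.dict restores os.environ to its previous contents,
    # so other tests in the same process do not see the override.
```

Compared with assigning to `os.environ` directly, the patch is undone automatically even if the test body raises, which is what keeps the change from leaking into other tests.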

* refactor: Setup local path and downloader in constructor

Signed-off-by: JiaWei Jiang <[email protected]>

* refactor: Move SimpleFileTransfer to a utility file

Signed-off-by: JiaWei Jiang <[email protected]>

* Remove redundant env var setup

Please refer to flyteorg#3001

Signed-off-by: JiaWei Jiang <[email protected]>

* test: Add another ff use case

Create ff in one task pod and read it in another task pod.

Signed-off-by: JiaWei Jiang <[email protected]>

---------

Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
* test: Add integration test for attr access of sd

Signed-off-by: JiaWei Jiang <[email protected]>

* Correct file path

Signed-off-by: JiaWei Jiang <[email protected]>

* test: Support interaction with minio s3 bucket

1. Upload a local parquet file to minio s3 bucket
2. Access StructuredDataset attr from a dataclass
3. Open StructuredDataset from a remote path

Signed-off-by: JiaWei Jiang <[email protected]>

* Delete an unmerged integration test

Signed-off-by: JiaWei Jiang <[email protected]>

* Try imagespec with commit sha of corresponding fix

Signed-off-by: JiaWei Jiang <[email protected]>

* Remove redundant test

Signed-off-by: JiaWei Jiang <[email protected]>

* Remove default_factory and create sd dc from input uri

Signed-off-by: JiaWei Jiang <[email protected]>

* refactor: Clean test logic

1. Remove redundant prints
2. Use `mock.patch.dict` to setup `os.environ` for the current test fn
    * Avoid contaminating other tests running in the same process

Signed-off-by: JiaWei Jiang <[email protected]>

* Remove redundant minio env var setup and add test comments

Signed-off-by: JiaWei Jiang <[email protected]>

* Support uploading tmp pqt file

Signed-off-by: JiaWei Jiang <[email protected]>

* Update deprecated module

Signed-off-by: JiaWei Jiang <[email protected]>

* Remove redundant and unused imports

Signed-off-by: JiaWei Jiang <[email protected]>

---------

Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
* make _downloader function in FlyteFile/Directory pickleable

Signed-off-by: Niels Bantilan <[email protected]>

* make FlyteFile and Directory pickleable

Signed-off-by: Niels Bantilan <[email protected]>

* remove unnecessary helper functions

Signed-off-by: Niels Bantilan <[email protected]>

* fix lint

Signed-off-by: Niels Bantilan <[email protected]>

* use partials instead of lambda

Signed-off-by: Niels Bantilan <[email protected]>
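Why partials help with pickling can be shown with a small standalone sketch. It does not reproduce the FlyteFile downloader; `shutil.copyfile` and the file names stand in for whatever the real downloader calls, and `can_pickle` is a hypothetical helper.

```python
import pickle
import shutil
from functools import partial


def can_pickle(obj) -> bool:
    """Return True if pickle can serialize the object."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A lambda has no importable name, so pickle cannot serialize it by reference.
lambda_downloader = lambda: shutil.copyfile("remote.bin", "local.bin")

# A partial stores an importable function plus its arguments, both of which pickle.
partial_downloader = partial(shutil.copyfile, "remote.bin", "local.bin")
```

Here `can_pickle(partial_downloader)` succeeds while `can_pickle(lambda_downloader)` does not. The partial only pickles if the wrapped function is itself importable by name, so wrapping another lambda or a nested function would not help.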

* fix lint

Signed-off-by: Niels Bantilan <[email protected]>

* remove unneeded helper function

Signed-off-by: Niels Bantilan <[email protected]>

* update FlyteFilePathTransformer.downloader method

Signed-off-by: Niels Bantilan <[email protected]>

* remove downloader staticmethod

Signed-off-by: Niels Bantilan <[email protected]>

* fix lint

Signed-off-by: Niels Bantilan <[email protected]>

---------

Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
pingsutw previously approved these changes Jan 22, 2025
@flyte-bot
Contributor

flyte-bot commented Jan 22, 2025

Code Review Agent Run #62dbf3

Actionable Suggestions - 5
  • plugins/flytekit-k8sdataservice/utils/infra.py - 2
    • Consider longer hash for infra naming · Line 8-8
    • Consider longer hash for infra naming · Line 8-8
  • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/k8s/manager.py - 2
    • Consider adding error handling for resources · Line 81-81
    • Consider adding error handling for resources · Line 81-81
  • tests/flytekit/unit/core/test_generice_idl_type_engine.py - 1
Additional Suggestions - 2
  • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/k8s/manager.py - 1
    • Consider making namespace configurable · Line 26-26
  • plugins/flytekit-k8sdataservice/tests/k8sdataservice/utils/test_resources.py - 1
    • Consider more descriptive test method name · Line 78-78
Review Details
  • Files reviewed - 23 · Commit Range: 272631e..7ccdeeb
    • .pre-commit-config.yaml
    • Dockerfile.agent
    • docs/source/plugins/k8sstatefuldataservice.rst
    • plugins/flytekit-k8sdataservice/dev-requirements.txt
    • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/__init__.py
    • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/agent.py
    • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/k8s/kube_config.py
    • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/k8s/manager.py
    • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/sensor.py
    • plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/task.py
    • plugins/flytekit-k8sdataservice/setup.py
    • plugins/flytekit-k8sdataservice/tests/k8sdataservice/k8s/test_kube_config.py
    • plugins/flytekit-k8sdataservice/tests/k8sdataservice/k8s/test_manager.py
    • plugins/flytekit-k8sdataservice/tests/k8sdataservice/test_agent.py
    • plugins/flytekit-k8sdataservice/tests/k8sdataservice/test_sensor.py
    • plugins/flytekit-k8sdataservice/tests/k8sdataservice/test_task.py
    • plugins/flytekit-k8sdataservice/tests/k8sdataservice/utils/test_resources.py
    • plugins/flytekit-k8sdataservice/utils/infra.py
    • plugins/flytekit-k8sdataservice/utils/resources.py
    • plugins/setup.py
    • tests/flytekit/integration/remote/test_remote.py
    • tests/flytekit/unit/core/test_generice_idl_type_engine.py
    • tests/flytekit/unit/core/test_type_engine.py
  • Files skipped - 2
    • .github/workflows/pythonbuild.yml - Reason: Filter setting
    • plugins/flytekit-k8sdataservice/README.md - Reason: Filter setting
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful
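The infra.py suggestions in this review concern the hash used for infrastructure naming. The plugin's actual function is not shown on this page, so the following is only a sketch of the general idea: `infra_name`, its parameters, the use of SHA-256, and the default 12-character suffix are all assumptions. The 63-character cap reflects the Kubernetes limit on DNS label names.

```python
import hashlib


def infra_name(prefix: str, *parts: str, hash_len: int = 12) -> str:
    """Build a deterministic resource name from a prefix and identifying parts.

    Hypothetical helper: a longer hash_len lowers the chance that two
    different inputs collide on the same name.
    """
    digest = hashlib.sha256("/".join(parts).encode("utf-8")).hexdigest()[:hash_len]
    # Leave room for the "-" separator and the hash within 63 characters.
    max_prefix = 63 - hash_len - 1
    return f"{prefix[:max_prefix].rstrip('-')}-{digest}"
```

The same inputs always yield the same name, so repeated create or update calls target the same resource, while different executions get different suffixes.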


@flyte-bot
Contributor

flyte-bot commented Jan 22, 2025

Code Review Agent Run #839c30

Actionable Suggestions - 0
Review Details
  • Files reviewed - 2 · Commit Range: 7ccdeeb..37e6e56
    • plugins/flytekit-k8sdataservice/tests/k8sdataservice/test_sensor.py
    • plugins/flytekit-k8sdataservice/tests/k8sdataservice/test_task.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful


@pingsutw pingsutw merged commit 8a6bbd0 into flyteorg:master Jan 23, 2025
104 of 106 checks passed
7 participants