[feat] Integrate deployment configurations and fix autoscaler/gpu optimizer connectivity #492

Merged


zhangjyr
Collaborator

@zhangjyr zhangjyr commented Dec 6, 2024

Pull Request Description

This PR fixes the connectivity problem between the podautoscaler and the GPU optimizer by:

  1. Updating the k8s role definition to use a ClusterRole, so the GPU optimizer now monitors every deployment that carries the model label, in all namespaces.
  2. Including the changes from #480 ([WIP] Add GPU Optimizer deployment and update configurations). The deployment configurations are integrated into config/default.

Note: I kept the deployment.yaml under the GPU optimizer directory for debugging purposes.
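
To illustrate why the scope of the role matters, here is a minimal sketch in plain Python (no Kubernetes client involved). It only models the visibility difference between namespace-scoped and cluster-wide access. The `Deployment` class, the `watched_deployments` helper, and the label key in `MODEL_LABEL` are assumptions made for illustration and are not taken from this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumed label key, used only for illustration; the PR just says "model label".
MODEL_LABEL = "model.aibrix.ai/name"


@dataclass
class Deployment:
    """Hypothetical stand-in for a Kubernetes Deployment object."""
    name: str
    namespace: str
    labels: Dict[str, str] = field(default_factory=dict)


def watched_deployments(
    deployments: List[Deployment], namespace: Optional[str] = None
) -> List[Deployment]:
    """Return the deployments that carry the model label.

    namespace=None mimics cluster-wide visibility (ClusterRole);
    a namespace name mimics namespace-scoped visibility (Role).
    """
    return [
        d
        for d in deployments
        if MODEL_LABEL in d.labels
        and (namespace is None or d.namespace == namespace)
    ]


cluster = [
    Deployment("llama-a100", "default", {MODEL_LABEL: "llama"}),
    Deployment("llama-a40", "team-b", {MODEL_LABEL: "llama"}),
    Deployment("redis", "default"),  # no model label, never watched
]

scoped = watched_deployments(cluster, namespace="default")
cluster_wide = watched_deployments(cluster)
```

With namespace-scoped access only `llama-a100` is visible, while the cluster-wide call also returns `llama-a40` from the other namespace.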

Related Issues

Resolves: #484 #480 #459

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

nwangfw and others added 3 commits December 4, 2024 13:13
…_failed_to_fetch_metrics_from_MetricSource

# Conflicts:
#	development/simulator/deployment-a100.yaml
#	development/simulator/deployment-a40.yaml
@zhangjyr
Collaborator Author

zhangjyr commented Dec 6, 2024

I moved some comments from #480 here:
Q: If the server is down, how does the autoscaler behave? Have you tested this?
A: We haven't tested it. Ideally, if the GPU optimizer is down, the podautoscaler cannot read metrics and leaves the replica count unchanged. After the GPU optimizer resumes, it should not output valid metrics until it has reached a new solution based on sufficient load traces. Most of the changes this approach requires are easy to implement. However, we'll need to extend the load trace timeout in Redis (e.g., up to 300 seconds, which matches the current GPU optimizer window) so that the GPU optimizer can restore the solution quickly enough.
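
The intended (still untested) behavior described above could be sketched roughly as follows. This is not the aibrix implementation: `desired_replicas`, `OptimizerSketch`, `PER_REPLICA_CAPACITY`, and the averaging rule are hypothetical, and an in-memory list stands in for Redis. Only the 300-second trace lifetime comes from the discussion.

```python
import math
from typing import List, Optional, Tuple

TRACE_TTL_SECONDS = 300      # trace lifetime suggested in the discussion
PER_REPLICA_CAPACITY = 10.0  # hypothetical load one replica can serve


def desired_replicas(current: int, metric: Optional[int]) -> int:
    """Autoscaler side: keep replicas unchanged when no valid metric exists."""
    if metric is None:
        return current
    return max(metric, 0)


class OptimizerSketch:
    """Optimizer side: emit a metric only once enough fresh traces exist."""

    def __init__(self, min_traces: int) -> None:
        self.min_traces = min_traces
        self.traces: List[Tuple[float, float]] = []  # (timestamp, load)

    def record(self, now: float, load: float) -> None:
        self.traces.append((now, load))

    def output(self, now: float) -> Optional[int]:
        # Drop traces older than the TTL, as an expiring store would.
        self.traces = [
            (t, load) for t, load in self.traces
            if now - t <= TRACE_TTL_SECONDS
        ]
        if len(self.traces) < self.min_traces:
            return None  # not enough data: no valid metric yet
        avg = sum(load for _, load in self.traces) / len(self.traces)
        return math.ceil(avg / PER_REPLICA_CAPACITY)


opt = OptimizerSketch(min_traces=2)
opt.record(0.0, 10.0)
warming_up = desired_replicas(3, opt.output(1.0))    # optimizer not ready
opt.record(2.0, 30.0)
steady = desired_replicas(3, opt.output(3.0))        # solution available
after_gap = desired_replicas(2, opt.output(400.0))   # all traces expired
```

In this sketch the autoscaler treats a missing metric as "hold", so an optimizer outage or warm-up period never changes the replica count by itself. A longer trace lifetime means a restarted optimizer finds enough traces to produce a solution sooner.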

@zhangjyr zhangjyr requested review from Jeffwan and nwangfw December 6, 2024 01:14
@zhangjyr zhangjyr added this to the v0.2.0 milestone Dec 6, 2024
Jingyuan Zhang and others added 2 commits December 5, 2024 17:16
@Jeffwan
Collaborator

Jeffwan commented Dec 6, 2024

In the future, let's split bug fixes and deployment support (features) into two separate PRs.

@Jeffwan Jeffwan changed the title [Bug] Fix autoscaler/gpu optimizer connectivity and integrate deployment configurations [feat] Integrate deployment configurations and fix autoscaler/gpu optimizer connectivity Dec 6, 2024
@Jeffwan Jeffwan merged commit e1f6644 into main Dec 6, 2024
10 checks passed
@Jeffwan Jeffwan deleted the issues/484_Controller_failed_to_fetch_metrics_from_MetricSource branch December 6, 2024 02:32
Development

Successfully merging this pull request may close these issues.

Controller failed to fetch metrics from MetricSource