Add probe usage practice for super large models, including multi-node case #782

Open
Jeffwan opened this issue Mar 3, 2025 · 0 comments
Assignees
Labels
area/performance kind/documentation Improvements or additions to documentation kind/enhancement New feature or request priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@Jeffwan
Collaborator

Jeffwan commented Mar 3, 2025

🚀 Feature Description and Motivation

When we deploy the DeepSeek 671B model across multiple nodes, startup takes a very long time. This raises a few issues:

  1. It's better to configure startupProbe, livenessProbe, and readinessProbe separately so that each can use its own interval.
  2. Ray cluster probes can be managed and injected by the Ray cluster controller. This helps the controller manage the RayCluster in a fault-tolerant way. However, we care more about the status of the application itself, vLLM.
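
As a rough illustration of point 1, the sketch below builds three separate probe specs and sizes the startup probe from a startup budget. Kubernetes holds off liveness and readiness checks until the startup probe succeeds, and the container has roughly `failureThreshold * periodSeconds` to come up. The `/health` path, port 8000, the one-hour budget, and the helper names `build_probes` and `startup_budget_seconds` are all assumptions for illustration, not values taken from this project.

```python
import json


def build_probes(port=8000, path="/health",
                 max_startup_seconds=3600, period_seconds=30):
    """Return startup, liveness, and readiness probe specs with separate intervals.

    All defaults are illustrative assumptions, not recommended values.
    """
    def http_get():
        # Fresh dict per probe so the specs do not share state.
        return {"path": path, "port": port}

    # Ceiling division: enough failures to cover the whole startup budget.
    failure_threshold = -(-max_startup_seconds // period_seconds)
    return {
        # Slow, tolerant check that gates the other two probes during model load.
        "startupProbe": {
            "httpGet": http_get(),
            "periodSeconds": period_seconds,
            "failureThreshold": failure_threshold,
        },
        # Restart the container only after several consecutive failures.
        "livenessProbe": {
            "httpGet": http_get(),
            "periodSeconds": 10,
            "failureThreshold": 3,
        },
        # Remove the pod from service quickly when it stops answering.
        "readinessProbe": {
            "httpGet": http_get(),
            "periodSeconds": 5,
            "failureThreshold": 1,
        },
    }


def startup_budget_seconds(probe):
    """Approximate time a container is given before the probe gives up."""
    return (probe.get("initialDelaySeconds", 0)
            + probe["periodSeconds"] * probe["failureThreshold"])


if __name__ == "__main__":
    print(json.dumps(build_probes(), indent=2))
```

The printed JSON could be pasted into a container spec. Deriving `failureThreshold` from a single time budget keeps the slow startup window from leaking into the liveness and readiness settings.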

We need to establish some best practices here: how to make the two mechanisms work together, or whether to rely on the application-level probes alone.
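
One possible way to frame the two options is a single check that always looks at the application endpoint and optionally also requires a cluster-level endpoint. This is only a sketch: `http_ok`, `pod_ready`, and both URLs are hypothetical names introduced here, and nothing below reflects how the Ray cluster controller actually injects its probes.

```python
from urllib.error import URLError
from urllib.request import urlopen


def http_ok(url, timeout=2.0):
    """Return True if a GET on url answers with a 2xx status."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (URLError, OSError, ValueError):
        # Connection errors, timeouts, non-2xx responses, and bad URLs
        # all count as "not healthy".
        return False


def pod_ready(app_url, cluster_url=None, check=http_ok):
    """Combine an application check with an optional cluster-level check.

    With cluster_url=None only the application (for example vLLM) is probed;
    otherwise both endpoints must be healthy.
    """
    if cluster_url is not None and not check(cluster_url):
        return False
    return check(app_url)
```

Calling `pod_ready(app_url)` corresponds to using the application probe alone, while passing `cluster_url` as well corresponds to making both mechanisms agree before the pod is reported ready.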

Use Case

Fault tolerance and high availability

Proposed Solution

No response

@Jeffwan Jeffwan added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. area/performance kind/documentation Improvements or additions to documentation kind/enhancement New feature or request labels Mar 3, 2025
@Jeffwan Jeffwan self-assigned this Mar 3, 2025