
[BugFix][V1] Fix memory profiling bug #18974


Open · ProExpertProg wants to merge 2 commits into main from luka/fix-memory-profiling

Conversation

@ProExpertProg (Contributor) commented May 30, 2025

This fixes a bug in memory profiling where we conservatively assume that all GPU memory in use is due to non-torch allocations made during the profiling forward pass. If two LLM instances are created in the same test case, the second one will likely conclude that it has run out of memory.

Introduced in #10528.
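For context, a minimal sketch of the accounting problem (function and variable names are illustrative, not vLLM's actual profiling code): the profiler should only charge memory that appears during its own run, not whatever was already resident on the GPU.

```python
import torch

def estimate_non_torch_memory(baseline_used_bytes: int) -> int:
    """Illustrative sketch only; not vLLM's actual profiling code.

    The buggy accounting is roughly: non_torch = used_now - torch_allocated,
    which charges everything already resident on the GPU (e.g. a second LLM
    instance in the same test) to this instance's forward pass. Subtracting
    the pre-profiling baseline avoids the double count.
    """
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    used_now = total_bytes - free_bytes
    torch_allocated = torch.cuda.memory_stats()["allocated_bytes.all.current"]
    return max(0, used_now - torch_allocated - baseline_used_bytes)
```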


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@WoosukKwon (Collaborator) left a comment


I don't think this is too conservative?
I think it's reasonable to ask that two vLLM instances should not be initialized at the same time on the same GPU.

@ProExpertProg (Contributor, Author) commented May 30, 2025

> I think it's reasonable to ask that two vLLM instances should not be initialized at the same time on the same GPU.

That makes sense, but (1) we do it in tests all the time, and (2) we're still double-counting memory right now. If something was already on the GPU before the profile run, we treat it as if it were used by the vLLM forward pass.

A way around this in tests is to init the first LLM with gpu_memory_utilization=0.45 and the second one with gpu_memory_utilization=0.9 (as it'll count both), which seems confusing.
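A sketch of that workaround in a test (the model name is chosen purely for illustration):

```python
from vllm import LLM

# With the buggy accounting, the second instance's profiling also counts the
# memory already held by the first instance, so its gpu_memory_utilization
# has to cover both instances even though it only gets ~45% of the GPU.
llm_a = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.45)
llm_b = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.9)
```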

@WoosukKwon (Collaborator)

> A way around this in tests is to init the first LLM with gpu_memory_utilization=0.45 and the second one with gpu_memory_utilization=0.9 (as it'll count both), which seems confusing.

Oh I didn't know this. Then, this is clearly a bug. Sorry for that 🙏

@ProExpertProg (Contributor, Author)

Oh, no worries at all. It took me a really long time of debugging to understand why this edge case wasn't working right. I'll add an extra defensive check.

@ProExpertProg force-pushed the luka/fix-memory-profiling branch from 2ede33a to a509587 on May 31, 2025 00:03
@youkaichao (Member)

There's a similar PR, #18296, that tries to solve this by reusing the utility functions from v0. Can you help take a look to see which one works better? @ProExpertProg

@NickLucche (Contributor)

Also highly related: #17122 tries to port the same v0 functionality, with logging on top.

@maxdebayser (Contributor)

I opened an issue for this last week: #18854. I'm glad I'm not the only one to experience this problem. Last week I also opened a PR for this; both do the same thing, but this one has more complete logging.

@maxdebayser (Contributor)

As far as I can see, the only functional difference from @calvin0327's PR is that one measures torch memory usage with allocated_bytes.all.current and the other with reserved_bytes.all.current. Since reserved is an upper bound on allocated, it seems to me that using reserved is more conservative and less likely to cause OOM errors.
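For reference, both counters mentioned above come from torch.cuda.memory_stats():

```python
import torch

stats = torch.cuda.memory_stats()
# Bytes currently handed out to live tensors by the caching allocator.
allocated = stats["allocated_bytes.all.current"]
# Bytes the allocator has reserved from the CUDA driver; always >= allocated,
# so using it as the "torch" figure leaves a larger safety margin against OOM.
reserved = stats["reserved_bytes.all.current"]
```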

@WoosukKwon added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jun 6, 2025
@WoosukKwon (Collaborator)

Basically, I think

  • We should support multiple vLLM instances per GPU, and the gpu_memory_utilization of each instance should mean how much each instance is using. For example, we should be able to support two vLLM instances running on the same GPU with gpu_memory_utilization=0.45.
  • That said, I don't mean that we should support spinning up multiple vLLM instances at the same time. I think it's reasonable to require that only one vLLM instance should be initialized at a time.
  • I believe we can't ignore the GPU memory usage outside PyTorch. NCCL, CUDA graphs, and some other GPU libraries allocate their own workspace.
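Putting rough, purely illustrative numbers on the last point (an 80 GiB GPU and one instance with gpu_memory_utilization=0.45; this is a sketch of the kind of budget arithmetic involved, not vLLM's exact formula):

```python
# All figures in GiB and purely illustrative.
total = 80.0
budget = 0.45 * total          # 36.0 GiB available to this instance
weights = 14.0                 # model weights
peak_activations = 2.0         # peak torch activations during profiling
non_torch = 1.5                # NCCL buffers, CUDA graphs, other workspace
kv_cache = budget - weights - peak_activations - non_torch
print(f"KV cache budget: {kv_cache:.1f} GiB")  # -> 18.5 GiB
```

Ignoring the non-torch term would overstate the KV-cache budget and risk OOM at runtime.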

@WoosukKwon enabled auto-merge (squash) on June 6, 2025 03:41
@WoosukKwon added this to the v0.9.1 milestone on Jun 6, 2025
Auto-merge was automatically disabled on June 7, 2025 02:03 because the head branch was pushed to by a user without write access.

@ProExpertProg force-pushed the luka/fix-memory-profiling branch from a509587 to 90a6e99 on June 7, 2025 02:03
Labels: ready (ONLY add when PR is ready to merge/full CI is needed), v1
Projects: none yet
5 participants