
[BugFix][V1] Fix memory profiling bug #18974


Open · ProExpertProg wants to merge 2 commits into main from luka/fix-memory-profiling

Conversation

@ProExpertProg (Contributor) commented May 30, 2025

This fixes a bug in memory profiling where we conservatively assume that all GPU memory in use is due to non-torch allocations made during the profiling forward pass. If two LLM instances are created in the same test case, the second one will likely conclude that it has run out of memory.

Introduced in #10528.
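For context, a minimal sketch of the accounting problem (function and variable names are illustrative, not vLLM's actual profiling code): the profiler should only charge memory that appears during its own run, not whatever was already resident on the GPU.

```python
import torch

def estimate_non_torch_memory(baseline_used_bytes: int) -> int:
    """Illustrative sketch only; not vLLM's actual profiling code.

    The buggy accounting is roughly: non_torch = used_now - torch_allocated,
    which charges everything already resident on the GPU (e.g. a second LLM
    instance in the same test) to this instance's forward pass. Subtracting
    the pre-profiling baseline avoids the double count.
    """
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    used_now = total_bytes - free_bytes
    torch_allocated = torch.cuda.memory_stats()["allocated_bytes.all.current"]
    return max(0, used_now - torch_allocated - baseline_used_bytes)
```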


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@WoosukKwon (Collaborator) left a comment


I don't think this is too conservative?
I think it's reasonable to ask that two vLLM instances should not be initialized at the same time on the same GPU.

@ProExpertProg (Contributor, Author) commented May 30, 2025

> I think it's reasonable to ask that two vLLM instances should not be initialized at the same time on the same GPU.

That makes sense, but (1) we do it in tests all the time, and (2) we're still double-counting memory right now. If something was already on the GPU before the profile run, we treat it as if it were used by the vLLM forward pass.

A way around this in tests is to init the first LLM with gpu_memory_utilization=0.45 and the second one with gpu_memory_utilization=0.9 (as it'll count both), which seems confusing.
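A sketch of that workaround in a test (the model name is chosen purely for illustration):

```python
from vllm import LLM

# With the buggy accounting, the second instance's profiling also counts the
# memory already held by the first instance, so its gpu_memory_utilization
# has to cover both instances even though it only gets ~45% of the GPU.
llm_a = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.45)
llm_b = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.9)
```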

@WoosukKwon (Collaborator)

> A way around this in tests is to init the first LLM with gpu_memory_utilization=0.45 and the second one with gpu_memory_utilization=0.9 (as it'll count both), which seems confusing.

Oh I didn't know this. Then, this is clearly a bug. Sorry for that 🙏

@ProExpertProg (Contributor, Author)

Oh, no worries at all. It took me a really long time of debugging to understand why this edge case wasn't working right. I'll add an extra defensive check.

@ProExpertProg force-pushed the luka/fix-memory-profiling branch from 2ede33a to a509587 on May 31, 2025 00:03
@youkaichao (Member)

There's a similar PR, #18296, that tries to solve this by reusing the utility functions from v0. Can you help take a look to see which one works better? @ProExpertProg

@NickLucche (Contributor)

Also highly related: #17122 tries to port the same v0 functionality, with logging on top.

@maxdebayser (Contributor)

I opened an issue for this last week: #18854. I'm glad I'm not the only one to experience this problem. Last week I also opened a PR for this; both do the same thing, but this one has more complete logging.

@maxdebayser (Contributor)

As far as I can see, the only functional difference from @calvin0327's PR is that one measures torch memory usage with allocated_bytes.all.current and the other with reserved_bytes.all.current. Since reserved is an upper bound on allocated, it seems to me that using reserved is more conservative and less likely to cause OOM errors.
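For reference, both counters mentioned above come from torch.cuda.memory_stats():

```python
import torch

stats = torch.cuda.memory_stats()
# Bytes currently handed out to live tensors by the caching allocator.
allocated = stats["allocated_bytes.all.current"]
# Bytes the allocator has reserved from the CUDA driver; always >= allocated,
# so using it as the "torch" figure leaves a larger safety margin against OOM.
reserved = stats["reserved_bytes.all.current"]
```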

@WoosukKwon added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jun 6, 2025
@WoosukKwon (Collaborator)

Basically, I think

  • We should support multiple vLLM instances per GPU, and the gpu_memory_utilization of each instance should mean how much each instance is using. For example, we should be able to support two vLLM instances running on the same GPU with gpu_memory_utilization=0.45.
  • That said, I don't mean that we should support spinning up multiple vLLM instances at the same time. I think it's reasonable to require that only one vLLM instance should be initialized at a time.
  • I believe we can't ignore the GPU memory usage outside PyTorch. NCCL, CUDA graphs, and some other GPU libraries allocate their own workspace.
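Putting rough, purely illustrative numbers on the last point (an 80 GiB GPU and one instance with gpu_memory_utilization=0.45; this is a sketch of the kind of budget arithmetic involved, not vLLM's exact formula):

```python
# All figures in GiB and purely illustrative.
total = 80.0
budget = 0.45 * total          # 36.0 GiB available to this instance
weights = 14.0                 # model weights
peak_activations = 2.0         # peak torch activations during profiling
non_torch = 1.5                # NCCL buffers, CUDA graphs, other workspace
kv_cache = budget - weights - peak_activations - non_torch
print(f"KV cache budget: {kv_cache:.1f} GiB")  # -> 18.5 GiB
```

Ignoring the non-torch term would overstate the KV-cache budget and risk OOM at runtime.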

@WoosukKwon enabled auto-merge (squash) on June 6, 2025 03:41
@WoosukKwon added this to the v0.9.1 milestone on Jun 6, 2025
Auto-merge was automatically disabled on June 7, 2025 02:03 because the head branch was pushed to by a user without write access.

@ProExpertProg force-pushed the luka/fix-memory-profiling branch from a509587 to 90a6e99 on June 7, 2025 02:03
Labels: ready (ONLY add when PR is ready to merge/full CI is needed), v1
Projects: none yet
5 participants