
Conversation


@rafvasq rafvasq commented Mar 13, 2025

This PR uses execute_model for model warmup in v1.worker.SpyreWorker.compile_or_warm_up_model instead of a separate dummy forward pass.

Closes #12
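
At a high level, the change builds dummy request data, wraps it in a scheduler output, and pushes it through the worker's regular execute_model path. A minimal sketch of that flow, where self._warmup_shapes and _build_dummy_scheduler_output are assumed names standing in for the configured warmup shapes and the dummy-data setup (not actual functions in this PR):

def compile_or_warm_up_model(self) -> None:
    # Iterate over the configured (prompt_len, num_decode_tokens, batch_size)
    # combinations; names are assumptions for this sketch.
    for prompt_len, num_decode_tokens, batch_size in self._warmup_shapes:
        scheduler_output = self._build_dummy_scheduler_output(
            prompt_len, batch_size)
        # Run the worker's normal execution path instead of a separate
        # dummy forward pass; decode-step handling is discussed in the
        # review comments below.
        self.execute_model(scheduler_output)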

@rafvasq rafvasq requested a review from joerunde March 13, 2025 15:26

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes:

pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

@rafvasq rafvasq changed the title Use execute_model for warmup (Draft) Use execute_model for warmup Mar 13, 2025
rafvasq added 6 commits March 13, 2025 18:00
Signed-off-by: Rafael Vasquez <[email protected]>
Signed-off-by: Rafael Vasquez <[email protected]>
Signed-off-by: Rafael Vasquez <[email protected]>
Signed-off-by: Rafael Vasquez <[email protected]>
Signed-off-by: Rafael Vasquez <[email protected]>
@rafvasq rafvasq marked this pull request as ready for review March 14, 2025 16:14
@rafvasq rafvasq changed the title (Draft) Use execute_model for warmup Use execute_model for warmup Mar 14, 2025
@joerunde

Can _raw_model_forward on the model runner be removed too?

Signed-off-by: Rafael Vasquez <[email protected]>
@yannicks1 yannicks1 self-requested a review March 15, 2025 10:47
print(f"[SpyreWorker] Warming up for prompt length {prompt_len}, "
f"decoding {num_decode_tokens} tokens with batch "
f"size {batch_size}")
self._warmup_spyre_fixed_size(prompt_len, num_decode_tokens,

personally, I like the idea of a helper function here to make things more readable.


+1, specifically avoiding double-nested for loops is nice to do
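
For illustration, a helper along these lines would flatten the loop (spyre_warmup_shapes and _warmup_spyre_shape are assumed names for this sketch, and the argument list of _warmup_spyre_fixed_size is abbreviated):

def _warmup_spyre_shape(self, prompt_len: int, num_decode_tokens: int,
                        batch_size: int) -> None:
    """Warm up a single (prompt_len, num_decode_tokens, batch_size) shape."""
    print(f"[SpyreWorker] Warming up for prompt length {prompt_len}, "
          f"decoding {num_decode_tokens} tokens with batch "
          f"size {batch_size}")
    self._warmup_spyre_fixed_size(prompt_len, num_decode_tokens,
                                  batch_size)

def compile_or_warm_up_model(self) -> None:
    # One flat loop over the configured shapes instead of nested loops.
    for prompt_len, num_decode_tokens, batch_size in self._warmup_shapes:
        self._warmup_spyre_shape(prompt_len, num_decode_tokens, batch_size)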

dummy_requests.append(
    NewRequestData(
        req_id=f"warmup-{i}",
        prompt_token_ids=[1] * prompt_len,

We previously sampled these tokens from a list of valid tokens:

valid_token_ids = [ i for i in range(1, vocab_size) if i not in set(special_token_ids)]

where special_token_ids contains the BOS, EOS, and pad token ids. Not sure whether this was needed, or what happens if any of those special token ids happens to be 1 in your case.
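
For reference, a small sketch of that sampling approach, assuming vocab_size, special_token_ids and prompt_len are already known for the model:

import random

# Token ids that are safe to use in dummy prompts: everything except the
# special ids (BOS, EOS, pad, ...).
valid_token_ids = [i for i in range(1, vocab_size)
                   if i not in set(special_token_ids)]

# Sample a dummy prompt of length prompt_len from the valid ids.
prompt_token_ids = random.choices(valid_token_ids, k=prompt_len)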


@joerunde joerunde Mar 17, 2025


Hah, for the very first dummy model I checked this is true: https://huggingface.co/JackFram/llama-160m/blob/main/config.json#L6

I think vllm uses a bunch of repeated token id 0 for profiling, since the input ids tensor is just initialized with torch.zeros and for text-only models it's not updated for profiling.

The general idea with setting a repeated token ID was to have the model continue the sequence, so it doesn't end up hitting an eos token early and stopping. But if we control the loop here that keeps invoking the model, maybe that doesn't matter.


Yeah, I also suspect it does not matter since we force the decode steps in the loop, but better safe than sorry :)

)

# Use execute_model for warm up
self.execute_model(scheduler_output)

As far as I understand this executes the model only for one step. In your case it does the prefill and generates the 1st token. We need to warm up not only for prefill, but also for (num_decode_tokens - 1) decode steps (since prefill already produced 1 token).
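
In code, that amounts to one execute_model call for the prefill followed by a loop of decode steps; a sketch mirroring the snippet quoted further down, where cached_requests is the decode-phase version of the dummy requests:

# Prefill: processes the dummy prompts and produces the first token.
self.execute_model(scheduler_output)

# Decode: re-submit the same requests as cached requests so that each
# further call runs one decode step. Prefill already produced 1 token,
# so (num_decode_tokens - 1) steps remain.
scheduler_output.scheduled_new_reqs = []
scheduler_output.scheduled_cached_reqs = cached_requests
for _ in range(num_decode_tokens - 1):
    self.execute_model(scheduler_output)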

Comment on lines 109 to 116
if envs_spyre.VLLM_SPYRE_DYNAMO_BACKEND == "sendnn_decoder":
    from torch_sendnn import torch_sendnn
    ul_start_time = time.time()
    torch_sendnn.update_lazyhandle()
    ul_stop_time = time.time()
    ul_total_t = ul_stop_time - ul_start_time
    print(f"update_lazyhandle() done (duration: {ul_total_t}s)")


After torch_sendnn.update_lazyhandle() there is a second complete warmup needed.

To sum up:

  1. complete forward pass: prefill plus (num_decode_tokens - 1) decode steps
  2. torch_sendnn.update_lazyhandle()
  3. complete forward pass: prefill plus (num_decode_tokens - 1) decode steps

See comment below.
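
A rough sketch of that full sequence, using a helper that performs one complete prefill-plus-decode pass (the helper name matches _warmup_model_forward_pass from the original implementation; its argument list here is illustrative):

# 1. First complete pass: prefill + (num_decode_tokens - 1) decode steps.
self._warmup_model_forward_pass(scheduler_output, num_decode_tokens)

# 2. torch_sendnn.update_lazyhandle() (sendnn_decoder backend only).
if envs_spyre.VLLM_SPYRE_DYNAMO_BACKEND == "sendnn_decoder":
    from torch_sendnn import torch_sendnn
    torch_sendnn.update_lazyhandle()

# 3. Second complete pass on the now-compiled model.
self._warmup_model_forward_pass(scheduler_output, num_decode_tokens)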


Thanks for clarifying this, learning as I go but it makes sense to me now.

I took another stab at it, still trying to use execute_model to avoid doing anything manually except the dummy data setup.

Comment on lines 250 to 295
# 1. trace
print("[SpyreWorker] warmup 1/2...")
# TODO: torch_sendnn.CleanGraph() should be necessary?
# warmup 1st forward pass
self._warmup_model_forward_pass(warmup_tokens_tensor,
                                valid_token_ids_tensor, prompt_len,
                                num_decode_tokens, batch_size,
                                extra_kwargs)

# 2. compile
print("[SpyreWorker] warmup 2/2...")
if envs_spyre.VLLM_SPYRE_DYNAMO_BACKEND == "sendnn_decoder":
    from torch_sendnn import torch_sendnn
    ul_start_time = time.time()
    torch_sendnn.update_lazyhandle()
    ul_stop_time = time.time()
    ul_total_t = ul_stop_time - ul_start_time
    print(f"update_lazyhandle() done (duration: {ul_total_t}s)")

# warmup 2nd forward pass
self._warmup_model_forward_pass(warmup_tokens_tensor,
                                valid_token_ids_tensor, prompt_len,
                                num_decode_tokens, batch_size,
                                extra_kwargs)

this has to happen for each warmup shape (combination of prompt_len, num_decode_tokens, batch_size)

@wallashss

Unfortunately, I could not try your changes yet because of problems in my dev environment. But besides the feedback from the other reviewers, I feel that the final code should have more comments, especially covering the points raised by @yannicks1, for better understanding later.

@joerunde

@rafvasq Can you also update all the prints in this code to logging statements?

See #31 for context. I didn't update the prints in this warmup code because I knew you were rewriting it.
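
For example, a print like the one in the snippet above can be converted roughly as follows (assuming the plugin uses vLLM's init_logger helper, as in #31; the standard logging module works the same way):

from vllm.logger import init_logger

logger = init_logger(__name__)

# Before:
#   print(f"update_lazyhandle() done (duration: {ul_total_t}s)")
# After (lazy %-formatting keeps the message cheap when filtered out):
logger.info("update_lazyhandle() done (duration: %ss)", ul_total_t)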

rafvasq added 5 commits March 18, 2025 14:23
Signed-off-by: Rafael Vasquez <[email protected]>
Signed-off-by: Rafael Vasquez <[email protected]>
Signed-off-by: Rafael Vasquez <[email protected]>
Signed-off-by: Rafael Vasquez <[email protected]>
@rafvasq rafvasq requested a review from yannicks1 March 19, 2025 16:02
Signed-off-by: Rafael Vasquez <[email protected]>

@joerunde joerunde left a comment


LGTM!

I'll let @yannicks1 take another look through since he has a better understanding of the requirements


@yannicks1 yannicks1 left a comment


Other than the two minor comments this looks good to me! Thanks for contributing!

PS: I am mildly confused why the Spyre tests did not succeed.
tests/test_spyre_embeddings.py fails with V0, but since your code changes only touch V1, the cause has got to be somewhere else...

Comment on lines 306 to 316
logger.info("Warmup 1/2: Prefill...")
self.execute_model(scheduler_output) # Prefill step

# Switch to cached requests to trigger decoding steps
scheduler_output.scheduled_new_reqs = []
scheduler_output.scheduled_cached_reqs = cached_requests

logger.info("Warmup 1/2: Decoding...")
for _ in range(num_decode_tokens - 1):
    self.execute_model(scheduler_output)


Personally, I am for the use of helper functions wherever they help reduce code duplication. I am aware it's just 10 lines here, but they could be eliminated by reusing the _warmup_model_forward_pass we introduced in the original implementation.


@rafvasq rafvasq Mar 20, 2025


Sounds good, I re-introduced _warmup_model_forward_pass to handle the duplicate pass code

* 🐛 fix batch handling in V1 runner

Signed-off-by: Joe Runde <[email protected]>

* ⚗️ try v1 test only

Signed-off-by: Joe Runde <[email protected]>

* ⚗️ add a bit more prompt

Signed-off-by: Joe Runde <[email protected]>

* ⚗️ unclear why CI won't count to 0

Signed-off-by: Joe Runde <[email protected]>

* ♻️ rename map_output_indices

Signed-off-by: Joe Runde <[email protected]>

---------

Signed-off-by: Joe Runde <[email protected]>
@rafvasq rafvasq requested a review from yannicks1 March 20, 2025 19:13

@yannicks1 yannicks1 left a comment


LGTM! Thanks for contributing.

@yannicks1 yannicks1 merged commit a2159f8 into vllm-project:main Mar 21, 2025
9 checks passed
@rafvasq rafvasq deleted the refactor-warmup branch March 21, 2025 15:58


Successfully merging this pull request may close these issues.

[V1] Use the "real" model forward pass for warmup
