[Bug] Launching Llama-3.2-11B-Vision-Instruct just hangs on generation #2619

Open
5 tasks done
SuperMasterBlasterLaser opened this issue Dec 27, 2024 · 5 comments

@SuperMasterBlasterLaser

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I rented an RTX 6000 Ada GPU with 48 GB of VRAM via vast.ai.

Specs:

  1. Ubuntu 22.04
  2. PyTorch 2.4.1
  3. CUDA 12.4

Then I installed flashinfer with this command:

pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4/

Then I installed sglang with this command:

pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/

Then downloaded Llama-3.2-11B-Vision-Instruct and launched it like this:

python -m sglang.launch_server --model-path /root/Llama-3.2-11B-Vision-Instruct --port 8080 --host 0.0.0.0

Then I used this simple code to run inference on an image:

import sglang as sgl


base_url = "url.to.my.server"  # placeholder for the address of the server launched above (port 8080)

@sgl.function
def caption_image(s, image_file):
    s += sgl.user(sgl.image(image_file) + "What is the overall style of this image?")
    s += sgl.assistant(sgl.gen("global_style", choices=["cinematic", "animated", "anime", "3d", "cartoon"]))
    s += sgl.user("Overall description of this image:")
    s += sgl.assistant(sgl.gen("description", max_tokens=255))


sgl.set_default_backend(sgl.RuntimeEndpoint(base_url))

image_path = "./example.png"

state = caption_image.run(image_file=image_path)

print(state["global_style"])
print(state["description"])
print(state.text())

However, when I ran this code to test a single image, it just hangs: I get no response and no error message.
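
Because the client gives no feedback while it hangs, it can help to bound the call so the script fails loudly instead of waiting forever. The following is a minimal sketch using only the standard library; `run_with_timeout` is a hypothetical helper, not part of sglang, and the 120-second limit in the usage comment is an arbitrary assumption.

```python
import threading


def run_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a daemon thread and raise TimeoutError if it takes too long.

    Hypothetical helper: the worker thread is not killed on timeout, it is
    only abandoned, so this is a diagnostic aid and not a fix for the hang.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(*args, **kwargs)
        except Exception as exc:  # pass any error back to the caller
            outcome["error"] = exc

    # A daemon thread does not keep the interpreter alive if it never returns.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    if thread.is_alive():
        raise TimeoutError(f"call did not finish within {timeout_s} s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


# Usage with the script above (assumed 120 s limit):
# state = run_with_timeout(caption_image.run, 120, image_file=image_path)
```

With this wrapper the script above would raise `TimeoutError` after the limit, which makes the hang easier to report and to reproduce in automated checks.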

Logs:

[2024-12-27 17:04:20 TP0] Overlap scheduler is disabled for multimodal models.
[2024-12-27 17:04:20 TP0] Automatically turn off --chunked-prefill-size for mllama.
[2024-12-27 17:04:20 TP0] Init torch distributed begin.
[2024-12-27 17:04:21 TP0] Load weight begin. avail mem=46.99 GB
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:00,  4.62it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:01,  1.63it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.22it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.15it/s]

[2024-12-27 17:04:26 TP0] Load weight end. type=MllamaForConditionalGeneration, dtype=torch.bfloat16, avail mem=26.84 GB
[2024-12-27 17:04:26 TP0] Memory pool end. avail mem=6.62 GB
[2024-12-27 17:04:26 TP0] Capture cuda graph begin. This can take up to several minutes.
[00:11<00:00,  2.00it/s]
[2024-12-27 17:04:38 TP0] Capture cuda graph end. Time elapsed: 11.53 s
[2024-12-27 17:04:38 TP0] max_total_num_tokens=125417, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2024-12-27 17:04:39] INFO:     Started server process [1126]
[2024-12-27 17:04:39] INFO:     Waiting for application startup.
[2024-12-27 17:04:39] INFO:     Application startup complete.
[2024-12-27 17:04:39] INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
[2024-12-27 17:04:40] INFO:     127.0.0.1:34840 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-12-27 17:04:40 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-12-27 17:04:40] INFO:     127.0.0.1:34854 - "POST /generate HTTP/1.1" 200 OK
[2024-12-27 17:04:40] The server is fired up and ready to roll!
[2024-12-27 17:04:47] INFO:     91.198.101.42:57416 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-12-27 17:05:18 TP0] Prefill batch. #new-seq: 1, #new-token: 6425, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:18] INFO:     91.198.101.42:14407 - "POST /generate HTTP/1.1" 200 OK
[2024-12-27 17:05:20 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6423, cache hit rate: 49.95%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:21 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 6423, cache hit rate: 66.61%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:23 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 6423, cache hit rate: 74.94%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:24 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6423, cache hit rate: 79.94%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:24 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6423, cache hit rate: 83.27%, token usage: 0.05, #running-req: 0, #queue-req: 0

I don't understand why this is happening.
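
To make the pattern in the log above easier to see, here is a small sketch that extracts the counters from the `Prefill batch` lines. The regular expression is written against the log format quoted in this issue only; `parse_prefill_lines` is a hypothetical helper and the sample text is copied from the log above.

```python
import re

# Pattern for the counters in "Prefill batch" lines as they appear in the log above.
PREFILL_RE = re.compile(
    r"#new-seq: (?P<new_seq>\d+), "
    r"#new-token: (?P<new_token>\d+), "
    r"#cached-token: (?P<cached_token>\d+)"
)


def parse_prefill_lines(log_text):
    """Return one dict of integer counters per 'Prefill batch' log line."""
    rows = []
    for line in log_text.splitlines():
        if "Prefill batch." not in line:
            continue  # skip HTTP access lines and other messages
        match = PREFILL_RE.search(line)
        if match:
            rows.append({key: int(value) for key, value in match.groupdict().items()})
    return rows


SAMPLE_LOG = """\
[2024-12-27 17:05:18 TP0] Prefill batch. #new-seq: 1, #new-token: 6425, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:18] INFO:     91.198.101.42:14407 - "POST /generate HTTP/1.1" 200 OK
[2024-12-27 17:05:20 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6423, cache hit rate: 49.95%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:21 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 6423, cache hit rate: 66.61%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:23 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 6423, cache hit rate: 74.94%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:24 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6423, cache hit rate: 79.94%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:24 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6423, cache hit rate: 83.27%, token usage: 0.05, #running-req: 0, #queue-req: 0
"""

rows = parse_prefill_lines(SAMPLE_LOG)
# Batches that reuse an already cached prefix (here, the image prompt).
followups = [row for row in rows if row["cached_token"] > 0]

print(f"prefill batches: {len(rows)}, reusing a cached prefix: {len(followups)}")
for row in followups:
    print(row["new_token"], "new tokens on top of", row["cached_token"], "cached")
```

On this excerpt, the large 6425-token prefill is followed by five short prefills that each reuse 6423 cached tokens. That count matches the five entries in `choices`, which would be consistent with each choice being processed separately against the cached prompt, although that reading is an assumption and is not confirmed anywhere in this thread.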

Reproduction

See the description above.

Environment

Specs:

  1. Ubuntu 22.04
  2. PyTorch 2.4.1
  3. CUDA 12.4
  4. RTX 6000 Ada
@SuperMasterBlasterLaser
Author

I found that it hangs when I use select or gen with choices, but a plain gen without any constraints returns generated results.
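
Since plain gen still works, one possible stopgap (an assumption, not a fix confirmed in this thread) is to generate free text and map it onto the allowed labels on the client side. The sketch below uses only the standard library; `match_choice` is a hypothetical helper and the matching rule is deliberately simple.

```python
CHOICES = ["cinematic", "animated", "anime", "3d", "cartoon"]


def match_choice(generated_text, choices):
    """Return the first allowed choice mentioned in the text, or None.

    Hypothetical client-side helper: it does a case-insensitive substring
    search and checks longer labels first, so a longer label is preferred
    over any shorter label that happens to be contained in it.
    """
    normalized = generated_text.strip().lower()
    for choice in sorted(choices, key=len, reverse=True):
        if choice.lower() in normalized:
            return choice
    return None


print(match_choice("The overall style is Anime.", CHOICES))
print(match_choice("I cannot tell.", CHOICES))
```

Unlike select, this does not constrain decoding, so the model can answer with something outside the list; the helper returns `None` in that case and the caller has to decide how to handle it.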

@bluenevus

I had to roll back to v0.4.0 for the 11B vision model to work again. It errors out on v0.4.1 for me.
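
When comparing behaviour across releases like this, it helps to record which versions are actually installed in the environment. A minimal sketch using the standard library follows; the distribution names `sglang` and `flashinfer` are taken from the install commands earlier in this issue, and `installed_version` is a hypothetical helper.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names assumed from the pip commands quoted in this issue.
for name in ("sglang", "flashinfer", "torch"):
    print(f"{name}: {installed_version(name)}")
```

Including this output in a report makes it clear whether a result comes from v0.4.0 or v0.4.1.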

@SuperMasterBlasterLaser
Author

@bluenevus does gen with choices or select methods work on v0.4.0?

@SuperMasterBlasterLaser
Author

OK. I thought the hang was caused by a lack of GPU VRAM, so I rented an H100 with 80 GB of VRAM, launched Llama-3.2-11B-Vision-Instruct, and ran this simple script:

import sglang as sgl

base_url = "url.to.my.server"  # same placeholder as in the first script

@sgl.function
def caption_image(s, image_file):
    s += "You are very smart image captioning service"
    s += "Given this image: " + sgl.image(image_file)
    # Constrained selection among fixed labels; this is the step that hangs.
    s += "Overall style of this image is: " + sgl.select("global_style", choices=["cinematic", "animated", "anime", "3d", "cartoon", "digital art"])

sgl.set_default_backend(sgl.RuntimeEndpoint(base_url))

image_path = "./examples/image.png"

state = caption_image.run(image_file=image_path)

print(state["global_style"])

And it still hangs with these logs:

[2025-01-01 15:27:41 TP0] Prefill batch. #new-seq: 1, #new-token: 2, #cached-token: 6426, cache hit rate: 88.85%, token usage: 0.02, #running-req: 0, #queue-req: 0
[2025-01-01 15:27:41 TP0] Prefill batch. #new-seq: 1, #new-token: 2, #cached-token: 6426, cache hit rate: 89.96%, token usage: 0.02, #running-req: 0, #queue-req: 0
[2025-01-01 15:27:41 TP0] Prefill batch. #new-seq: 1, #new-token: 2, #cached-token: 6426, cache hit rate: 90.87%, token usage: 0.02, #running-req: 0, #queue-req: 0
[2025-01-01 15:27:42 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6426, cache hit rate: 91.63%, token usage: 0.02, #running-req: 0, #queue-req: 0
[2025-01-01 15:27:42 TP0] Prefill batch. #new-seq: 1, #new-token: 2, #cached-token: 6426, cache hit rate: 92.27%, token usage: 0.02, #running-req: 0, #queue-req: 0
[2025-01-01 15:27:42 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 6426, cache hit rate: 92.82%, token usage: 0.02, #running-req: 0, #queue-req: 0

Then I changed --max-prefill-tokens to 8291, and it still did not work. Then I switched the model to LLaVA, and it also hangs on choice generation with the same logs.

I think the choices and select methods are broken for multimodal models in general, or do not work at all.

@bluenevus

> @bluenevus does gen with choices or select methods work on v0.4.0?

Not sure what that means, but you can see my compose configuration here:

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ['1', '3']
          capabilities: [gpu]
shm_size: '32gb'
ipc: host
ports:
  - "8011:30000"
volumes:
  - ~/.cache/huggingface:/root/.cache/huggingface
command: >
  python3 -m sglang.launch_server
  --model-path alpindale/Llama-3.2-11B-Vision-Instruct
  --host 0.0.0.0
  --port 30000
  --device cuda
  --kv-cache-dtype auto
  --dtype float16
  --tp-size 2
  --context-length 32768
  --max-running-requests 12
  --attention-backend flashinfer
  --sampling-backend flashinfer
  --trust-remote-code
  --mem-fraction-static 0.95
  --disable-cuda-graph
  --enable-torch-compile
  --chat-template llama_3_vision
  --grammar-backend xgrammar
