[Bug] OOM when setting return_logprob=True #2607

Open
5 tasks done
CSammyfd opened this issue Dec 27, 2024 · 2 comments


@CSammyfd

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

  • Background: for each text, I only need a single forward pass and the probabilities of some tokens in the prompt.
  • Action: I set return_logprob=True and logprob_start_len=2000 (with max_new_tokens=1).
  • Result:
      1. When sending a single request, it works correctly and I get the probabilities of the last 2000 tokens.
      2. When sending multiple requests (50 requests within 1 second), the server runs out of memory (OOM).
      3. When sending the same 50 requests without the return_logprob parameter, the server works normally.

Is the problem caused by return_logprob?
I ran into a similar problem with vLLM and found these related issues:
vllm-project/vllm#5067
vllm-project/vllm#1532
vllm-project/vllm#5907
https://github.com/vllm-project/vllm/pull/5355
So maybe the cause is the same: the CUDA memory used to compute prompt_logprobs is not counted during the profiling run.
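To give a sense of scale for the suspected cause, here is a rough back-of-envelope estimate. It is only a sketch under stated assumptions, not a description of what SGLang actually does internally: it assumes the server materializes a full-vocabulary float32 logits tensor for every prompt position whose logprob is requested, and that all 50 requests end up in the same batch. The vocabulary size (151,936, taken from the Qwen2.5 config) and the 2000 positions per request are illustrative inputs, and the function name is hypothetical.

```python
GIB = 1024 ** 3


def logprob_logits_bytes(num_positions, vocab_size, bytes_per_value=4, concurrent_requests=1):
    """Bytes needed to hold full-vocabulary logits for every requested position.

    Assumption: one value per (position, vocabulary entry), nothing shared
    or freed between requests that are processed together.
    """
    return num_positions * vocab_size * bytes_per_value * concurrent_requests


# Illustrative inputs (assumptions, not measurements from the server).
VOCAB_SIZE = 151_936      # Qwen2.5 tokenizer/config vocabulary size
POSITIONS = 2000          # prompt positions with logprobs per request

per_request = logprob_logits_bytes(POSITIONS, VOCAB_SIZE)
all_requests = logprob_logits_bytes(POSITIONS, VOCAB_SIZE, concurrent_requests=50)

print(f"per request: {per_request / GIB:.2f} GiB")
print(f"50 requests: {all_requests / GIB:.1f} GiB")
```

Under these assumptions a single request needs a little over 1 GiB for the logits alone, while 50 requests batched together would need tens of GiB. That would be consistent with the single request succeeding and the burst failing, but it does not confirm the cause.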

Reproduction

import json
import threading
import time
from queue import Queue

import requests
from transformers import AutoTokenizer

model_name = "./models/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build one long prompt by repeating a short sentence.
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt * 300},
    {"role": "assistant", "content": prompt * 600},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
)


def send_request(index, text, queue):
    # Generate a single token and ask for prompt logprobs from position 3000 onward.
    response = requests.post(
        "http://10.48.2.2:30000/generate",
        json={
            "text": text,
            "sampling_params": {
                "temperature": 0,
                "max_new_tokens": 1,
            },
            "return_logprob": True,
            "logprob_start_len": 3000,
        },
    )
    queue.put((index, response))


# Fire 50 requests at (almost) the same time, one thread per request.
start_time = time.time()
results_queue = Queue()
threads = []
for i in range(50):
    thread = threading.Thread(target=send_request, args=(i, text, results_queue))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
end_time = time.time()

print(end_time - start_time)

# Put the responses back in request order.
completion_list = [None for _ in range(50)]
for _ in range(50):
    index, result = results_queue.get()
    completion_list[index] = result

res_json = json.loads(completion_list[0].text)
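While the server-side cause is open, one possible client-side mitigation is to cap how many requests are in flight at once instead of starting all 50 threads together. The following is a minimal sketch, assuming that fewer concurrent logprob requests keep the server within memory, which has not been verified here. `run_bounded` and `fake_send` are hypothetical names; in the reproduction, `send` would be a function that posts one payload to /generate and returns the response.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_bounded(send, payloads, max_in_flight=4):
    """Call send(payload) for each payload with at most max_in_flight calls
    running concurrently. Results are returned in the order of payloads."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(send, payloads))


def fake_send(payload):
    # Stand-in for the HTTP call so the sketch runs without a server.
    time.sleep(0.01)
    return {"index": payload, "ok": True}


results = run_bounded(fake_send, range(50), max_in_flight=4)
print(len(results), results[0])
```

Because `ThreadPoolExecutor.map` preserves input order, the index bookkeeping through a queue is no longer needed. The value of `max_in_flight` would have to be tuned against the actual server.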

Environment

2024-12-27 10:51:32,563 - modelscope - INFO - PyTorch version 2.4.0 Found.
2024-12-27 10:51:32,564 - modelscope - INFO - Loading ast index from /home/yanhui_he/.cache/modelscope/ast_indexer
2024-12-27 10:51:32,717 - modelscope - INFO - Loading done! Current index file version is 1.13.3, with md5 cac1c2695a261ce83ddea2be8560cb8b and a total number of 972 components indexed
WARNING 12-27 10:51:34 cuda.py:22] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7,8,9: NVIDIA A100 80GB PCIe
GPU 0,1,2,3,4,5,6,7,8,9 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda-12.1
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 550.54.15
PyTorch: 2.4.0+cu121
sglang: 0.4.1
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.0
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.9.5
fastapi: 0.115.0
hf_transfer: Module Not Found
huggingface_hub: 0.24.5
interegular: 0.3.3
modelscope: 1.13.3
orjson: 3.9.15
packaging: 23.2
psutil: 5.9.6
pydantic: 2.9.2
multipart: 0.0.9
zmq: 26.0.3
uvicorn: 0.29.0
uvloop: 0.19.0
vllm: 0.6.1.post2
xgrammar: Module Not Found
openai: 1.47.1
anthropic: 0.25.8
litellm: Module Not Found
decord: Module Not Found
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  GPU8  GPU9  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NV12  PXB   PXB   PXB   SYS   SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU1  NV12  X     PIX   PXB   PXB   SYS   SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU2  PXB   PIX   X     PXB   PXB   SYS   SYS   NV12  SYS   SYS   0-31,64-95    0              N/A
GPU3  PXB   PXB   PXB   X     NV12  SYS   SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU4  PXB   PXB   PXB   NV12  X     SYS   SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU5  SYS   SYS   SYS   SYS   SYS   X     NV12  PXB   PXB   PXB   32-63,96-127  1              N/A
GPU6  SYS   SYS   SYS   SYS   SYS   NV12  X     PIX   PXB   PXB   32-63,96-127  1              N/A
GPU7  SYS   SYS   NV12  SYS   SYS   PXB   PIX   X     PXB   PXB   32-63,96-127  1              N/A
GPU8  SYS   SYS   SYS   SYS   SYS   PXB   PXB   PXB   X     NV12  32-63,96-127  1              N/A
GPU9  SYS   SYS   SYS   SYS   SYS   PXB   PXB   PXB   NV12  X     32-63,96-127  1              N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

ulimit soft: 655350

@merrymercy
Contributor

Try reducing --mem-fraction-static.

@CSammyfd
Author

@merrymercy
I reduced mem-fraction-static from 0.9 to 0.3 and it didn't help.
For comparison, when I set return_logprob to False and kept mem-fraction-static=0.95, the server did not run into OOM under the same high-concurrency load.
