[Question]: How to apply MInference on multiple A100 GPUs? #95

Open · XiongxiaoL opened this issue Dec 13, 2024 · 1 comment
Labels: question (Further information is requested)

XiongxiaoL commented Dec 13, 2024
The paper mentions that 'this latency can be reduced to 22 seconds on 8x A100 GPUs'. How is this achieved, and does the current version already support it?

iofu728 (Contributor) commented Dec 15, 2024

Hi @XiongxiaoL, thanks for your interest in our work.

We implement vLLM with tensor parallelism (TP) in the hjiang/support_vllm_tp branch. To use it:

  1. Switch to the hjiang/support_vllm_tp branch.
  2. Run pip install -e .
  3. Copy minference_patch_vllm_tp and minference_patch_vllm_executor from minference/patch.py to the end of the Worker class in vllm/worker/worker.py. Make sure to indent minference_patch_vllm_tp so it sits inside the class.
  4. When calling vLLM, make sure enable_chunked_prefill=False is set (see the sketch after the script below).
  5. Refer to the script at https://github.com/microsoft/MInference/blob/hjiang/support_vllm_tp/experiments/benchmarks/run_e2e_vllm_tp.sh:
```bash
wget https://raw.githubusercontent.com/FranxYao/chain-of-thought-hub/main/gsm8k/lib_prompt/prompt_hardest.txt

VLLM_WORKER_MULTIPROC_METHOD=spawn python experiments/benchmarks/benchmark_e2e_vllm_tp.py \
    --attn_type minference \
    --context_window 500_000 --tensor_parallel_size 4
```
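
For step 4, here is a minimal Python sketch of the offline-inference path, following the usage pattern in the MInference README; the model name, prompt, and max_model_len below are placeholders, and it assumes the MInference("vllm", ...) patch entry point works the same way on the TP branch:

```python
# Minimal sketch (see assumptions above): vLLM offline inference with the
# MInference patch applied and chunked prefill disabled.
from vllm import LLM, SamplingParams
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder long-context model

# Step 4: chunked prefill must be off; tensor_parallel_size shards the model
# across 4 GPUs, matching --tensor_parallel_size 4 in the script above.
llm = LLM(
    model_name,
    max_num_seqs=1,
    enable_chunked_prefill=False,
    max_model_len=128_000,
    tensor_parallel_size=4,
)

# Apply the MInference sparse-attention patch to the vLLM engine.
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

outputs = llm.generate(
    ["<your long-context prompt here>"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

With tensor_parallel_size > 1, launch it under VLLM_WORKER_MULTIPROC_METHOD=spawn, as in the benchmark script above.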
