This repository has been archived by the owner on Oct 11, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Upstream sync 2024 05 25 (#249) SUMMARY: Merge commits from vllm-project@c7f2cf2 to vllm-project@f68470e Note that vllm-project@c7f2cf2 is NOT included in this merge. --- <details> <!-- inside this <details> section, markdown rendering does not work, so we use raw html here. --> <summary><b> PR Checklist (Click to Expand) </b></summary> <p>Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.</p> <h3>PR Title and Classification</h3> <p>Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:</p> <ul> <li><code>[Bugfix]</code> for bug fixes.</li> <li><code>[CI/Build]</code> for build or continuous integration improvements.</li> <li><code>[Doc]</code> for documentation fixes and improvements.</li> <li><code>[Model]</code> for adding a new model or improving an existing model. Model name should appear in the title.</li> <li><code>[Frontend]</code> For changes on the vLLM frontend (e.g., OpenAI API server, <code>LLM</code> class, etc.) </li> <li><code>[Kernel]</code> for changes affecting CUDA kernels or other compute kernels.</li> <li><code>[Core]</code> for changes in the core vLLM logic (e.g., <code>LLMEngine</code>, <code>AsyncLLMEngine</code>, <code>Scheduler</code>, etc.)</li> <li><code>[Hardware][Vendor]</code> for hardware-specific changes. Vendor name should appear in the prefix (e.g., <code>[Hardware][AMD]</code>).</li> <li><code>[Misc]</code> for PRs that do not fit the above categories. Please use this sparingly.</li> </ul> <p><strong>Note:</strong> If the PR spans more than one category, please include all relevant prefixes.</p> <h3>Code Quality</h3> <p>The PR need to meet the following code quality standards:</p> <ul> <li>We adhere to <a href="https://google.github.io/styleguide/pyguide.html">Google Python style guide</a> and <a href="https://google.github.io/styleguide/cppguide.html">Google C++ style guide</a>.</li> <li>Pass all linter checks. Please use <a href="https://github.com/vllm-project/vllm/blob/main/format.sh"><code>format.sh</code></a> to format your code.</li> <li>The code need to be well-documented to ensure future contributors can easily understand the code.</li> <li>Include sufficient tests to ensure the project to stay correct and robust. This includes both unit tests and integration tests.</li> <li>Please add documentation to <code>docs/source/</code> if the PR modifies the user-facing behaviors of vLLM. It helps vLLM user understand and utilize the new features or changes.</li> </ul> <h3>Notes for Large Changes</h3> <p>Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with <code>rfc-required</code> and might not go through the PR.</p> <h3>What to Expect for the Reviews</h3> <p>The goal of the vLLM team is to be a <i>transparent reviewing machine</i>. We would like to make the review process transparent and efficient and make sure no contributor feel confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process: </p> <ul> <li> After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.</li> <li> After the PR is assigned, the reviewer will provide status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.</li> <li> After the review, the reviewer will put an <code> action-required</code> label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.</li> <li> Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion. </li> </ul> <h3>Thank You</h3> <p> Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone! </p> </details> --------- Signed-off-by: kerthcet <[email protected]> Co-authored-by: zhaoyang-star <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Cade Daniel <[email protected]> Co-authored-by: Noam Gat <[email protected]> Co-authored-by: Philipp Moritz <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: Austin Veselka <[email protected]> Co-authored-by: leiwen83 <[email protected]> Co-authored-by: Lei Wen <[email protected]> Co-authored-by: Cody Yu <[email protected]> Co-authored-by: SangBin Cho <[email protected]> Co-authored-by: DefTruth <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: Antoni Baum <[email protected]> Co-authored-by: alexm-nm <[email protected]> Co-authored-by: Mahmoud Ashraf <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: kliuae <[email protected]> Co-authored-by: miloice <[email protected]> Co-authored-by: Hao Zhang <[email protected]> Co-authored-by: Dash Desai <[email protected]> Co-authored-by: Aurick Qiao <[email protected]> Co-authored-by: Aurick Qiao <[email protected]> Co-authored-by: Aurick Qiao <[email protected]> Co-authored-by: Allen.Dou <[email protected]> Co-authored-by: Kunshang Ji <[email protected]> Co-authored-by: Steve Grubb <[email protected]> Co-authored-by: heeju-kim2 <[email protected]> Co-authored-by: Chang Su <[email protected]> Co-authored-by: Yikang Shen <[email protected]> Co-authored-by: Swapnil Parekh <[email protected]> Co-authored-by: Sanger Steel <[email protected]> Co-authored-by: Stephen Krider <[email protected]> Co-authored-by: LiuXiaoxuanPKU <[email protected]> Co-authored-by: Zhuohan Li <[email protected]> Co-authored-by: Kuntai Du <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: SAHIL SUNEJA <[email protected]> Co-authored-by: zifeitong <[email protected]> Co-authored-by: Alex Wu <[email protected]> Co-authored-by: Cade Daniel <[email protected]> Co-authored-by: Jinzhen Lin <[email protected]> Co-authored-by: Alex Wu <[email protected]> Co-authored-by: Pierre Dulac <[email protected]> Co-authored-by: Alexander Matveev <[email protected]> Co-authored-by: Hongxia Yang <[email protected]> Co-authored-by: Silencio <[email protected]> Co-authored-by: Silencio <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Kante Yin <[email protected]> Co-authored-by: bofeng huang <[email protected]> Co-authored-by: eigenLiu <[email protected]> Co-authored-by: alexeykondrat <[email protected]> Co-authored-by: Ubuntu <[email protected]> Co-authored-by: Domenic Barbuzzi <[email protected]>
- Loading branch information
fec3563
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bigger_is_better
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2.0737548282729463
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4249.123643131267
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3.409011020079548
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
443.17143261034124
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
49.425915520304486
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3212.6845088197915
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.9839046091034234
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
290.10755367617475
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
212.41516605933808
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2.4478319372827997
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
755.9557777512562
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
560.8048243833222
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
8.110949351685584
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4160.917017414704
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
6.777022315825469
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
881.012901057311
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
12.005617992926398
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1560.7303390804318
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
24.467710462959356
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3156.334649721757
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
24.18970475825256
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3120.47191381458
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2.001360152379693
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4100.786952225991
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3.7600839340665586
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3854.0860324182227
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
5.3909289544752665
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
700.8207640817847
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
47.68354528965047
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3099.4304438272807
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2.2317097461771
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
689.2114818793865
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
512.2874844284559
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.24076117500663205
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
31.298952750862167
tokens/s{"name": "request_throughput", "description": "VLLM Engine throughput - Sparse (with dataset)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
5.760386955245343
prompts/s{"name": "token_throughput", "description": "VLLM Engine throughput - Sparse (with dataset)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2693.326524794513
tokens/s{"name": "request_throughput", "description": "VLLM Engine throughput - 2:4 Sparse (with dataset)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
5.7688721055275005
prompts/s{"name": "token_throughput", "description": "VLLM Engine throughput - 2:4 Sparse (with dataset)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2697.2938416604384
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
7.482753413234599
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3838.6525009893494
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
10.643209396798829
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1383.6172215838478
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3.7619252796686458
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3855.9734116603618
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
10.48138809173603
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1362.5804519256837
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4.054618484357738
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4155.983946466682
tokens/s{"name": "request_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2.303551043568131
prompts/s{"name": "input_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
711.3979902816674
tokens/s{"name": "output_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
461.2937749779968
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3.7634225711787357
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3857.5081354582044
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1.801382309359772
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
234.17970021677039
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
13.92541738641096
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3578.8322683076167
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.46406416959294255
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
122.67072259019844
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
115.10028910357222
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
7.485233655042069
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3839.924865036581
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.24061529216630784
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
31.27998798162002
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
15.572015486541934
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4002.007980041277
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
14.261056511482876
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3665.091523451099
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
6.488597325839783
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
843.5176523591718
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
9.9663761626642
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1295.6289011463462
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
11.723221360700387
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1524.0187768910505
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.7611788565110156
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
98.95325134643203
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
26.10767148263572
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1696.998646371322
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1.8035648075244035
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
234.46342497817247
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1.018507334458996
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
132.4059534796695
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
24.48509258168909
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3158.5769430378928
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.49190563133344684
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
130.03033458668335
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
121.99259657069481
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
6.778380527731943
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
881.1894686051526
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2.0394265863400007
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4178.785075410661
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1.800964755162785
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
234.12541817116204
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.24071980892303801
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
31.29357515999494
tokens/s{"name": "request_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.47503736595213275
prompts/s{"name": "input_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
125.57137731578676
tokens/s{"name": "output_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
99.44115527264645
tokens/s{"name": "request_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3.2383631716438455
prompts/s{"name": "token_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1526.4542947650727
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
8.054568931292382
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4131.9938617529915
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
28.95521254051672
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3735.2224177266567
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.9148041387050826
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
118.92453803166073
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4.468047229100103
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1415.8794864295314
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1021.09772675701
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.9137033569406621
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
118.78143640228608
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2.45310324945175
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
757.5836995173523
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
563.2226936611239
tokens/s{"name": "request_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
5.763501051583438
prompts/s{"name": "token_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2694.7825516783523
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2.796880131115109
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
363.59441704496413
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
28.732493074833176
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3706.4916066534797
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3.5825356745967367
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1135.2697299229599
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
819.2709765665965
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
6.7729068231492615
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
880.4778870094041
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.9838451660646144
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
290.09002669803846
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
212.35314064338635
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.722598294088214
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
93.9377782314678
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
7.844609518894351
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4024.284683192802
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2.06421982745743
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4229.586426460274
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.49190931619773914
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
130.03130864371036
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
122.0263043714525
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
24.088141013062444
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1565.7291658490587
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.9839812919439876
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
290.13016386732454
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
212.41860136722823
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
23.67526420447888
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3054.109082377775
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4.164145328681783
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4268.248961898827
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2.9294087791973236
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
380.8231412956521
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.9682411901603802
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
288.7069306113883
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
210.0792910290977
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.2569725363697384
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
33.40642972806599
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
15.836319601059488
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4069.9341374722885
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.9476731573209913
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
279.4245893466187
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
204.5678866498336
tokens/s{"name": "request_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
7.364814804763553
prompts/s{"name": "token_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3443.492810115247
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
26.18057918319307
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1701.7376469075496
tokens/s{"name": "request_throughput", "description": "VLLM Engine throughput - synthetic\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3.8568163241392224
prompts/s{"name": "token_throughput", "description": "VLLM Engine throughput - synthetic\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1481.0174684694614
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2.3587430760558656
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
728.4427617014128
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
541.328390964053
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1.9917095136303464
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
258.92223677194505
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
17.06325742782891
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2218.223465617758
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.4662566113130598
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
126.08200445386888
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
118.14631692932059
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2.040568251732798
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4181.124347800503
tokens/s{"name": "request_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
7.171179206337352
prompts/s{"name": "token_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3352.9565497150925
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
19.960383180376216
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2594.8498134489078
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
25.564807771210155
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1661.71250512866
tokens/s{"name": "request_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4.296001301308167
prompts/s{"name": "input_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1361.359852371545
tokens/s{"name": "output_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
858.1577639458493
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3.407268153966697
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
442.9448600156706
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3.873385234131611
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3970.219864984901
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
7.492620536692848
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3843.714335323431
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1.7130319623673949
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
558.6836761212692
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
389.3561767477934
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
21.103234859219995
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2743.4205316985995
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
13.923423426897156
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3578.319820712569
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
17.724458048825532
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2304.1795463473195
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3.529499303696411
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1118.4630343483557
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
806.8011868333552
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
0.9144514774798328
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
118.87869207237826
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
12.006478920395466
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1560.8422596514106
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1.8539691473119937
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
589.0628531548713
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
415.1123439391762
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3.409341889682664
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
443.2144456587463
tokens/s{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4.542116862078641
prompts/s{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1439.3514124241005
tokens/s{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1038.1765107757747
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2.0402672822810786
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
4180.50766139393
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
5.653707334588195
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
734.9819534964654
tokens/s{"name": "request_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
13.970903916114668
prompts/s{"name": "token_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3590.52230644147
tokens/s{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3.7456710255742784
prompts/s{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
486.93723332465623
tokens/sThis comment was automatically generated by workflow using github-action-benchmark.
fec3563
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
smaller_is_better
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1826.7876810000416
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
92.30180574333039
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
57.09178000006432
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
12.494657948635467
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
11.822835040136072
ms{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2577.4123235000843
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
120.01311407067139
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
83.47934300036286
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
18.917851642837732
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
17.005374696378382
ms{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
7477.170930999932
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
155.60422404133442
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
126.69303699999546
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
56.154255656328566
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
55.70570047176915
ms{"name": "median_request_latency", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
5792.872616500063
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
119.60485665071124
ms{"name": "median_ttft_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
88.6595430001762
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
52.57606963795908
ms{"name": "median_tpot_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
45.77184932528473
ms{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
6359.727259500005
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
111.62061250000306
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
71.44836100002294
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
36.14834417896245
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
36.866016742486394
ms{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2027.7279659999294
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
81.131375346716
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
38.80655000011757
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
11.574741119740576
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
11.62783674047078
ms{"name": "median_request_latency", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
5086.0681250005655
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
90.60802515331791
ms{"name": "median_ttft_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
63.84449399956793
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
36.191637028120056
ms{"name": "median_tpot_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
30.83664939242367
ms{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
13438.462082000115
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
207.69161295667195
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
162.68896749988926
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
104.1323195408694
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
107.03430961238783
ms{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
2410.390771999573
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
116.01512438667001
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
80.54676449955878
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
17.631069397802065
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
16.065749860137487
ms{"name": "median_request_latency", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
58115.24761200053
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
24693.190178945297
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
22696.127297999737
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
182.07596480055477
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
192.91163983201352
ms{"name": "median_request_latency", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
3638.090063499476
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
141.89584527663706
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
92.82084300048155
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
24.30408247226253
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
23.61374608368908
ms{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1942.1043275001466
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
80.1003962000262
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
38.76818800017645
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
11.053056741675453
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
11.178593173806847
ms{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
1919.03614299963
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
95.22951738337119
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
59.753891499440215
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
13.171030743815644
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
12.444907479944426
ms{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
6011.056845000212
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
119.51665396332828
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
76.40881049997006
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
39.447607174651544
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
39.04213773936851
ms{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
6010.300804999929
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
126.94023836000534
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
86.4834220000148
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
40.24737687956218
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
40.325393796500656
ms{"name": "median_request_latency", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
5231.110777500362
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
180.19965554800243
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
169.80007999973168
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
38.8138107661964
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
36.10993344656062
ms{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
6102.600778999886
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
105.56953884001207
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
66.45767050031282
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
34.50484903749174
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
35.01785497536404
ms{"name": "median_request_latency", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
13024.152943000445
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
166.09906876733658
ms{"name": "median_ttft_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
130.501374999767
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
116.45557543230827
ms{"name": "median_tpot_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
104.93449016074172
ms{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
73612.5902235001
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
54508.354054376
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
66708.06425700017
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
67.84795469274688
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
64.49598703678805
ms{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
57355.77882799987
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
35571.797603253995
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
35350.50049350002
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
96.58072383981926
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
98.55851647084337
ms{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
249174.2924559994
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
235009.30902119138
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
236050.11426999955
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
68.79708311294138
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
65.43778767154
ms{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
11550.040795000314
ms{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
191.73528062333148
ms{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
148.74351900016336
ms{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
87.77569215848233
ms{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"}
89.93932269039736
msThis comment was automatically generated by workflow using github-action-benchmark.