This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Upstream sync 2024 05 19 (#249)

SUMMARY:
Merge commits from
vllm-project@c7f2cf2
to
vllm-project@f68470e

Note that
vllm-project@c7f2cf2
is NOT included in this merge.

---

<details>
<!-- inside this <details> section, markdown rendering does not work, so
we use raw html here. -->
<summary><b> PR Checklist (Click to Expand) </b></summary>

<p>Thank you for your contribution to vLLM! Before submitting the pull
request, please ensure the PR meets the following criteria. This helps
vLLM maintain the code quality and improve the efficiency of the review
process.</p>

<h3>PR Title and Classification</h3>
<p>Only specific types of PRs will be reviewed. The PR title should be
prefixed appropriately to indicate the type of change. Please use one of
the following:</p>
<ul>
<li><code>[Bugfix]</code> for bug fixes.</li>
<li><code>[CI/Build]</code> for build or continuous integration
improvements.</li>
<li><code>[Doc]</code> for documentation fixes and improvements.</li>
<li><code>[Model]</code> for adding a new model or improving an existing
model. Model name should appear in the title.</li>
<li><code>[Frontend]</code> for changes to the vLLM frontend (e.g., the
OpenAI API server, the <code>LLM</code> class, etc.).</li>
<li><code>[Kernel]</code> for changes affecting CUDA kernels or other
compute kernels.</li>
<li><code>[Core]</code> for changes in the core vLLM logic (e.g.,
<code>LLMEngine</code>, <code>AsyncLLMEngine</code>,
<code>Scheduler</code>, etc.)</li>
<li><code>[Hardware][Vendor]</code> for hardware-specific changes.
Vendor name should appear in the prefix (e.g.,
<code>[Hardware][AMD]</code>).</li>
<li><code>[Misc]</code> for PRs that do not fit the above categories.
Please use this sparingly.</li>
</ul>
<p><strong>Note:</strong> If the PR spans more than one category, please
include all relevant prefixes.</p>

<h3>Code Quality</h3>

<p>The PR needs to meet the following code quality standards:</p>

<ul>
<li>We adhere to the <a
href="https://google.github.io/styleguide/pyguide.html">Google Python
style guide</a> and the <a
href="https://google.github.io/styleguide/cppguide.html">Google C++
style guide</a>.</li>
<li>Pass all linter checks. Please use <a
href="https://github.com/vllm-project/vllm/blob/main/format.sh"><code>format.sh</code></a>
to format your code.</li>
<li>The code needs to be well-documented to ensure future contributors
can easily understand it.</li>
<li>Include sufficient tests to ensure the project stays correct and
robust. This includes both unit tests and integration tests.</li>
<li>Please add documentation to <code>docs/source/</code> if the PR
modifies user-facing behavior of vLLM. This helps vLLM users
understand and utilize the new features or changes.</li>
</ul>

<h3>Notes for Large Changes</h3>
<p>Please keep the changes as concise as possible. For major
architectural changes (>500 LOC excluding kernel/data/config/test), we
expect a GitHub issue (RFC) discussing the technical design and
justification. Otherwise, we will tag the PR with
<code>rfc-required</code> and may not proceed with the review.</p>

<h3>What to Expect for the Reviews</h3>

<p>The goal of the vLLM team is to be a <i>transparent reviewing
machine</i>. We would like to make the review process transparent and
efficient and to make sure no contributor feels confused or frustrated.
However, the vLLM team is small, so we need to prioritize some PRs over
others. Here is what you can expect from the review process: </p>

<ul>
<li> After the PR is submitted, it will be assigned to a reviewer. Each
reviewer picks up PRs based on their expertise and
availability.</li>
<li> After the PR is assigned, the reviewer will provide a status
update every 2-3 days. If the PR is not reviewed within 7 days, please feel
free to ping the reviewer or the vLLM team.</li>
<li> After the review, the reviewer will put an <code>
action-required</code> label on the PR if there are changes required.
The contributor should address the comments and ping the reviewer to
re-review the PR.</li>
<li> Please respond to all comments within a reasonable time frame. If a
comment isn't clear or you disagree with a suggestion, feel free to ask
for clarification or discuss the suggestion.
 </li>
</ul>

<h3>Thank You</h3>

<p> Finally, thank you for taking the time to read these guidelines and
for your interest in contributing to vLLM. Your contributions make vLLM
a great tool for everyone! </p>


</details>

---------

Signed-off-by: kerthcet <[email protected]>
Co-authored-by: zhaoyang-star <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Cade Daniel <[email protected]>
Co-authored-by: Noam Gat <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: Austin Veselka <[email protected]>
Co-authored-by: leiwen83 <[email protected]>
Co-authored-by: Lei Wen <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: DefTruth <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: Antoni Baum <[email protected]>
Co-authored-by: alexm-nm <[email protected]>
Co-authored-by: Mahmoud Ashraf <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: kliuae <[email protected]>
Co-authored-by: miloice <[email protected]>
Co-authored-by: Hao Zhang <[email protected]>
Co-authored-by: Dash Desai <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Co-authored-by: Allen.Dou <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Co-authored-by: Steve Grubb <[email protected]>
Co-authored-by: heeju-kim2 <[email protected]>
Co-authored-by: Chang Su <[email protected]>
Co-authored-by: Yikang Shen <[email protected]>
Co-authored-by: Swapnil Parekh <[email protected]>
Co-authored-by: Sanger Steel <[email protected]>
Co-authored-by: Stephen Krider <[email protected]>
Co-authored-by: LiuXiaoxuanPKU <[email protected]>
Co-authored-by: Zhuohan Li <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: SAHIL SUNEJA <[email protected]>
Co-authored-by: zifeitong <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Cade Daniel <[email protected]>
Co-authored-by: Jinzhen Lin <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Pierre Dulac <[email protected]>
Co-authored-by: Alexander Matveev <[email protected]>
Co-authored-by: Hongxia Yang <[email protected]>
Co-authored-by: Silencio <[email protected]>
Co-authored-by: Silencio <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Kante Yin <[email protected]>
Co-authored-by: bofeng huang <[email protected]>
Co-authored-by: eigenLiu <[email protected]>
Co-authored-by: alexeykondrat <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Domenic Barbuzzi <[email protected]>
Showing 303 changed files with 13,409 additions and 4,737 deletions.
2 changes: 1 addition & 1 deletion .buildkite/check-wheel-size.py
@@ -1,7 +1,7 @@
 import os
 import zipfile

-MAX_SIZE_MB = 100
+MAX_SIZE_MB = 150


 def print_top_10_largest_files(zip_file):
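The hunk above raises the wheel-size ceiling from 100 MB to 150 MB. As an illustration of what a check like `.buildkite/check-wheel-size.py` does (the helper names below are made up for this sketch, not the actual script), a wheel is just a zip archive whose on-disk size and largest members can be inspected directly:

```python
import os
import zipfile

MAX_SIZE_MB = 150  # the new ceiling from the diff above


def wheel_size_mb(path):
    """Size of the wheel file on disk, in MiB."""
    return os.path.getsize(path) / (1024 * 1024)


def top_largest_files(wheel_path, n=10):
    """Largest members inside the wheel, by uncompressed size."""
    with zipfile.ZipFile(wheel_path) as zf:
        infos = sorted(zf.infolist(), key=lambda i: i.file_size, reverse=True)
        return [(i.filename, i.file_size) for i in infos[:n]]
```

Printing the top-10 largest files on failure, as the real script does, makes it obvious which artifact (usually a compiled kernel library) pushed the wheel over the limit.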
13 changes: 7 additions & 6 deletions .buildkite/run-amd-test.sh
@@ -1,4 +1,4 @@
-# This script build the ROCm docker image and runs test inside it.
+# This script runs test inside the corresponding ROCm docker container.
 set -ex

 # Print ROCm version
@@ -19,15 +19,16 @@ done

 echo "--- Building container"
 sha=$(git rev-parse --short HEAD)
-container_name=rocm_${sha}
+image_name=rocm_${sha}
+container_name=rocm_${sha}_$(tr -dc A-Za-z0-9 < /dev/urandom | head -c 10; echo)
 docker build \
-        -t ${container_name} \
+        -t ${image_name} \
         -f Dockerfile.rocm \
         --progress plain \
         .

 remove_docker_container() {
-   docker rm -f ${container_name} || docker image rm -f ${container_name} || true
+   docker rm -f ${container_name} || docker image rm -f ${image_name} || true
 }
 trap remove_docker_container EXIT

@@ -39,6 +40,6 @@ docker run \
         --rm \
         -e HF_TOKEN \
         --name ${container_name} \
-        ${container_name} \
-        /bin/bash -c $(echo $1 | sed "s/^'//" | sed "s/'$//")
+        ${image_name} \
+        /bin/bash -c "${@}"

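The last hunk replaces the fragile sed-based quote stripping with /bin/bash -c "${@}", which forwards the caller's arguments to the container verbatim instead of re-splitting them. The failure mode being avoided can be sketched in Python (a loose analogy for illustration, not part of the change itself):

```python
import subprocess

# A command string like the ones the CI wrapper forwards to the container.
cmd = 'cd /tmp ; echo "hello world"'

# Passing the whole string as a single argument to `bash -c` (what the
# quoted "${@}" style achieves) preserves the internal quoting:
out = subprocess.run(["bash", "-c", cmd], capture_output=True, text=True)
print(out.stdout.strip())

# Naive whitespace re-splitting (the hazard of unquoted expansion plus
# sed surgery) breaks the command into words and loses the quoting:
print(cmd.split())
```

The first call runs the command intact; the second shows how re-splitting would mangle any argument containing spaces or quotes.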
7 changes: 4 additions & 3 deletions .buildkite/run-benchmarks.sh
@@ -9,10 +9,10 @@ cd "$(dirname "${BASH_SOURCE[0]}")/.."
 (which wget && which curl) || (apt-get update && apt-get install -y wget curl)

 # run python-based benchmarks and upload the result to buildkite
-python3 benchmarks/benchmark_latency.py 2>&1 | tee benchmark_latency.txt
+python3 benchmarks/benchmark_latency.py --output-json latency_results.json 2>&1 | tee benchmark_latency.txt
 bench_latency_exit_code=$?

-python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 2>&1 | tee benchmark_throughput.txt
+python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --output-json throughput_results.json 2>&1 | tee benchmark_throughput.txt
 bench_throughput_exit_code=$?

 # run server-based benchmarks and upload the result to buildkite
@@ -74,4 +74,5 @@ if [ $bench_serving_exit_code -ne 0 ]; then
   exit $bench_serving_exit_code
 fi

-/workspace/buildkite-agent artifact upload openai-*.json
+rm ShareGPT_V3_unfiltered_cleaned_split.json
+/workspace/buildkite-agent artifact upload "*.json"
67 changes: 49 additions & 18 deletions .buildkite/test-pipeline.yaml
@@ -5,13 +5,16 @@

steps:
- label: Regression Test
mirror_hardwares: [amd]
command: pytest -v -s test_regression.py
working_dir: "/vllm-workspace/tests" # optional

- label: AsyncEngine Test
#mirror_hardwares: [amd]
command: pytest -v -s async_engine

- label: Basic Correctness Test
mirror_hardwares: [amd]
commands:
- VLLM_ATTENTION_BACKEND=XFORMERS pytest -v -s basic_correctness/test_basic_correctness.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s basic_correctness/test_basic_correctness.py
@@ -24,34 +27,40 @@ steps:
command: pytest -v -s core

- label: Distributed Comm Ops Test
command: pytest -v -s test_comm_ops.py
working_dir: "/vllm-workspace/tests/distributed"
#mirror_hardwares: [amd]
command: pytest -v -s distributed/test_comm_ops.py
working_dir: "/vllm-workspace/tests"
num_gpus: 2

- label: Distributed Tests
working_dir: "/vllm-workspace/tests/distributed"

num_gpus: 2 # only support 1 or 2 for now.
mirror_hardwares: [amd]

working_dir: "/vllm-workspace/tests"
num_gpus: 2
commands:
- pytest -v -s test_pynccl_library.py
- TEST_DIST_MODEL=facebook/opt-125m pytest -v -s test_basic_distributed_correctness.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf pytest -v -s test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m pytest -v -s test_chunked_prefill_distributed.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf pytest -v -s test_chunked_prefill_distributed.py
- pytest -v -s distributed/test_pynccl_library.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_chunked_prefill_distributed.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_chunked_prefill_distributed.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_chunked_prefill_distributed.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_chunked_prefill_distributed.py
- pytest -v -s spec_decode/e2e/test_integration_dist.py

- label: Distributed Tests (Multiple Groups)
working_dir: "/vllm-workspace/tests/distributed"
#mirror_hardwares: [amd]
working_dir: "/vllm-workspace/tests"
num_gpus: 4
commands:
- pytest -v -s test_pynccl.py
- pytest -v -s distributed/test_pynccl.py

- label: Engine Test
mirror_hardwares: [amd]
command: pytest -v -s engine tokenization test_sequence.py test_config.py test_logger.py

- label: Entrypoints Test
#mirror_hardwares: [amd]
commands:
# these tests have to be separated, because each one will allocate all posible GPU memory
- pytest -v -s entrypoints --ignore=entrypoints/test_server_oot_registration.py
@@ -62,21 +71,24 @@
mirror_hardwares: [amd]
commands:
# install aws cli for llava_example.py
- pip install awscli
# install tensorizer for tensorize_vllm_model.py
- pip install awscli tensorizer
- python3 offline_inference.py
- python3 offline_inference_with_prefix.py
- python3 llm_engine_example.py
- python3 llava_example.py
- python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors

- label: Kernels Test %N
#mirror_hardwares: [amd]
command: pytest -v -s kernels --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
parallelism: 4

- label: Models Test
mirror_hardwares: [amd]
#mirror_hardwares: [amd]
commands:
- bash ../.buildkite/download-images.sh
- pytest -v -s models --ignore=models/test_llava.py --ignore=models/test_mistral.py
- pytest -v -s models --ignore=models/test_llava.py

- label: Llava Test
mirror_hardwares: [amd]
@@ -90,6 +102,7 @@ steps:
- pytest -v -s prefix_caching

- label: Samplers Test
#mirror_hardwares: [amd]
command: pytest -v -s samplers

- label: LogitsProcessor Test
@@ -101,20 +114,38 @@
command: pytest -v -s worker

- label: Speculative decoding tests
mirror_hardwares: [amd]
#mirror_hardwares: [amd]
command: pytest -v -s spec_decode

- label: LoRA Test %N
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
#mirror_hardwares: [amd]
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_long_context.py
parallelism: 4

- label: LoRA Long Context (Distributed)
#mirror_hardwares: [amd]
num_gpus: 4
# This test runs llama 13B, so it is required to run on 4 GPUs.
commands:
# Temporarily run this way because we cannot clean up GPU mem usage
# for multi GPU tests.
# TODO(sang): Fix it.
- pytest -v -s lora/test_long_context.py::test_rotary_emb_replaced
- pytest -v -s lora/test_long_context.py::test_batched_rope_kernel
- pytest -v -s lora/test_long_context.py::test_self_consistency
- pytest -v -s lora/test_long_context.py::test_quality
- pytest -v -s lora/test_long_context.py::test_max_len

- label: Tensorizer Test
#mirror_hardwares: [amd]
command: apt-get install curl libsodium23 && pytest -v -s tensorizer_loader

- label: Metrics Test
mirror_hardwares: [amd]
command: pytest -v -s metrics

- label: Quantization Test
#mirror_hardwares: [amd]
command: pytest -v -s quantization

- label: Benchmarks
9 changes: 6 additions & 3 deletions .buildkite/test-template.j2
@@ -3,9 +3,8 @@
{% set default_working_dir = "/vllm-workspace/tests" %}

steps:

- label: ":docker: build image"
commands:
commands:
- "docker build --build-arg max_jobs=16 --tag {{ docker_image }} --target test --progress plain ."
- "docker push {{ docker_image }}"
env:
@@ -14,6 +13,8 @@ steps:
automatic:
- exit_status: -1 # Agent was lost
limit: 5
- exit_status: -10 # Agent was lost
limit: 5
- wait

- group: "AMD Tests"
@@ -24,7 +25,7 @@
- label: "AMD: {{ step.label }}"
agents:
queue: amd
command: bash .buildkite/run-amd-test.sh "'cd {{ (step.working_dir or default_working_dir) | safe }} && {{ step.command or (step.commands | join(' && ')) | safe }}'"
command: bash .buildkite/run-amd-test.sh "cd {{ (step.working_dir or default_working_dir) | safe }} ; {{ step.command or (step.commands | join(" ; ")) | safe }}"
env:
DOCKER_BUILDKIT: "1"
{% endif %}
@@ -53,6 +54,8 @@ steps:
automatic:
- exit_status: -1 # Agent was lost
limit: 5
- exit_status: -10 # Agent was lost
limit: 5
plugins:
- kubernetes:
podSpec:
8 changes: 8 additions & 0 deletions .github/actions/nm-install-whl/action.yml
@@ -25,5 +25,13 @@ runs:
BASE=$(./.github/scripts/convert-version ${{ inputs.python }})
WHL=$(find . -type f -iname "*${BASE}*.whl")
WHL_BASENAME=$(basename ${WHL})
echo "whl=${WHL_BASENAME}" >> "$GITHUB_OUTPUT"
pip3 install ${WHL}[sparse] --extra-index-url https://pypi.neuralmagic.com/simple
# report magic_wand version
MAGIC_WAND=$(pip3 show nm-magic-wand-nightly | grep "Version" | cut -d' ' -f2) || echo "nightly not installed"
if [ -z "${MAGIC_WAND}" ]; then
# if neither magic-wand nor magic-wand-nightly is installed stop here with error
MAGIC_WAND=$(pip3 show nm-magic-wand | grep "Version" | cut -d' ' -f2)
fi
echo "magic_wand=${MAGIC_WAND}" >> "$GITHUB_OUTPUT"
shell: bash
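The action above reports whichever of nm-magic-wand-nightly or nm-magic-wand is installed, trying the nightly package first and falling back to the stable one. The same try-each-candidate pattern can be sketched in Python with importlib.metadata (the package names match the action; the helper name is invented for this sketch):

```python
from importlib import metadata


def installed_version(*candidates):
    """Return (name, version) for the first installed candidate package."""
    for name in candidates:
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    # Mirrors the action: if neither nightly nor stable is found, stop with
    # an error instead of reporting an empty version.
    raise RuntimeError(f"none of {candidates} is installed")


# e.g. installed_version("nm-magic-wand-nightly", "nm-magic-wand")
```

Failing loudly when no candidate is present matches the shell snippet's behavior, where an empty MAGIC_WAND triggers a second `pip3 show` whose non-zero exit aborts the step.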
2 changes: 1 addition & 1 deletion .github/workflows/build-test.yml
@@ -19,7 +19,7 @@ on:
       build_timeout:
         description: "time limit for build in minutes "
         type: string
-        default: "60"
+        default: "120"
       Gi_per_thread:
         description: 'requested GiB to reserve per thread'
         type: string
6 changes: 3 additions & 3 deletions .github/workflows/nightly.yml
@@ -27,7 +27,7 @@ jobs:
     test_label_solo: aws-test-a10g-24G
     test_label_multi: aws-test-4-a10g-96G
     test_timeout: 480
-    test_skip_list: neuralmagic/tests/skip-for-remote-push-tmp.txt
+    test_skip_list: neuralmagic/tests/skip-for-nightly.txt

     benchmark_label: aws-test-a10g-24G
     benchmark_config_list_file: ./.github/data/nm_benchmark_nightly_configs_list.txt
@@ -45,7 +45,7 @@ jobs:
     test_label_solo: aws-test-a10g-24G
     test_label_multi: aws-test-4-a10g-96G
     test_timeout: 480
-    test_skip_list: neuralmagic/tests/skip-for-remote-push-tmp.txt
+    test_skip_list: neuralmagic/tests/skip-for-nightly.txt

     benchmark_label: aws-test-a10g-24G
     benchmark_config_list_file: ./.github/data/nm_benchmark_nightly_configs_list.txt
@@ -81,7 +81,7 @@ jobs:
     test_label_solo: aws-test-a10g-24G
     test_label_multi: aws-test-4-a10g-96G
     test_timeout: 480
-    test_skip_list: neuralmagic/tests/skip-for-remote-push-tmp.txt
+    test_skip_list: neuralmagic/tests/skip-for-nightly.txt

     benchmark_label: aws-test-a10g-24G
     benchmark_config_list_file: ./.github/data/nm_benchmark_nightly_configs_list.txt
3 changes: 3 additions & 0 deletions .github/workflows/publish.yml
@@ -58,6 +58,9 @@ jobs:

       - name: Setup ccache
         uses: hendrikmuhs/[email protected]
+        with:
+          create-symlink: true
+          key: ${{ github.job }}-${{ matrix.python-version }}-${{ matrix.cuda-version }}

       - name: Set up Linux Env
         if: ${{ runner.os == 'Linux' }}
8 changes: 4 additions & 4 deletions .github/workflows/remote-push.yml
@@ -20,7 +20,7 @@ jobs:
     test_label_solo: aws-test-a10g-24G
     test_label_multi: ignore
     test_timeout: 480
-    test_skip_list: neuralmagic/tests/skip-for-remote-push-tmp.txt
+    test_skip_list: neuralmagic/tests/skip-for-remote-push.txt

     benchmark_label: aws-test-a10g-24G
     benchmark_config_list_file: ./.github/data/nm_benchmark_remote_push_configs_list.txt
@@ -36,7 +36,7 @@
     test_label_solo: aws-test-a10g-24G
     test_label_multi: ignore
     test_timeout: 480
-    test_skip_list: neuralmagic/tests/skip-for-remote-push-tmp.txt
+    test_skip_list: neuralmagic/tests/skip-for-remote-push.txt

     benchmark_label: aws-test-a10g-24G
     benchmark_config_list_file: ./.github/data/nm_benchmark_remote_push_configs_list.txt
@@ -52,7 +52,7 @@
     test_label_solo: aws-test-a10g-24G
     test_label_multi: ignore
     test_timeout: 480
-    test_skip_list: neuralmagic/tests/skip-for-remote-push-tmp.txt
+    test_skip_list: neuralmagic/tests/skip-for-remote-push.txt

     benchmark_label: aws-test-a10g-24G
     benchmark_config_list_file: ./.github/data/nm_benchmark_remote_push_configs_list.txt
@@ -68,7 +68,7 @@
     test_label_solo: aws-test-a10g-24G
     test_label_multi: ignore
     test_timeout: 480
-    test_skip_list: neuralmagic/tests/skip-for-remote-push-tmp.txt
+    test_skip_list: neuralmagic/tests/skip-for-remote-push.txt

     benchmark_label: aws-test-a10g-24G
     benchmark_config_list_file: ./.github/data/nm_benchmark_remote_push_configs_list.txt

2 comments on commit fec3563

@github-actions

bigger_is_better

Benchmark suite Current: fec3563 Previous: 0194675 Ratio
All results below are "Current" values for commit fec3563, measured on NVIDIA A10G x 1 with vLLM 0.4.0, Python 3.11.4, torch 2.3.0+cu121, and max_model_len 4096.

- VLLM Engine prefill throughput - Dense (synthetic), model NousResearch/Llama-2-7b-chat-hf, input-len 2048, output-len 1, num-prompts 1: request_throughput 2.0737548282729463 prompts/s, token_throughput 4249.123643131267 tokens/s
- VLLM Engine decode throughput - Dense (synthetic), model teknium/OpenHermes-2.5-Mistral-7B, input-len 2, output-len 128, num-prompts 16: request_throughput 3.409011020079548 prompts/s, token_throughput 443.17143261034124 tokens/s
- VLLM Engine prefill throughput - Dense (synthetic), model neuralmagic/OpenHermes-2.5-Mistral-7B-marlin, input-len 64, output-len 1, num-prompts 1: request_throughput 49.425915520304486 prompts/s, token_throughput 3212.6845088197915 tokens/s
- VLLM Serving - Dense, model neuralmagic/OpenHermes-2.5-Mistral-7B-marlin, nr-qps-pair 300,1, dataset sharegpt: request_throughput 0.9839046091034234 prompts/s, input_throughput 290.10755367617475 tokens/s, output_throughput 212.41516605933808 tokens/s
- VLLM Serving - Dense, model TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ, nr-qps-pair 750,2.5, dataset sharegpt: request_throughput 2.4478319372827997 prompts/s, input_throughput 755.9557777512562 tokens/s, output_throughput 560.8048243833222 tokens/s
- VLLM Engine prefill throughput - Dense (synthetic), model neuralmagic/OpenHermes-2.5-Mistral-7B-marlin, input-len 512, output-len 1, num-prompts 1: request_throughput 8.110949351685584 prompts/s, token_throughput 4160.917017414704 tokens/s
- VLLM Engine decode throughput - Sparse (synthetic), model neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50, sparsity sparse_w16a16, input-len 2, output-len 128, num-prompts 32: request_throughput 6.777022315825469 prompts/s, token_throughput 881.012901057311 tokens/s
- VLLM Engine decode throughput - Sparse (synthetic), model neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50, sparsity sparse_w16a16, input-len 2, output-len 128, num-prompts 64: request_throughput 12.005617992926398 prompts/s, token_throughput 1560.7303390804318 tokens/s
- VLLM Engine prefill throughput - 2:4 Sparse (synthetic), model neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4, sparsity semi_structured_sparse_w16a16, input-len 128, output-len 1, num-prompts 1: request_throughput 24.467710462959356 prompts/s, token_throughput 3156.334649721757 tokens/s
- VLLM Engine prefill throughput - Dense (synthetic), model NousResearch/Llama-2-7b-chat-hf, input-len 128, output-len 1, num-prompts 1: request_throughput 24.18970475825256 prompts/s, token_throughput 3120.47191381458 tokens/s
- VLLM Engine prefill throughput - Dense (synthetic), model TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ, input-len 2048, output-len 1, num-prompts 1: request_throughput 2.001360152379693 prompts/s, token_throughput 4100.786952225991 tokens/s
- VLLM Engine prefill throughput - Sparse (synthetic), model neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50, sparsity sparse_w16a16, input-len 1024, output-len 1, num-prompts 1: request_throughput 3.7600839340665586 prompts/s, token_throughput 3854.0860324182227 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 5.3909289544752665 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 700.8207640817847 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 47.68354528965047 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3099.4304438272807 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2.2317097461771 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 689.2114818793865 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 512.2874844284559 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.24076117500663205 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 31.298952750862167 tokens/s
{"name": "request_throughput", "description": "VLLM Engine throughput - Sparse (with dataset)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 5.760386955245343 prompts/s
{"name": "token_throughput", "description": "VLLM Engine throughput - Sparse (with dataset)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2693.326524794513 tokens/s
{"name": "request_throughput", "description": "VLLM Engine throughput - 2:4 Sparse (with dataset)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 5.7688721055275005 prompts/s
{"name": "token_throughput", "description": "VLLM Engine throughput - 2:4 Sparse (with dataset)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2697.2938416604384 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 7.482753413234599 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3838.6525009893494 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 10.643209396798829 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1383.6172215838478 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3.7619252796686458 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3855.9734116603618 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 10.48138809173603 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1362.5804519256837 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 4.054618484357738 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 4155.983946466682 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2.303551043568131 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 711.3979902816674 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 461.2937749779968 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3.7634225711787357 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3857.5081354582044 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1.801382309359772 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 234.17970021677039 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 13.92541738641096 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3578.8322683076167 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.46406416959294255 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 122.67072259019844 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 115.10028910357222 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 7.485233655042069 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3839.924865036581 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.24061529216630784 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 31.27998798162002 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 15.572015486541934 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 4002.007980041277 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 14.261056511482876 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3665.091523451099 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 6.488597325839783 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 843.5176523591718 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 9.9663761626642 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1295.6289011463462 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 11.723221360700387 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1524.0187768910505 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.7611788565110156 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 98.95325134643203 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 26.10767148263572 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1696.998646371322 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1.8035648075244035 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 234.46342497817247 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1.018507334458996 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 132.4059534796695 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 24.48509258168909 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3158.5769430378928 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.49190563133344684 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 130.03033458668335 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 121.99259657069481 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 6.778380527731943 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 881.1894686051526 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2.0394265863400007 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 4178.785075410661 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1.800964755162785 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 234.12541817116204 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.24071980892303801 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 31.29357515999494 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.47503736595213275 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 125.57137731578676 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 99.44115527264645 tokens/s
{"name": "request_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3.2383631716438455 prompts/s
{"name": "token_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1526.4542947650727 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 8.054568931292382 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 4131.9938617529915 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 28.95521254051672 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3735.2224177266567 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.9148041387050826 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 118.92453803166073 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 4.468047229100103 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1415.8794864295314 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1021.09772675701 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.9137033569406621 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 118.78143640228608 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2.45310324945175 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 757.5836995173523 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 563.2226936611239 tokens/s
{"name": "request_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 5.763501051583438 prompts/s
{"name": "token_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2694.7825516783523 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2.796880131115109 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 363.59441704496413 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 28.732493074833176 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3706.4916066534797 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3.5825356745967367 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1135.2697299229599 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 819.2709765665965 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 6.7729068231492615 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 880.4778870094041 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.9838451660646144 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 290.09002669803846 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 212.35314064338635 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.722598294088214 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 93.9377782314678 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 7.844609518894351 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 4024.284683192802 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2.06421982745743 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 4229.586426460274 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.49190931619773914 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 130.03130864371036 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 122.0263043714525 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 24.088141013062444 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1565.7291658490587 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.9839812919439876 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 290.13016386732454 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 212.41860136722823 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 23.67526420447888 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3054.109082377775 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 4.164145328681783 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 4268.248961898827 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2.9294087791973236 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 380.8231412956521 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.9682411901603802 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 288.7069306113883 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 210.0792910290977 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.2569725363697384 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 33.40642972806599 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 15.836319601059488 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 4069.9341374722885 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.9476731573209913 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 279.4245893466187 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 204.5678866498336 tokens/s
{"name": "request_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 7.364814804763553 prompts/s
{"name": "token_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3443.492810115247 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 26.18057918319307 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1701.7376469075496 tokens/s
{"name": "request_throughput", "description": "VLLM Engine throughput - synthetic\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3.8568163241392224 prompts/s
{"name": "token_throughput", "description": "VLLM Engine throughput - synthetic\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1481.0174684694614 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2.3587430760558656 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 728.4427617014128 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 541.328390964053 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1.9917095136303464 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 258.92223677194505 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 17.06325742782891 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2218.223465617758 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.4662566113130598 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 126.08200445386888 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 118.14631692932059 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2.040568251732798 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 4181.124347800503 tokens/s
{"name": "request_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 7.171179206337352 prompts/s
{"name": "token_throughput", "description": "VLLM Engine throughput - Dense (with dataset)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"dataset\": \"sharegpt\",\n \"output-len\": 128,\n \"num-prompts\": 1000\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3352.9565497150925 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 19.960383180376216 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2594.8498134489078 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 25.564807771210155 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 64,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1661.71250512866 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 4.296001301308167 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1361.359852371545 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 858.1577639458493 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3.407268153966697 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 442.9448600156706 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3.873385234131611 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 1024,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3970.219864984901 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 7.492620536692848 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 512,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3843.714335323431 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1.7130319623673949 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 558.6836761212692 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 389.3561767477934 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 21.103234859219995 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2743.4205316985995 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 13.923423426897156 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - Dense (synthetic)\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3578.319820712569 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 17.724458048825532 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 32\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2304.1795463473195 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3.529499303696411 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1118.4630343483557 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 806.8011868333552 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 0.9144514774798328 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 118.87869207237826 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 12.006478920395466 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 64,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1560.8422596514106 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1.8539691473119937 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 589.0628531548713 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 415.1123439391762 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3.409341889682664 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16,\n \"sparsity\": \"sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 443.2144456587463 tokens/s
{"name": "request_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 4.542116862078641 prompts/s
{"name": "input_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1439.3514124241005 tokens/s
{"name": "output_throughput", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1038.1765107757747 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2.0402672822810786 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2048,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 4180.50766139393 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 5.653707334588195 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 8\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 734.9819534964654 tokens/s
{"name": "request_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 13.970903916114668 prompts/s
{"name": "token_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 256,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3590.52230644147 tokens/s
{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3.7456710255742784 prompts/s
{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 486.93723332465623 tokens/s

This comment was automatically generated by a workflow using github-action-benchmark.

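Each record line above follows the github-action-benchmark "customSmallerIsBetter"/"customBiggerIsBetter" text format: a JSON metadata object, then the measured value, then its unit. A minimal sketch of parsing one such line (the helper name `parse_benchmark_line` and the sample line are illustrative, not part of the workflow):

```python
import json

def parse_benchmark_line(line: str):
    """Split one benchmark record into (metadata, value, unit).

    Each line is a JSON object followed by a numeric value and a unit,
    e.g. '{"name": "token_throughput", ...} 4180.5 tokens/s'.
    """
    decoder = json.JSONDecoder()
    # raw_decode parses the leading JSON object and returns where it ends,
    # so trailing non-JSON text on the same line is tolerated.
    meta, end = decoder.raw_decode(line)
    value_str, unit = line[end:].split()
    return meta, float(value_str), unit

sample = '{"name": "request_throughput", "description": "example"} 2.04 prompts/s'
meta, value, unit = parse_benchmark_line(sample)
```

Here `meta["name"]` identifies the metric, and the description field carries the model, `max_model_len`, and benchmark arguments as embedded newlines.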
@github-actions
smaller_is_better

Benchmark suite Current: fec3563 Previous: 0194675 Ratio
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1826.7876810000416 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 92.30180574333039 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 57.09178000006432 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 12.494657948635467 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 11.822835040136072 ms
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2577.4123235000843 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 120.01311407067139 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 83.47934300036286 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 18.917851642837732 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 17.005374696378382 ms
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 7477.170930999932 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 155.60422404133442 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 126.69303699999546 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 56.154255656328566 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 55.70570047176915 ms
{"name": "median_request_latency", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 5792.872616500063 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 119.60485665071124 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 88.6595430001762 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 52.57606963795908 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 45.77184932528473 ms
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 6359.727259500005 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 111.62061250000306 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 71.44836100002294 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 36.14834417896245 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 36.866016742486394 ms
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2027.7279659999294 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 81.131375346716 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 38.80655000011757 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 11.574741119740576 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 11.62783674047078 ms
{"name": "median_request_latency", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 5086.0681250005655 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 90.60802515331791 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 63.84449399956793 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 36.191637028120056 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 30.83664939242367 ms
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 13438.462082000115 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 207.69161295667195 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 162.68896749988926 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 104.1323195408694 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 107.03430961238783 ms
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 2410.390771999573 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 116.01512438667001 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 80.54676449955878 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 17.631069397802065 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 16.065749860137487 ms
{"name": "median_request_latency", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 58115.24761200053 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 24693.190178945297 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 22696.127297999737 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 182.07596480055477 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 192.91163983201352 ms
{"name": "median_request_latency", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 3638.090063499476 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 141.89584527663706 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 92.82084300048155 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 24.30408247226253 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 23.61374608368908 ms
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1942.1043275001466 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 80.1003962000262 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 38.76818800017645 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 11.053056741675453 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 11.178593173806847 ms
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 1919.03614299963 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 95.22951738337119 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 59.753891499440215 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 13.171030743815644 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 12.444907479944426 ms
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 6011.056845000212 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 119.51665396332828 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 76.40881049997006 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 39.447607174651544 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 39.04213773936851 ms
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 6010.300804999929 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 126.94023836000534 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 86.4834220000148 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 40.24737687956218 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 40.325393796500656 ms
{"name": "median_request_latency", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 5231.110777500362 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 180.19965554800243 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 169.80007999973168 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 38.8138107661964 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 36.10993344656062 ms
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 6102.600778999886 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 105.56953884001207 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 66.45767050031282 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 34.50484903749174 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"150,0.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 35.01785497536404 ms
{"name": "median_request_latency", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 13024.152943000445 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 166.09906876733658 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 130.501374999767 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 116.45557543230827 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 104.93449016074172 ms
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 73612.5902235001 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 54508.354054376 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 66708.06425700017 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 67.84795469274688 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 64.49598703678805 ms
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 57355.77882799987 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 35571.797603253995 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 35350.50049350002 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 96.58072383981926 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - teknium/OpenHermes-2.5-Mistral-7B\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 98.55851647084337 ms
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 249174.2924559994 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 235009.30902119138 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 236050.11426999955 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 68.79708311294138 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 65.43778767154 ms
{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 11550.040795000314 ms
{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 191.73528062333148 ms
{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 148.74351900016336 ms
{"name": "mean_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 87.77569215848233 ms
{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.4.0", "python_version": "3.11.4 (main, Jun 7 2023, 10:57:56) [GCC 9.4.0]", "torch_version": "2.3.0+cu121"} 89.93932269039736 ms

This comment was automatically generated by a workflow using github-action-benchmark.
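Each result line above pairs a JSON metadata object with a trailing measurement in milliseconds. A minimal sketch of how such a line could be split back into its parts (the helper name is hypothetical; this is not part of github-action-benchmark itself):

```python
import json


def parse_benchmark_line(line: str):
    """Split one benchmark result line into its JSON metadata object
    and the trailing numeric value (assumed to be in milliseconds)."""
    # raw_decode parses the leading JSON object and reports where it ends,
    # leaving the numeric tail untouched.
    meta, end = json.JSONDecoder().raw_decode(line)
    tail = line[end:].strip()
    if not tail.endswith("ms"):
        raise ValueError(f"unexpected unit in: {tail!r}")
    value_ms = float(tail[:-2].strip())
    return meta, value_ms


# Example with an abbreviated description field:
line = ('{"name": "median_tpot_ms", "description": "VLLM Serving - Dense"}'
        ' 12.444907479944426 ms')
meta, value = parse_benchmark_line(line)
```

After parsing, `meta["name"]` identifies the metric and `value` holds the measurement, which makes it straightforward to group results by model or metric.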