Commit b3204e2

Merge branch 'main' into dev/kpietkun/sleep_mode
2 parents 776613d + 49593fa commit b3204e2


54 files changed (+2083 / -577 lines)

.cd/benchmark/benchmark_defaults.yaml

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ model_text:
   - Qwen/Qwen2.5-32B-Instruct
   - Qwen/Qwen2.5-72B-Instruct
   - Qwen/Qwen2.5-7B-Instruct
+  - Qwen/Qwen3-0.6B
   - ibm-granite/granite-8b-code-instruct-4k
   - ibm-granite/granite-20b-code-instruct-8k
 DATASET: /workspace/vllm-project/benchmarks/sonnet.txt

.cd/benchmark/benchmark_scenarios_text.yaml

Lines changed: 3 additions & 0 deletions
@@ -41,6 +41,9 @@ qwen25_72b_instruct:
 qwen25_7b_instruct:
   MODEL: Qwen/Qwen2.5-7B-Instruct
 
+Qwen/Qwen3-0.6B:
+  MODEL: Qwen/Qwen3-0.6B
+
 granite_8b_code_instruct_4k:
   MODEL: ibm-granite/granite-8b-code-instruct-4k
 

.cd/docker-compose.yml

Lines changed: 1 addition & 1 deletion
@@ -43,5 +43,5 @@ services:
     env_file:
       - ./benchmark/benchmark_user.env
     volumes:
-      - ./logs:/root/scripts/logs
+      - /tmp/logs:/root/scripts/logs
     command: ["benchmark", "--config-file", "${VLLM_BENCHMARK_CONFIG_FILE}", "--config-name", "${VLLM_BENCHMARK_CONFIG_NAME}"]

.cd/server/settings_vllm.csv

Lines changed: 1 addition & 0 deletions
@@ -16,3 +16,4 @@ Qwen/Qwen2.5-7B-Instruct,1,4352,128,2,15231233024,2,2,14.18519115,0,10,5,128,1,3
 ibm-granite/granite-8b-code-instruct-4k,1,4096,128,2,21474836480,2,2,20,0,10,8,128,1,32,1,32,128,256,1,128,256,1,36,4096,8,32,2,32768,1,FALSE,FALSE,2048,FALSE,TRUE,TRUE,1,0
 ibm-granite/granite-20b-code-instruct-8k,1,4352,128,2,40133986304,2,2,37.37,0,10,4,128,1,32,1,32,128,256,1,128,256,1,52,6144,1,48,2,65536,1,FALSE,FALSE,2048,FALSE,TRUE,TRUE,1,0
 Qwen/Qwen2.5-VL-7B-Instruct,1,8448,128,2,15231233024,2,2,14.18519115,0,12,4,128,1,32,1,32,128,256,1,128,256,1,28,3584,4,28,2,32768,1,FALSE,FALSE,2048,FALSE,FALSE,FALSE,1,0
+Qwen/Qwen3-0.6B,1,4352,128,2,1.61E+09,2,2,1.5,0,10,5,128,1,32,1,32,128,256,1,128,256,1,28,1024,8,16,2,32768,1,FALSE,FALSE,2048,FALSE,TRUE,TRUE,1,0

.cd/templates/template_vllm_benchmark.sh

Lines changed: 3 additions & 1 deletion
@@ -35,4 +35,6 @@ vllm bench serve \
   --metric-percentiles 90 \
   --ignore-eos \
   --trust-remote-code \
-  2>&1 | tee -a logs/perftest_inp${INPUT_TOK}_out${OUTPUT_TOK}_user${CONCURRENT_REQ}.log
+  --save-result \
+  --result-dir logs \
+  --result-filename summary_inp${INPUT_TOK}_out${OUTPUT_TOK}_user${CONCURRENT_REQ}.json 2>&1 | tee -a logs/summary_inp${INPUT_TOK}_out${OUTPUT_TOK}_user${CONCURRENT_REQ}.log #save results to logs on a host

README.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ vLLM Hardware Plugin for Intel® Gaudi®
 ---
 *Latest News* 🔥
 
-- [2025/11] The 0.10.2 release introduces the production-ready version of the vLLM Hardware Plugin for Intel® Gaudi® v1.23.0. The plugin is an alternative to the [vLLM fork](https://github.com/HabanaAI/vllm-fork), which reaches end of life with this release and will be deprecated in v1.24.0, remaining functional only for legacy use cases. We strongly encourage all fork users to begin planning their migration to the plugin. For more information about this release, see the [Release Notes](docs/release_notes.md).
+- [2025/11] The 0.11.2 release introduces the production-ready version of the vLLM Hardware Plugin for Intel® Gaudi® v1.22.2. The plugin is an alternative to the [vLLM fork](https://github.com/HabanaAI/vllm-fork), which reaches end of life with this release and will be deprecated in v1.24.0, remaining functional only for legacy use cases. We strongly encourage all fork users to begin planning their migration to the plugin. For more information about this release, see the [Release Notes](docs/release_notes.md).
 - [2025/06] We introduced an early developer preview of the vLLM Hardware Plugin for Intel® Gaudi®, which is not yet intended for general use.
 
 ---
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@

import torch
from safetensors import safe_open
from safetensors.torch import save_file
from glob import glob
import os
import shutil
import argparse


def copy_other_files(input_path, output_path):
    # copy config/tokenizer JSON files and any Python helpers alongside the weights
    for file in os.listdir(input_path):
        if file.endswith(".json") or file.endswith(".py"):
            print(f"copying {file} to {output_path}")
            shutil.copyfile(
                os.path.join(input_path, file),
                os.path.join(output_path, file),
            )


def convert_files(input_path, output_path):
    all_safetensors = glob(f"{input_path}/*.safetensors")
    # sort by file name
    all_safetensors.sort()
    for safetensors_path in all_safetensors:
        tensors = {}
        print(f"processing {safetensors_path}")
        with safe_open(safetensors_path, framework="pt", device="cpu") as tensor_file:
            for k in tensor_file.keys():
                tensor = tensor_file.get_tensor(k)
                if "proj" in k:
                    if k.endswith("weight"):
                        # halve the weight; the matching scale is doubled below,
                        # so the dequantized value stays the same
                        tensor = (tensor.float() * 0.5).to(torch.float8_e4m3fn)
                    elif k.endswith("weight_scale_inv") or k.endswith("input_scale_inv"):
                        # "scale_inv" in deepseek-r1 is actually "scale"
                        tensor = tensor.float() * 2
                    else:
                        raise NotImplementedError(f"Cannot convert {k}")
                else:
                    print(f"skip {k}.")
                tensors[k] = tensor
        new_tensor_path = safetensors_path.replace(input_path, output_path)
        print(f"saving to {new_tensor_path}")
        save_file(tensors, new_tensor_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Convert tensors to float8_e4m3fn format.")
    parser.add_argument(
        "-i",
        "--input_path",
        help="Path to the official model weights.",
        required=True,
    )
    parser.add_argument(
        "-o",
        "--output_path",
        help="Path to the output directory.",
        required=True,
    )
    args = parser.parse_args()
    input_path = args.input_path
    output_path = args.output_path

    # create output directory if it does not exist
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    copy_other_files(input_path, output_path)
    convert_files(input_path, output_path)
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@

# Supported JSON Config File Options

The following table summarizes the options for the JSON config file:

| Attribute | Description | Values |
|----------------------|-------------|--------|
| **Mode** | The mode to run INC with. | - **MEASURE** – Measure statistics of all modules and emit the results to `dump_stats_path`.<br>- **QUANTIZE** *(default)* – Quantize and run the model according to the provided measurements. |
| **Observer** | The observer to measure the statistics. | - **maxabs** *(default)*<br>- **save** – Saves all tensors to files. |
| **Allowlist** | List of `nn.Module` names or types to quantize. Empty list means all supported modules are quantized by default. See *supported-modules*. | Default: empty list |
| **Blocklist** | List of `nn.Module` names or types **not** to quantize. | Default: empty list |
| **dump_stats_path** | Path to save and load measurements. Directory structure is created up to the last `/`; the string after the last `/` is used as a prefix for measurement files. | Default: `stats` |
| **scale_method** | Method for calculating the scale from measurements. | - `unit_scale` *(default)* – Always use scale of 1.<br>- `maxabs_arbitrary` – Stretch/compress maxabs to full-scale of FP8.<br>- `maxabs_hw` – Stretch/compress maxabs to full-scale of FP8, then replace with HW-accelerated scale based on `device_for_scales`.<br>- `maxabs_pow2` – Same as above but rounded to power of 2.<br>- `maxabs_hw_opt_weight` – Weight scale chosen for minimal MSE among HW accelerated scales; activations use `maxabs_hw`.<br>- `act_maxabs_pow2_weights_pcs_opt_pow2` – Per-channel weights use `maxabs_hw_opt_weight`; activations use `maxabs_pow2`.<br>- `act_maxabs_hw_weights_pcs_maxabs_pow2` – Per-channel weights use `maxabs_pow2`; activations use `maxabs_hw`.<br>- `act_maxabs_pcs_pow2_weight_maxabs_pts_pow2_hw` – **Dynamic quant only**: per-tensor weights use `maxabs_hw`; activations use per-token `maxabs_pow2`. |
| **measure_exclude** | Tensor types to exclude from measurement. | - `NONE` – Measure all tensors.<br>- `OUTPUT` *(default)* – Skip output tensors. |
| **scale_format** | Format of scales passed to custom PyTorch ops. | - `const` – Scales passed as tensors.<br>- `scalar` *(default)* – Scales passed as scalar values for compile-time & throughput optimizations. |
| **device_for_scales** | Exponent-bias values for converting FP32/BF16 to FP8-143. | - `GAUDI3` – Expanded exponent-bias range (0–63).<br>- `GAUDI2` – 4 possible exponent biases (3, 7, 11, 15), default is 7. |
| **dynamic_quantization** | Enables dynamic FP8 quantization with per-token scales. Only supported with `act_maxabs_pcs_pow2_weight_maxabs_pts_pow2_hw`. | - `true` – Enable.<br>- `false` *(default)* – Disable. |
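As an illustrative sketch only, not part of the committed file, a quantization config combining several of the attributes above might look like the following. The attribute names come from the table and from the sample configs added later in this commit; the particular value choices (`maxabs_hw`, `OUTPUT`) are assumptions:

```json
{
    "method": "HOOKS",
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "measure_exclude": "OUTPUT",
    "dump_stats_path": "./hqt_output/measure"
}
```

A calibration pass would use `"mode": "MEASURE"` with the same `dump_stats_path`, as in the sample measurement config included in this commit.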
---

## Configuring Backoff Factors

Maxabs-based scaling methods support the backoff factors `input_backoff` and `weight_backoff`, which leave margin when converting inputs and weights to FP8.

For example, so that an activation with a larger absolute value than was observed during calibration still fits in range, the measured maxabs value is mapped to:

```
input_backoff * FP8_143_FULLSCALE
```

Similarly, for weights:

```
weight_backoff * FP8_143_FULLSCALE
```

Defaults:

- `input_backoff = 0.25`
- `weight_backoff = 0.5`

To change these values, add the following to the quantization configuration JSON file:

```json
"scale_params": {"input_backoff": <INPUT_BACKOFF>, "weight_backoff": <WEIGHT_BACKOFF>}
```
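As a sketch (again not part of the diff), the `scale_params` entry sits alongside the other attributes of the quantization config; the values shown here are simply the documented defaults:

```json
{
    "method": "HOOKS",
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "ACT_MAXABS_POW2_WEIGHTS_PCS_OPT_POW2",
    "scale_params": {"input_backoff": 0.25, "weight_backoff": 0.5},
    "dump_stats_path": "./hqt_output/measure"
}
```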
---

## Compile Time and Throughput Optimization

Setting `"scale_format": "scalar"` enables:

- Faster compile time for FP8 inference by reducing the number of compiled recipes.
- Less host-side overhead when launching FP8 ops, improving throughput in host-bound cases (e.g., small batch sizes).

> **Note:**
> - Compile time improvement depends on model properties such as recipe count and scale distribution.
> - Not applicable to PCQ.
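For illustration (not taken from this commit), enabling the scalar scale format is a single attribute in the quantization config; compatibility with a given scale method, and the PCQ caveat above, should be checked against the INC documentation:

```json
{
    "method": "HOOKS",
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "scalar",
    "dump_stats_path": "./hqt_output/measure"
}
```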
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@

{
    "method": "HOOKS",
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "ACT_MAXABS_POW2_WEIGHTS_PCS_OPT_POW2",
    "dump_stats_path": "./hqt_output/measure"
}
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@

{
    "method": "HOOKS",
    "mode": "MEASURE",
    "observer": "maxabs",
    "dump_stats_path": "./hqt_output/measure"
}
