
[Bug]: W8A16-FP8_Block quant from llm-compressor fails to load on Blackwell SM12.0 #1919

Description

@phaelon74

⚙️ Your current environment

The full error output is below; in short, the FP8_Block quant fails to load.

As you can see below, there is a mismatch between vLLM 0.11.1 and W8A16-FP8_BLOCK quants. I was under the impression that the PR for FP8_BLOCK on SM12.0 had already been merged into mainline?

### Environment Information ###
Operating System: `Linux-6.8.0-85-generic-x86_64-with-glibc2.39`
Python Version: `3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]`
llm-compressor Version: `None`
compressed-tensors Version: `0.11.0`
transformers Version: `4.57.0`
torch Version: `2.8.0+cu129`
CUDA Devices: `['NVIDIA RTX PRO 6000 Blackwell Workstation Edition', 'NVIDIA RTX PRO 6000 Blackwell Workstation Edition']`
AMD Devices: `None`

🐛 Describe the bug

Here's the relevant section:

AttributeError: 'QKVParallelLinear' object has no attribute 'orig_dtype'
... in compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py
... -> maybe_post_process_fp8_weight_block(...) -> fp8_utils.py
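For context, this is the standard torch.nn.Module attribute-lookup failure: the FP8 block post-processing reads layer.orig_dtype, but that attribute apparently never gets set on the layer for this W8A16 FP8_BLOCK scheme. Below is a minimal, self-contained sketch of just that mechanism; FakeQKVParallelLinear is a hypothetical stand-in, not vLLM's actual QKVParallelLinear class.

```python
# Minimal sketch of the failure mechanism, assuming the post-processing step
# reads an `orig_dtype` attribute that was never assigned on the layer.
# FakeQKVParallelLinear is a hypothetical stand-in, not vLLM's class.
import torch
import torch.nn as nn

class FakeQKVParallelLinear(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # The quantized weight exists, but orig_dtype is never assigned,
        # mirroring what the traceback suggests for this quant scheme.
        self.register_buffer("weight", torch.zeros(8, 8, dtype=torch.float16))

layer = FakeQKVParallelLinear()
try:
    _ = layer.orig_dtype  # same attribute lookup as in fp8_utils.py:915
except AttributeError as err:
    # torch.nn.Module.__getattr__ raises AttributeError when the name is not
    # a parameter, buffer, submodule, or plain instance attribute.
    print(err)  # "'FakeQKVParallelLinear' object has no attribute 'orig_dtype'"
```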

Full error output:

Loading safetensors checkpoint shards:   0% Completed | 0/26 [00:00<?, ?it/s]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:   4% Completed | 1/26 [00:01<00:30,  1.22s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:   8% Completed | 2/26 [00:02<00:29,  1.24s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  12% Completed | 3/26 [00:03<00:28,  1.22s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  15% Completed | 4/26 [00:04<00:26,  1.20s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  19% Completed | 5/26 [00:06<00:25,  1.23s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  23% Completed | 6/26 [00:07<00:25,  1.29s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  27% Completed | 7/26 [00:08<00:24,  1.31s/it]
(APIServer pid=71476) DEBUG 10-13 14:12:27 [v1/engine/utils.py:776] Waiting for 1 local, 0 remote core engine proc(s) to start.
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  31% Completed | 8/26 [00:10<00:23,  1.28s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  35% Completed | 9/26 [00:11<00:21,  1.27s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  38% Completed | 10/26 [00:12<00:20,  1.29s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  42% Completed | 11/26 [00:13<00:19,  1.29s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  46% Completed | 12/26 [00:15<00:17,  1.26s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  50% Completed | 13/26 [00:16<00:16,  1.25s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  54% Completed | 14/26 [00:17<00:15,  1.30s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  58% Completed | 15/26 [00:19<00:14,  1.32s/it]
(APIServer pid=71476) DEBUG 10-13 14:12:37 [v1/engine/utils.py:776] Waiting for 1 local, 0 remote core engine proc(s) to start.
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  62% Completed | 16/26 [00:20<00:12,  1.27s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  65% Completed | 17/26 [00:21<00:11,  1.24s/it]
(Worker_TP1 pid=71620) DEBUG 10-13 14:12:39 [model_executor/models/utils.py:186] Loaded weight lm_head.weight with shape torch.Size([16384, 12288])
(Worker_TP0 pid=71619) DEBUG 10-13 14:12:39 [model_executor/models/utils.py:186] Loaded weight lm_head.weight with shape torch.Size([16384, 12288])
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  69% Completed | 18/26 [00:22<00:08,  1.04s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  73% Completed | 19/26 [00:23<00:07,  1.09s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  77% Completed | 20/26 [00:24<00:06,  1.15s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  81% Completed | 21/26 [00:25<00:05,  1.17s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  85% Completed | 22/26 [00:27<00:04,  1.18s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  88% Completed | 23/26 [00:28<00:03,  1.24s/it]
(APIServer pid=71476) DEBUG 10-13 14:12:47 [v1/engine/utils.py:776] Waiting for 1 local, 0 remote core engine proc(s) to start.
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  92% Completed | 24/26 [00:29<00:02,  1.30s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards:  96% Completed | 25/26 [00:31<00:01,  1.26s/it]
(Worker_TP1 pid=71620) INFO 10-13 14:12:49 [model_executor/model_loader/default_loader.py:267] Loading weights took 32.12 seconds
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards: 100% Completed | 26/26 [00:32<00:00,  1.24s/it]
(Worker_TP0 pid=71619) Loading safetensors checkpoint shards: 100% Completed | 26/26 [00:32<00:00,  1.24s/it]
(Worker_TP0 pid=71619) INFO 10-13 14:12:50 [model_executor/model_loader/default_loader.py:267] Loading weights took 32.33 seconds
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597] WorkerProc failed to start.
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597] Traceback (most recent call last):
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]     worker = WorkerProc(*args, **kwargs)
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 437, in __init__
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]     self.worker.load_model()
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2635, in load_model
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]     self.model = model_loader.load_model(
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 51, in load_model
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]     process_weights_after_loading(model, model_config, target_device)
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 112, in process_weights_after_loading
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]     quant_method.process_weights_after_loading(module)
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 718, in process_weights_after_loading
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]     layer.scheme.process_weights_after_loading(layer)
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py", line 136, in process_weights_after_loading
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]     maybe_post_process_fp8_weight_block(
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 915, in maybe_post_process_fp8_weight_block
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]     layer.orig_dtype, layer.weight)
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]     ^^^^^^^^^^^^^^^^
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1962, in __getattr__
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597]     raise AttributeError(
(Worker_TP1 pid=71620) ERROR 10-13 14:12:50 [v1/executor/multiproc_executor.py:597] AttributeError: 'QKVParallelLinear' object has no attribute 'orig_dtype'
(Worker_TP1 pid=71620) INFO 10-13 14:12:50 [v1/executor/multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP0 pid=71619) INFO 10-13 14:12:50 [v1/executor/multiproc_executor.py:558] Parent process exited, terminating worker
[rank0]:[W1013 14:12:51.310831696 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708]     self._init_executor()
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708]   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708]     raise e from None
(EngineCore_DP0 pid=71549) ERROR 10-13 14:12:52 [v1/engine/core.py:708] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=71549) Process EngineCore_DP0:
(EngineCore_DP0 pid=71549) Traceback (most recent call last):
(EngineCore_DP0 pid=71549)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=71549)     self.run()
(EngineCore_DP0 pid=71549)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=71549)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=71549)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=71549)     raise e
(EngineCore_DP0 pid=71549)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=71549)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=71549)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=71549)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=71549)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=71549)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=71549)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=71549)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=71549)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=71549)     self._init_executor()
(EngineCore_DP0 pid=71549)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=71549)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=71549)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=71549)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=71549)     raise e from None
(EngineCore_DP0 pid=71549) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=71476) Traceback (most recent call last):
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/bin/vllm", line 8, in <module>
(APIServer pid=71476)     sys.exit(main())
(APIServer pid=71476)              ^^^^^^
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=71476)     args.dispatch_function(args)
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 57, in cmd
(APIServer pid=71476)     uvloop.run(run_server(args))
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=71476)     return __asyncio.run(
(APIServer pid=71476)            ^^^^^^^^^^^^^^
(APIServer pid=71476)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=71476)     return runner.run(main)
(APIServer pid=71476)            ^^^^^^^^^^^^^^^^
(APIServer pid=71476)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=71476)     return self._loop.run_until_complete(task)
(APIServer pid=71476)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=71476)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=71476)     return await main
(APIServer pid=71476)            ^^^^^^^^^^
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1884, in run_server
(APIServer pid=71476)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1902, in run_server_worker
(APIServer pid=71476)     async with build_async_engine_client(
(APIServer pid=71476)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=71476)     return await anext(self.gen)
(APIServer pid=71476)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 180, in build_async_engine_client
(APIServer pid=71476)     async with build_async_engine_client_from_engine_args(
(APIServer pid=71476)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=71476)     return await anext(self.gen)
(APIServer pid=71476)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 225, in build_async_engine_client_from_engine_args
(APIServer pid=71476)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=71476)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 1572, in inner
(APIServer pid=71476)     return fn(*args, **kwargs)
(APIServer pid=71476)            ^^^^^^^^^^^^^^^^^^^
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
(APIServer pid=71476)     return cls(
(APIServer pid=71476)            ^^^^
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=71476)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=71476)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=71476)     return AsyncMPClient(*client_args)
(APIServer pid=71476)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=71476)     super().__init__(
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=71476)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=71476)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=71476)     next(self.gen)
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 732, in launch_core_engines
(APIServer pid=71476)     wait_for_engine_startup(
(APIServer pid=71476)   File "/home/phaedawg/vllm/venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
(APIServer pid=71476)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=71476) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d 

🛠️ Steps to reproduce

Install vLLM mainline.
Load a W8A16-FP8_BLOCK quant of a dense model.
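
For completeness, here is a hedged reproduction sketch using vLLM's offline LLM API, which goes through the same weight loading and post-processing path as the vllm serve run shown in the log. The checkpoint path is a placeholder (the issue does not name the model), and tensor_parallel_size=2 mirrors the two Blackwell GPUs (Worker_TP0/Worker_TP1) visible above.

```python
# Hedged reproduction sketch; the model path is a placeholder, not the
# actual checkpoint from the report. Simply constructing the engine should
# trigger the error, since it occurs during process_weights_after_loading.
from vllm import LLM

llm = LLM(
    model="/path/to/W8A16-FP8_BLOCK-quantized-model",  # placeholder path
    tensor_parallel_size=2,  # matches the two GPUs used in the original run
)
```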

Labels

bug: Something isn't working
fp8: For any issue / PR related to FP8 support
vllm: Using vLLM
