Crash when loading DFlash drafter model #12

Description

@dkopko

I appreciate your project; it's very fast. Thank you!

I have successfully used it with this command:

docker run -d --name vllm-node-a \
        --gpus all --network=ai-net --ipc=host \
        -v ~/tcache/vllm:/root/.cache/vllm \
        -v ~/tcache/flashinfer:/root/.cache/flashinfer \
        -v ~/tcache/triton:/root/.triton \
        -v /opt/models/huggingface:/root/.cache/huggingface \
        -p 127.0.0.1:9000:8000 \
        -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
        -e HF_TOKEN=... \
        vllm-node-tf5-20260501 vllm serve Intel/Qwen3.6-27B-int4-AutoRound \
                --served-model-name qwen \
                --port 8000 \
                --max-model-len 262144 \
                --gpu-memory-utilization 0.85 \
                --attention-backend flash_attn \
                --max-num-batched-tokens 32768 \
                --load-format instanttensor \
                --reasoning-parser qwen3 \
                --tool-call-parser qwen3_coder \
                --enable-auto-tool-choice \
                --enable-prefix-caching \
                --language-model-only

Output:

(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299] 
(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.1rc1.dev132+gbaa9bcdef.d20260501
(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299]   █▄█▀ █     █     █     █  model   Intel/Qwen3.6-27B-int4-AutoRound
(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299] 
...
(EngineCore pid=102) INFO 05-04 02:04:57 [gpu_model_runner.py:4815] Starting to load model Intel/Qwen3.6-27B-int4-AutoRound...
(EngineCore pid=102) INFO 05-04 02:04:57 [cuda.py:423] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=102) INFO 05-04 02:04:57 [mm_encoder_attention.py:372] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=102) INFO 05-04 02:04:57 [gptq_marlin.py:387] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(EngineCore pid=102) INFO 05-04 02:04:57 [gdn_linear_attn.py:153] Using Triton/FLA GDN prefill kernel
(EngineCore pid=102) INFO 05-04 02:04:57 [cuda.py:308] Using AttentionBackendEnum.FLASH_ATTN backend.
(EngineCore pid=102) INFO 05-04 02:04:57 [flash_attn.py:646] Using FlashAttention version 2
Loading safetensors using InstantTensor loader:   0% Completed | 0/2013 [00:00<?, ?it/s]
Loading safetensors using InstantTensor loader:  58% Completed | 1173/2013 [00:01<00:00, 1172.74it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 2013/2013 [00:01<00:00, 1178.51it/s]
(EngineCore pid=102) 
(EngineCore pid=102) INFO 05-04 02:05:03 [default_loader.py:391] Loading weights took 3.49 seconds
...
(APIServer pid=1) INFO 05-04 02:08:38 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
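
Once up, the server answers requests normally; for example (the prompt is just illustrative, using the served model name and the 9000 port mapping from above):

curl -s http://127.0.0.1:9000/v1/chat/completions \
        -H 'Content-Type: application/json' \
        -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'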

However, it crashes when I add speculative decoding; the command below is identical except for the added --speculative-config flag:

docker run -d --name vllm-node-a \
        --gpus all --network=ai-net --ipc=host \
        -v ~/tcache/vllm:/root/.cache/vllm \
        -v ~/tcache/flashinfer:/root/.cache/flashinfer \
        -v ~/tcache/triton:/root/.triton \
        -v /opt/models/huggingface:/root/.cache/huggingface \
        -p 127.0.0.1:9000:8000 \
        -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
        -e HF_TOKEN=... \
        vllm-node-tf5-20260501 vllm serve Intel/Qwen3.6-27B-int4-AutoRound \
                --served-model-name qwen \
                --port 8000 \
                --max-model-len 262144 \
                --gpu-memory-utilization 0.85 \
                --speculative-config '{"method":"dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}' \
                --attention-backend flash_attn \
                --max-num-batched-tokens 32768 \
                --load-format instanttensor \
                --reasoning-parser qwen3 \
                --tool-call-parser qwen3_coder \
                --enable-auto-tool-choice \
                --enable-prefix-caching \
                --language-model-only

With output:

(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299] 
(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.1rc1.dev132+gbaa9bcdef.d20260501
(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299]   █▄█▀ █     █     █     █  model   Intel/Qwen3.6-27B-int4-AutoRound
(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299] 
...
(EngineCore pid=102) INFO 05-04 02:17:04 [gpu_model_runner.py:4815] Starting to load model Intel/Qwen3.6-27B-int4-AutoRound...
(EngineCore pid=102) INFO 05-04 02:17:04 [cuda.py:423] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=102) INFO 05-04 02:17:04 [mm_encoder_attention.py:372] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=102) INFO 05-04 02:17:04 [gptq_marlin.py:387] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(EngineCore pid=102) INFO 05-04 02:17:04 [gdn_linear_attn.py:153] Using Triton/FLA GDN prefill kernel
(EngineCore pid=102) INFO 05-04 02:17:04 [cuda.py:308] Using AttentionBackendEnum.FLASH_ATTN backend.
(EngineCore pid=102) INFO 05-04 02:17:04 [flash_attn.py:646] Using FlashAttention version 2
Loading safetensors using InstantTensor loader:   0% Completed | 0/2013 [00:00<?, ?it/s]
Loading safetensors using InstantTensor loader:  58% Completed | 1160/2013 [00:01<00:00, 1159.64it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 2013/2013 [00:01<00:00, 1192.25it/s]
(EngineCore pid=102) 
(EngineCore pid=102) INFO 05-04 02:17:10 [default_loader.py:391] Loading weights took 3.41 seconds
(EngineCore pid=102) INFO 05-04 02:17:10 [gpu_model_runner.py:4839] Loading drafter model...
(EngineCore pid=102) INFO 05-04 02:17:10 [vllm.py:840] Asynchronous scheduling is enabled.
(EngineCore pid=102) INFO 05-04 02:17:10 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=102) INFO 05-04 02:17:10 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=102) INFO 05-04 02:17:10 [cuda.py:368] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=102) INFO 05-04 02:17:11 [weight_utils.py:659] No model.safetensors.index.json found in remote.
Loading safetensors using InstantTensor loader:   0% Completed | 0/58 [00:00<?, ?it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 58/58 [00:00<00:00, 204.77it/s]
(EngineCore pid=102) 
!!!!!!! Segfault encountered !!!!!!!
  File "<unknown>", line 0, in cuMemcpyDtoDAsync_v2
  File "<unknown>", line 0, in cudaMemcpyAsync
  File "<unknown>", line 0, in at::native::copy_device_to_device(at::TensorIterator&, bool, bool)
  File "<unknown>", line 0, in at::native::copy_impl(at::Tensor&, at::Tensor const&, bool) [clone .isra.0]
  File "<unknown>", line 0, in at::native::copy_(at::Tensor&, at::Tensor const&, bool)
  File "<unknown>", line 0, in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor& (c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool), &torch::ADInplaceOrView::copy_>, at::Tensor&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool> >, at::Tensor& (c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool)
  File "<unknown>", line 0, in torch::autograd::VariableType::(anonymous namespace)::copy_(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool)
  File "<unknown>", line 0, in at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool)
  File "<unknown>", line 0, in torch::autograd::THPVariable_copy_(_object*, _object*, _object*)
  File "<unknown>", line 0, in PyObject_Vectorcall
  File "<unknown>", line 0, in _PyEval_EvalFrameDefault
  File "<unknown>", line 0, in PyIter_Next
  File "<unknown>", line 0, in _PyEval_EvalFrameDefault
  File "<unknown>", line 0, in _PyObject_Call_Prepend
  File "<unknown>", line 0, in _PyObject_MakeTpCall
  File "<unknown>", line 0, in _PyEval_EvalFrameDefault
  File "<unknown>", line 0, in _PyObject_Call_Prepend
  File "<unknown>", line 0, in PyObject_Call
  File "<unknown>", line 0, in _PyEval_EvalFrameDefault
  File "<unknown>", line 0, in PyEval_EvalCode
  File "<unknown>", line 0, in PyRun_StringFlags
  File "<unknown>", line 0, in PyRun_SimpleStringFlags
  File "<unknown>", line 0, in Py_RunMain
  File "<unknown>", line 0, in Py_BytesMain
  File "<unknown>", line 0, in _start
  File "<unknown>", line 0, in 0xffffffffffffffff

(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 678, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 692, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 146, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1119, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1178, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
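
For anyone trying to reproduce, a stripped-down variant along these lines should suffice, assuming (unverified) that the network, cache-mount, parser, and tuning flags above are irrelevant to the crash:

docker run --rm --gpus all --ipc=host \
        -v /opt/models/huggingface:/root/.cache/huggingface \
        -e HF_TOKEN=... \
        vllm-node-tf5-20260501 vllm serve Intel/Qwen3.6-27B-int4-AutoRound \
                --speculative-config '{"method":"dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}' \
                --load-format instanttensor \
                --language-model-only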

NOTE:

  1. This is a build of eugr's spark-vllm-docker image, built on 20260501 with ./build-and-copy.sh -t vllm-node-tf5-$(date +'%Y%m%d') --tf5 --apply-vllm-pr 40898 . However, the problem has existed for a while now, and it occurs even without that additional pull-request patch applied on top.
  2. Others appear to have hit the same issue: "instanttensor crashes with dflash?" eugr/spark-vllm-docker#211. That title suggests the InstantTensor loader may be implicated; see the sketch after this list.
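
To test whether the loader really is the trigger (an unverified assumption based on that issue's title), one could rerun the failing command with vLLM's default loader, replacing

                --load-format instanttensor \

with

                --load-format auto \

If the drafter then loads cleanly, the crash is specific to the InstantTensor path.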
