Crash when loading DFlash drafter model #12

Description

@dkopko

I appreciate your project; it's very fast. Thank you!

I have successfully used it with this command:

docker run -d --name vllm-node-a \
        --gpus all --network=ai-net --ipc=host \
        -v ~/tcache/vllm:/root/.cache/vllm \
        -v ~/tcache/flashinfer:/root/.cache/flashinfer \
        -v ~/tcache/triton:/root/.triton \
        -v /opt/models/huggingface:/root/.cache/huggingface \
        -p 127.0.0.1:9000:8000 \
        -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
        -e HF_TOKEN=... \
        vllm-node-tf5-20260501 vllm serve Intel/Qwen3.6-27B-int4-AutoRound \
                --served-model-name qwen \
                --port 8000 \
                --max-model-len 262144 \
                --gpu-memory-utilization 0.85 \
                --attention-backend flash_attn \
                --max-num-batched-tokens 32768 \
                --load-format instanttensor \
                --reasoning-parser qwen3 \
                --tool-call-parser qwen3_coder \
                --enable-auto-tool-choice \
                --enable-prefix-caching \
                --language-model-only

Output:

(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299] 
(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.1rc1.dev132+gbaa9bcdef.d20260501
(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299]   █▄█▀ █     █     █     █  model   Intel/Qwen3.6-27B-int4-AutoRound
(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299] 
...
(EngineCore pid=102) INFO 05-04 02:04:57 [gpu_model_runner.py:4815] Starting to load model Intel/Qwen3.6-27B-int4-AutoRound...
(EngineCore pid=102) INFO 05-04 02:04:57 [cuda.py:423] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=102) INFO 05-04 02:04:57 [mm_encoder_attention.py:372] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=102) INFO 05-04 02:04:57 [gptq_marlin.py:387] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(EngineCore pid=102) INFO 05-04 02:04:57 [gdn_linear_attn.py:153] Using Triton/FLA GDN prefill kernel
(EngineCore pid=102) INFO 05-04 02:04:57 [cuda.py:308] Using AttentionBackendEnum.FLASH_ATTN backend.
(EngineCore pid=102) INFO 05-04 02:04:57 [flash_attn.py:646] Using FlashAttention version 2
Loading safetensors using InstantTensor loader:   0% Completed | 0/2013 [00:00<?, ?it/s]
Loading safetensors using InstantTensor loader:  58% Completed | 1173/2013 [00:01<00:00, 1172.74it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 2013/2013 [00:01<00:00, 1178.51it/s]
(EngineCore pid=102) 
(EngineCore pid=102) INFO 05-04 02:05:03 [default_loader.py:391] Loading weights took 3.49 seconds
...
(APIServer pid=1) INFO 05-04 02:08:38 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
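
Once up, the server answers requests normally; for example (the prompt is just illustrative, using the served model name and the 9000 port mapping from above):

curl -s http://127.0.0.1:9000/v1/chat/completions \
        -H 'Content-Type: application/json' \
        -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'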

However, it crashes when I add speculative decoding; the command below is identical except for the added --speculative-config flag:

docker run -d --name vllm-node-a \
        --gpus all --network=ai-net --ipc=host \
        -v ~/tcache/vllm:/root/.cache/vllm \
        -v ~/tcache/flashinfer:/root/.cache/flashinfer \
        -v ~/tcache/triton:/root/.triton \
        -v /opt/models/huggingface:/root/.cache/huggingface \
        -p 127.0.0.1:9000:8000 \
        -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
        -e HF_TOKEN=... \
        vllm-node-tf5-20260501 vllm serve Intel/Qwen3.6-27B-int4-AutoRound \
                --served-model-name qwen \
                --port 8000 \
                --max-model-len 262144 \
                --gpu-memory-utilization 0.85 \
                --speculative-config '{"method":"dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}' \
                --attention-backend flash_attn \
                --max-num-batched-tokens 32768 \
                --load-format instanttensor \
                --reasoning-parser qwen3 \
                --tool-call-parser qwen3_coder \
                --enable-auto-tool-choice \
                --enable-prefix-caching \
                --language-model-only

With output:

(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299] 
(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.1rc1.dev132+gbaa9bcdef.d20260501
(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299]   █▄█▀ █     █     █     █  model   Intel/Qwen3.6-27B-int4-AutoRound
(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299] 
...
(EngineCore pid=102) INFO 05-04 02:17:04 [gpu_model_runner.py:4815] Starting to load model Intel/Qwen3.6-27B-int4-AutoRound...
(EngineCore pid=102) INFO 05-04 02:17:04 [cuda.py:423] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=102) INFO 05-04 02:17:04 [mm_encoder_attention.py:372] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=102) INFO 05-04 02:17:04 [gptq_marlin.py:387] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(EngineCore pid=102) INFO 05-04 02:17:04 [gdn_linear_attn.py:153] Using Triton/FLA GDN prefill kernel
(EngineCore pid=102) INFO 05-04 02:17:04 [cuda.py:308] Using AttentionBackendEnum.FLASH_ATTN backend.
(EngineCore pid=102) INFO 05-04 02:17:04 [flash_attn.py:646] Using FlashAttention version 2
Loading safetensors using InstantTensor loader:   0% Completed | 0/2013 [00:00<?, ?it/s]
Loading safetensors using InstantTensor loader:  58% Completed | 1160/2013 [00:01<00:00, 1159.64it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 2013/2013 [00:01<00:00, 1192.25it/s]
(EngineCore pid=102) 
(EngineCore pid=102) INFO 05-04 02:17:10 [default_loader.py:391] Loading weights took 3.41 seconds
(EngineCore pid=102) INFO 05-04 02:17:10 [gpu_model_runner.py:4839] Loading drafter model...
(EngineCore pid=102) INFO 05-04 02:17:10 [vllm.py:840] Asynchronous scheduling is enabled.
(EngineCore pid=102) INFO 05-04 02:17:10 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=102) INFO 05-04 02:17:10 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=102) INFO 05-04 02:17:10 [cuda.py:368] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=102) INFO 05-04 02:17:11 [weight_utils.py:659] No model.safetensors.index.json found in remote.
Loading safetensors using InstantTensor loader:   0% Completed | 0/58 [00:00<?, ?it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 58/58 [00:00<00:00, 204.77it/s]
(EngineCore pid=102) 
!!!!!!! Segfault encountered !!!!!!!
  File "<unknown>", line 0, in cuMemcpyDtoDAsync_v2
  File "<unknown>", line 0, in cudaMemcpyAsync
  File "<unknown>", line 0, in at::native::copy_device_to_device(at::TensorIterator&, bool, bool)
  File "<unknown>", line 0, in at::native::copy_impl(at::Tensor&, at::Tensor const&, bool) [clone .isra.0]
  File "<unknown>", line 0, in at::native::copy_(at::Tensor&, at::Tensor const&, bool)
  File "<unknown>", line 0, in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor& (c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool), &torch::ADInplaceOrView::copy_>, at::Tensor&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool> >, at::Tensor& (c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool)
  File "<unknown>", line 0, in torch::autograd::VariableType::(anonymous namespace)::copy_(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool)
  File "<unknown>", line 0, in at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool)
  File "<unknown>", line 0, in torch::autograd::THPVariable_copy_(_object*, _object*, _object*)
  File "<unknown>", line 0, in PyObject_Vectorcall
  File "<unknown>", line 0, in _PyEval_EvalFrameDefault
  File "<unknown>", line 0, in PyIter_Next
  File "<unknown>", line 0, in _PyEval_EvalFrameDefault
  File "<unknown>", line 0, in _PyObject_Call_Prepend
  File "<unknown>", line 0, in _PyObject_MakeTpCall
  File "<unknown>", line 0, in _PyEval_EvalFrameDefault
  File "<unknown>", line 0, in _PyObject_Call_Prepend
  File "<unknown>", line 0, in PyObject_Call
  File "<unknown>", line 0, in _PyEval_EvalFrameDefault
  File "<unknown>", line 0, in PyEval_EvalCode
  File "<unknown>", line 0, in PyRun_StringFlags
  File "<unknown>", line 0, in PyRun_SimpleStringFlags
  File "<unknown>", line 0, in Py_RunMain
  File "<unknown>", line 0, in Py_BytesMain
  File "<unknown>", line 0, in _start
  File "<unknown>", line 0, in 0xffffffffffffffff

(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 678, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 692, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 146, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1119, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1178, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
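
For anyone trying to reproduce, a stripped-down variant along these lines should suffice, assuming (unverified) that the network, cache-mount, parser, and tuning flags above are irrelevant to the crash:

docker run --rm --gpus all --ipc=host \
        -v /opt/models/huggingface:/root/.cache/huggingface \
        -e HF_TOKEN=... \
        vllm-node-tf5-20260501 vllm serve Intel/Qwen3.6-27B-int4-AutoRound \
                --speculative-config '{"method":"dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}' \
                --load-format instanttensor \
                --language-model-only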

NOTE:

  1. This is a build of eugr's spark-vllm-docker image, built on 20260501 with ./build-and-copy.sh -t vllm-node-tf5-$(date +'%Y%m%d') --tf5 --apply-vllm-pr 40898 . However, the problem has existed for a while now, and it occurs even without that additional pull-request patch applied on top.
  2. Others appear to have hit the same issue: "instanttensor crashes with dflash?" eugr/spark-vllm-docker#211. That title suggests the InstantTensor loader may be implicated; see the sketch after this list.
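
To test whether the loader really is the trigger (an unverified assumption based on that issue's title), one could rerun the failing command with vLLM's default loader, replacing

                --load-format instanttensor \

with

                --load-format auto \

If the drafter then loads cleanly, the crash is specific to the InstantTensor path.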
