instanttensor crashes with dflash? eugr/spark-vllm-docker#211

I appreciate your project, it's very fast, thank you!

I have successfully used it with this command:

Output:

(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299]
(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.20.1rc1.dev132+gbaa9bcdef.d20260501
(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299] █▄█▀ █ █ █ █ model Intel/Qwen3.6-27B-int4-AutoRound
(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 05-04 02:04:43 [utils.py:299]
...
(EngineCore pid=102) INFO 05-04 02:04:57 [gpu_model_runner.py:4815] Starting to load model Intel/Qwen3.6-27B-int4-AutoRound...
(EngineCore pid=102) INFO 05-04 02:04:57 [cuda.py:423] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=102) INFO 05-04 02:04:57 [mm_encoder_attention.py:372] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=102) INFO 05-04 02:04:57 [gptq_marlin.py:387] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(EngineCore pid=102) INFO 05-04 02:04:57 [gdn_linear_attn.py:153] Using Triton/FLA GDN prefill kernel
(EngineCore pid=102) INFO 05-04 02:04:57 [cuda.py:308] Using AttentionBackendEnum.FLASH_ATTN backend.
(EngineCore pid=102) INFO 05-04 02:04:57 [flash_attn.py:646] Using FlashAttention version 2
Loading safetensors using InstantTensor loader: 0% Completed | 0/2013 [00:00<?, ?it/s]
Loading safetensors using InstantTensor loader: 58% Completed | 1173/2013 [00:01<00:00, 1172.74it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 2013/2013 [00:01<00:00, 1178.51it/s]
(EngineCore pid=102)
(EngineCore pid=102) INFO 05-04 02:05:03 [default_loader.py:391] Loading weights took 3.49 seconds
...
(APIServer pid=1) INFO 05-04 02:08:38 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
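With that configuration the server comes up and can be queried as usual, for example (assuming the same 127.0.0.1:9000 port mapping and --served-model-name qwen as in the failing command below):

```
curl -s http://127.0.0.1:9000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen", "prompt": "Hello", "max_tokens": 16}'
```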
However, it fails when trying this command:

docker run -d --name vllm-node-a \
--gpus all --network=ai-net --ipc=host \
-v ~/tcache/vllm:/root/.cache/vllm \
-v ~/tcache/flashinfer:/root/.cache/flashinfer \
-v ~/tcache/triton:/root/.triton \
-v /opt/models/huggingface:/root/.cache/huggingface \
-p 127.0.0.1:9000:8000 \
-e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
-e HF_TOKEN=... \
vllm-node-tf5-20260501 vllm serve Intel/Qwen3.6-27B-int4-AutoRound \
--served-model-name qwen \
--port 8000 \
--max-model-len 262144 \
--gpu-memory-utilization 0.85 \
--speculative-config '{"method":"dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}' \
--attention-backend flash_attn \
--max-num-batched-tokens 32768 \
--load-format instanttensor \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--enable-prefix-caching \
--language-model-only
With output:

(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299]
(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.20.1rc1.dev132+gbaa9bcdef.d20260501
(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299] █▄█▀ █ █ █ █ model Intel/Qwen3.6-27B-int4-AutoRound
(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 05-04 02:16:50 [utils.py:299]
...
(EngineCore pid=102) INFO 05-04 02:17:04 [gpu_model_runner.py:4815] Starting to load model Intel/Qwen3.6-27B-int4-AutoRound...
(EngineCore pid=102) INFO 05-04 02:17:04 [cuda.py:423] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=102) INFO 05-04 02:17:04 [mm_encoder_attention.py:372] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=102) INFO 05-04 02:17:04 [gptq_marlin.py:387] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(EngineCore pid=102) INFO 05-04 02:17:04 [gdn_linear_attn.py:153] Using Triton/FLA GDN prefill kernel
(EngineCore pid=102) INFO 05-04 02:17:04 [cuda.py:308] Using AttentionBackendEnum.FLASH_ATTN backend.
(EngineCore pid=102) INFO 05-04 02:17:04 [flash_attn.py:646] Using FlashAttention version 2
Loading safetensors using InstantTensor loader: 0% Completed | 0/2013 [00:00<?, ?it/s]
Loading safetensors using InstantTensor loader: 58% Completed | 1160/2013 [00:01<00:00, 1159.64it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 2013/2013 [00:01<00:00, 1192.25it/s]
(EngineCore pid=102)
(EngineCore pid=102) INFO 05-04 02:17:10 [default_loader.py:391] Loading weights took 3.41 seconds
(EngineCore pid=102) INFO 05-04 02:17:10 [gpu_model_runner.py:4839] Loading drafter model...
(EngineCore pid=102) INFO 05-04 02:17:10 [vllm.py:840] Asynchronous scheduling is enabled.
(EngineCore pid=102) INFO 05-04 02:17:10 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=102) INFO 05-04 02:17:10 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=102) INFO 05-04 02:17:10 [cuda.py:368] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=102) INFO 05-04 02:17:11 [weight_utils.py:659] No model.safetensors.index.json found in remote.
Loading safetensors using InstantTensor loader: 0% Completed | 0/58 [00:00<?, ?it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 58/58 [00:00<00:00, 204.77it/s]
(EngineCore pid=102)
!!!!!!! Segfault encountered !!!!!!!
File "<unknown>", line 0, in cuMemcpyDtoDAsync_v2
File "<unknown>", line 0, in cudaMemcpyAsync
File "<unknown>", line 0, in at::native::copy_device_to_device(at::TensorIterator&, bool, bool)
File "<unknown>", line 0, in at::native::copy_impl(at::Tensor&, at::Tensor const&, bool) [clone .isra.0]
File "<unknown>", line 0, in at::native::copy_(at::Tensor&, at::Tensor const&, bool)
File "<unknown>", line 0, in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor& (c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool), &torch::ADInplaceOrView::copy_>, at::Tensor&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool> >, at::Tensor& (c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool)
File "<unknown>", line 0, in torch::autograd::VariableType::(anonymous namespace)::copy_(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool)
File "<unknown>", line 0, in at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool)
File "<unknown>", line 0, in torch::autograd::THPVariable_copy_(_object*, _object*, _object*)
File "<unknown>", line 0, in PyObject_Vectorcall
File "<unknown>", line 0, in _PyEval_EvalFrameDefault
File "<unknown>", line 0, in PyIter_Next
File "<unknown>", line 0, in _PyEval_EvalFrameDefault
File "<unknown>", line 0, in _PyObject_Call_Prepend
File "<unknown>", line 0, in _PyObject_MakeTpCall
File "<unknown>", line 0, in _PyEval_EvalFrameDefault
File "<unknown>", line 0, in _PyObject_Call_Prepend
File "<unknown>", line 0, in PyObject_Call
File "<unknown>", line 0, in _PyEval_EvalFrameDefault
File "<unknown>", line 0, in PyEval_EvalCode
File "<unknown>", line 0, in PyRun_StringFlags
File "<unknown>", line 0, in PyRun_SimpleStringFlags
File "<unknown>", line 0, in Py_RunMain
File "<unknown>", line 0, in Py_BytesMain
File "<unknown>", line 0, in _start
File "<unknown>", line 0, in 0xffffffffffffffff
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 678, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 692, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 146, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=1) with launch_core_engines(
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1119, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1178, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
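The native frames above end in an asynchronous device-to-device copy (torch.Tensor.copy_ → at::native::copy_device_to_device → cudaMemcpyAsync → cuMemcpyDtoDAsync_v2), hit right after the drafter weights finish loading. For reference, here is a minimal sketch of that code path in isolation; my (unverified) guess is that the instanttensor loader hands copy_() a tensor whose device allocation is no longer valid, which would make exactly this call fault:

```
python3 - <<'PY'
import torch

# Same code path as the segfault frames: an async device-to-device copy via
# torch.Tensor.copy_, which dispatches to at::native::copy_device_to_device
# and ultimately cuMemcpyDtoDAsync_v2. On healthy allocations this succeeds;
# a freed or corrupted source/destination buffer would crash here.
src = torch.empty(1 << 20, dtype=torch.float16, device="cuda")
dst = torch.empty_like(src)
dst.copy_(src, non_blocking=True)
torch.cuda.synchronize()
print("plain device-to-device copy OK")
PY
```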
NOTE: the image was built with ./build-and-copy.sh -t vllm-node-tf5-$(date +'%Y%m%d') --tf5 --apply-vllm-pr 40898. However, the problem has existed for a while now, even without that additional pull-request patch on top.
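In case it helps to narrow things down: the two options that seem to interact are --load-format instanttensor and the dflash speculative config, so a stripped-down invocation along these lines (same models as above, every other flag dropped; a sketch, I have only run the full command) should be enough to reproduce:

```
vllm serve Intel/Qwen3.6-27B-int4-AutoRound \
  --load-format instanttensor \
  --speculative-config '{"method":"dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}'
```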