Releases: jd-opensource/xllm

v0.8.0

02 Feb 02:40
00afcc1

Highlights

Model Support

NPU

  • Support DeepSeek-V3.2 model.
  • Support GLM-4.7 model.
  • Support GLM-4.6V model.
  • Support GME-Qwen2-VL model.
  • Support FluxControl model.

CUDA

  • Support Qwen2/3 Dense model.

MLU

  • Support DeepSeek-V3.2 model.
  • Support Qwen2_5_vl/Qwen3_vl/Qwen3_vl_moe models.

ILU

  • Support Qwen3-0.6B model.

Feature

  • Implement chunked prefill and prefix cache for Qwen3 MoE.
  • Support GLM-4.6V model.
  • Add wrappers for ATB and ACLNN fused operators.
  • Optimize prefetch from kv cache store.
  • Support Qwen2-VL and GME-Qwen2-VL models on NPU devices.
  • Fix hang issue when enabling schedule overlap.
  • Add GLM-4.7 detector implementation and update tool call parser.
  • Adapt hierarchy block manager for disagg PD.
  • Support DeepSeek-V3.2-Exp on NPU.
  • Support acl_graph for qwen3/qwen3_moe.
  • Support prefix cache for DeepSeek-V3/R1 models.
  • Support disagg PD for MTP.
  • Add mooncake kv cache transfer.
  • Add GLM-4.7 support to reasoning detector registry.
  • Support ND-to-NZ continuous memory copy.
  • Support RPC-based link/unlink for PD disaggregation.
  • Support IntraLayerAddNorm, aclgraph, etc. for DeepSeek-V3.2.
  • Add activation, norm, and rope ops for CUDA devices.
  • Support fused norm for Qwen3 and DeepSeek on CUDA devices.
  • Build DeepSeek-V2 decoder layer and related model files for MLU devices.
  • Support qwen2_5_vl/qwen3_vl/qwen3_vl_moe on MLU devices.
  • Add MoE all2all kernels and deep ep layer on MLU devices.
  • Support DeepSeek MTP on MLU devices.
  • Support graph executor on MLU devices.
  • Support DP+EP MoE and all2all computation on MLU devices.
  • Support parallelized shared experts in fused MoE on MLU devices.
  • Support Qwen3-0.6B model on Iluvatar devices.
  • Add rec proto, service, and utils for the rec framework.
  • Support C API for LLM inference.
  • Add constrained decoding for generative recommendation (see the sketch after this list).
  • Add rec scheduler master and engine for the rec framework.
  • Add rec_type and onerec batch input builder for the rec framework.
  • Add onerec worker impl for the rec framework.
  • Add Qwen3/LlmRec support in the rec framework.
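
As a point of reference for the constrained decoding item above, here is a minimal, generic sketch of token-level constrained decoding (masking the logits of tokens that are not allowed at the current step). It is not xllm's implementation; the toy vocabulary, scoring function, and constraint rule are hypothetical stand-ins.

  # Generic constrained-decoding sketch: at each step, tokens outside the
  # allowed set are masked so they can never be selected. Everything here
  # (vocab, scores, constraint) is a toy stand-in, not xllm code.
  import math

  VOCAB = ["<eos>", "item_1", "item_2", "item_3"]

  def allowed_tokens(generated):
      # Hypothetical constraint for generative recommendation:
      # emit at most two items, then force end-of-sequence.
      return {"<eos>"} if len(generated) >= 2 else {"item_1", "item_2", "item_3"}

  def fake_logits(generated):
      # Stand-in for a real model forward pass.
      return {tok: float(i) for i, tok in enumerate(VOCAB)}

  def constrained_greedy_decode(max_steps=8):
      generated = []
      for _ in range(max_steps):
          logits = fake_logits(generated)
          allowed = allowed_tokens(generated)
          # Mask disallowed tokens before taking the argmax.
          masked = {t: (s if t in allowed else -math.inf) for t, s in logits.items()}
          token = max(masked, key=masked.get)
          if token == "<eos>":
              break
          generated.append(token)
      return generated

  print(constrained_greedy_decode())  # e.g. ['item_3', 'item_3']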

Bugfix

  • Resolve core dump of stream chat completion requests when the backend is VLM.
  • Resolve duplicate content in multi-turn tool call conversations.
  • Fix core dump issue triggered by client disconnection.
  • Fix the memory leak issue in the completions interface.
  • Fix wrong positions in input validation when MTP is enabled.
  • Resolve kv_cache_num mismatch in ChunkedPrefill due to H2D block copy.
  • Fix the missing index shape in KV cache transfer allocation.
  • Fix MiMo-VL weights loading crash on NPU device.
  • Fix inaccurate metrics issue when enabling schedule overlap.
  • Fix potential out-of-range and block leaks during deallocate in D2H copy.
  • Fix allocation failure in HierarchyBlockManagerPool::allocate.
  • Fix DeepSeek accuracy issues with prefix cache enabled.
  • Resolve DeepSeek execution failure caused by invalid input.
  • Fix DeepSeek failing to run when enabling DP.
  • Fix the rate_limit bug for stream and non-stream requests in PD disaggregation and refactor some callback logic.
  • Correct attn mask when prefix cache and MTP are both enabled in DeepSeek.
  • Correct precision loss when enabling prefix cache with disaggregated PD.
  • Fix incorrect async implementation in rerank interface.
  • Fix acl_graph_executor not handling the q_cu_seq_lens parameter for DeepSeek-V3.2.
  • Fix precision issue when enabling MTP in PD disaggregation mode.
  • Fix mrope calculation in multimodal scenarios.
  • Fix core dump with large beam widths.

Release Images

x86 image

quay.io/jd_xllm/xllm-ai:xllm-0.8.0-release-hb-rc2-x86

ARM a2 device image

quay.io/jd_xllm/xllm-ai:xllm-0.8.0-release-hb-rc2-arm

ARM a3 device image

quay.io/jd_xllm/xllm-ai:xllm-0.8.0-release-hc-rc2-arm

CUDA device image

quay.io/jd_xllm/xllm-ai:xllm-0.8.0-release-cuda-x86
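
Any of the images above can be pulled with a standard container runtime, for example:

docker pull quay.io/jd_xllm/xllm-ai:xllm-0.8.0-release-cuda-x86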

v0.7.2

25 Dec 08:47

Release xllm 0.7.2

Highlights

Feature

  • Enhance Qwen3-MoE to support TP settings beyond 4.
  • Implement chunked prefill and prefix cache for Qwen3 MoE.
  • Support prefix cache for DeepSeek-V3/R1 models.

Bugfix

  • Fix core dump issue triggered by client disconnection.
  • Fix the incorrect reading of model args from Qwen3-VL's config.json.
  • Apply the BOS and EOS settings from the tokenizer config to the fast tokenizer.
  • Fix the memory leak issue.
  • Fix hang issue when enabling schedule overlap.

Release Images

x86 image

quay.io/jd_xllm/xllm-ai:xllm-0.7.2-release-hb-rc2-x86

ARM a2 device image

quay.io/jd_xllm/xllm-ai:xllm-0.7.2-release-hb-rc2-arm

ARM a3 device image

quay.io/jd_xllm/xllm-ai:xllm-0.7.2-release-hc-rc2-arm

v0.7.1

20 Nov 14:01

Highlights

Model Support

  • Support GLM-4.5-Air.
  • Support Qwen3-VL-Moe.

Feature

  • Support scheduler overlap when chunked prefill and MTP are enabled.
  • Enable multi-process mode when running VLM models.
  • Support AclGraph for GLM-4.5.

Bugfix

  • Resolve core dump of the Qwen embedding 0.6B model.
  • Resolve duplicate content in multi-turn tool call conversations.
  • Support sampler parameters for MTP.
  • Enable MTP and schedule overlap to work simultaneously.
  • Resolve google.protobuf.Struct parsing failures which broke tool_call and think toggle functionality.
  • Fix the precision issue in the Qwen2 model caused by model_type not being assigned.
  • Fix core dump of GLM-4.5 when enabling MTP.
  • Temporarily use heap allocation for VLM backend.
  • Resolve core dump of stream chat completion requests for VLM.

v0.7.0

20 Nov 13:03

Highlights

Model Support

  • Support GLM-4.5.
  • Support Qwen3-Embedding.
  • Support Qwen3-VL.
  • Support FluxFill.

Feature

  • Support the MLU backend, which currently supports Qwen3 series models.
  • Support dynamic disaggregated PD, with dynamic switching between P and D phases based on strategy.
  • Support multi-stream parallel overlap optimization.
  • Support beam-search capability in generative models.
  • Support virtual memory continuous kv-cache capability.
  • Support ACL graph executor.
  • Support unified online-offline co-location scheduling in disaggregated PD scenarios.
  • Support PrefillOnly Scheduler.
  • Support v1/rerank model service interface.
  • Support communication between devices via shared memory instead of RPC on a single machine.
  • Support function calling.
  • Support reasoning output in chat interface.
  • Support top-k+add fusion in the router component of MoE models.
  • Support offline inference for LLM, VLM, and Embedding models.
  • Optimize various aspects of runtime performance.

Bugfix

  • Skip cancelled requests when processing stream output.
  • Resolve segmentation fault during qwen3 quantized inference.
  • Fix the alignment of monitoring metrics format for Prometheus.
  • Clear outdated tensors to save memory when loading model weights.
  • Fix attention mask to support long sequence requests.
  • Fix bugs caused by enabling scheduler overlap.

v0.6.1

31 Oct 02:41
a0ca5b4

Highlights

Bugfix

  • Skip cancelled requests when processing stream output.
  • Resolve segmentation fault during qwen3 quantized inference.
  • Fix the alignment of monitoring metrics format for Prometheus.
  • Clear outdated tensors to save memory when loading model weights.

Release Images

x86 image

quay.io/jd_xllm/xllm-ai:xllm-0.6.1-release-hb-rc2-x86

ARM a2 device image

quay.io/jd_xllm/xllm-ai:xllm-0.6.1-release-hb-rc2-arm

ARM a3 device image

quay.io/jd_xllm/xllm-ai:xllm-0.6.1-release-hc-rc2-arm

v0.6.0

15 Sep 14:31

Highlights

Model Support

  • Support DeepSeek-V3/R1.
  • Support DeepSeek-R1-Distill-Qwen.
  • Support Kimi-k2.
  • Support Llama2/3.
  • Support Qwen2/2.5/QwQ.
  • Support Qwen3/Qwen3-MoE.
  • Support MiniCPM-V.
  • Support MiMo-VL.
  • Support Qwen2.5-VL.

Feature

  • Support KV cache store.
  • Support Expert Parallelism Load Balance.
  • Support multi-priority online/offline scheduler.
  • Support latency-aware scheduler.
  • Support serving early stop.
  • Optimize ppmatmul kernel.
  • Support image URL input for VLM.
  • Support disaggregated prefill and decoding.
  • Support large-scale EP parallelism.
  • Support Hash-based PrefixCache matching.
  • Support Multi-Token Prediction for DeepSeek.
  • Support asynchronous scheduling, allowing the scheduling and computational pipeline to execute in parallel.
  • Support EP, DP, and TP model parallelism.
  • Support multi-process and multi-node deployment.

Docs

  • Add getting started docs.
  • Add features docs.

Release Images

x86 image

quay.io/jd_xllm/xllm-ai:xllm-0.6.0-release-hb-rc2-py3.11-oe24.03-lts-x86

ARM a2 device image

quay.io/jd_xllm/xllm-ai:xllm-0.6.0-release-hb-rc2-py3.11-oe24.03-lts-arm

ARM a3 device image

quay.io/jd_xllm/xllm-ai:xllm-0.6.0-release-hc-rc2-py3.11-oe24.03-lts-arm