What's Changed
🚀 Features
- [dlinfer] feat: add DlinferFlashAttention to support qwen vl. by @Reinerzhou in #2952
💥 Improvements
- refactor PyTorchEngine check env by @grimoire in #2870
- refine multi-backend setup.py by @jinminxi104 in #2880
- Refactor VLM modules by @lvhan028 in #2810
- [dlinfer] only compile the language model in vl models by @tangzhiyi11 in #2893
- Optimize tp broadcast by @grimoire in #2889
- unfreeze torch version in dockerfile by @RunningLeon in #2906
- support tp > n_kv_heads for pt engine by @RunningLeon in #2872
- replicate kv for some models when tp is divisible by kv_head_num by @irexyc in #2874 (a conceptual sketch follows this list)
- Fallback to pytorch engine when the model is quantized by smooth quant by @lvhan028 in #2953
- Torchrun launching multiple api_server by @AllentDan in #2402
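The two KV-head items above (#2872, #2874) share one idea: when the tensor-parallel world size exceeds the number of KV heads, each KV head is duplicated so that every rank still owns a copy. The sketch below illustrates that idea in PyTorch with hypothetical names and layout assumptions; it is not LMDeploy's actual implementation.

```python
import torch


def replicate_kv_heads(kv: torch.Tensor, num_kv_heads: int, tp_size: int) -> torch.Tensor:
    """Duplicate KV heads so each of `tp_size` ranks can own one copy.

    Hypothetical helper: `kv` is assumed to be laid out as
    [num_kv_heads, head_dim]. Not LMDeploy's real code.
    """
    if tp_size <= num_kv_heads:
        return kv  # enough heads already; shard them as usual
    assert tp_size % num_kv_heads == 0, "tp must be divisible by kv_head_num"
    replicas = tp_size // num_kv_heads
    # Each head appears `replicas` times, so rank i simply takes row i after sharding.
    return kv.repeat_interleave(replicas, dim=0)


# Example: 2 KV heads replicated for tp=8 -> 8 rows, 4 copies of each head.
kv = torch.randn(2, 64)
print(replicate_kv_heads(kv, num_kv_heads=2, tp_size=8).shape)  # torch.Size([8, 64])
```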
🐞 Bug fixes
- [Feature] Support for loading lora adapter weights in safetensors format by @Galaxy-Husky in #2860
- fix cpu cache by @grimoire in #2881
- Fix args type in docstring by @Galaxy-Husky in #2888
- Fix llama3.1 chat template by @fzyzcjy in #2862
- Fix typo by @ghntd in #2916
- fix: Incorrect stats size during inference of throughput benchmark when concurrency > num_prompts by @pancak3 in #2928
- fix lora name and rearrange wqkv for internlm2 by @RunningLeon in #2912
- [dlinfer] fix moe op for dlinfer. by @Reinerzhou in #2917
- [side effect] fix vlm quantization failure by @lvhan028 in #2914
- fix torch_dtype by @RunningLeon in #2933
- support unaligned qkv heads by @grimoire in #2930
- fix mllama inference without image by @RunningLeon in #2947
- Support torch_dtype modification and update FAQs for AWQ quantization by @AllentDan in #2898
- Fix exception handler for proxy server by @AllentDan in #2901
- Fix torch_dtype in lite by @AllentDan in #2956
- [side-effect] bring back quantization of qwen2-vl, glm4v, etc. by @lvhan028 in #2954
- add a thread pool executor to control the vl engine traffic by @lvhan028 in #2970 (a minimal sketch follows this list)
- [side-effect] fix gradio demo error by @lvhan028 in #2976
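For the VL engine traffic item above (#2970), the gist is a bounded thread pool in front of vision preprocessing, so a burst of image requests queues up instead of all running at once. Below is a minimal sketch using Python's standard `concurrent.futures`; the function names and worker count are illustrative assumptions, not the actual LMDeploy code.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical throttle: at most `max_workers` preprocessing jobs run concurrently;
# additional submissions wait in the executor's internal queue.
vl_executor = ThreadPoolExecutor(max_workers=4)


def preprocess_image(image_bytes: bytes) -> bytes:
    # Placeholder for the real vision preprocessing step.
    return image_bytes


def submit_image(image_bytes: bytes) -> Future:
    # Callers receive a Future and only block when they call .result(),
    # while the pool size caps the concurrent load on the VL engine.
    return vl_executor.submit(preprocess_image, image_bytes)
```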
🌐 Other
- [dlinfer] fix engine checker by @tangzhiyi11 in #2891
- Bump version to v0.6.5 by @lvhan028 in #2955
New Contributors
- @Galaxy-Husky made their first contribution in #2860
- @fzyzcjy made their first contribution in #2862
- @ghntd made their first contribution in #2916
- @pancak3 made their first contribution in #2928
Full Changelog: v0.6.4...v0.6.5