Merged

99 commits
a8c5c54
upgrade activation_func to transformers v4.54
wcrzlh Aug 18, 2025
c05a707
feat(transformers): upgrade attn_mask/rope to 4.54
wcrzlh Aug 19, 2025
84b3ece
feat(transformers): upgrade modeling_layers to 4.54
wcrzlh Aug 19, 2025
6b94c2d
feat(transformers): upgrade cache_utils to 4.54
wcrzlh Aug 19, 2025
2f70121
feat(transformers): upgrade modeling_utils to v4.54
wcrzlh Aug 20, 2025
44ad424
feat(transformers): upgrade generation/utils to v4.54
wcrzlh Aug 20, 2025
17da5c7
feat(transformers): add ernie4.5 for validation
wcrzlh Aug 21, 2025
fd769b1
fix get_type_hints problem
wcrzlh Aug 21, 2025
37fe594
fix get_type_hints problem
wcrzlh Aug 21, 2025
0874e77
fix get_type_hints problem
wcrzlh Aug 21, 2025
94fb78b
fix metadata.get keyerror
wcrzlh Aug 21, 2025
bf69ef9
fix masking_utils alignment
wcrzlh Aug 21, 2025
a1be89c
fix generation/utils logic
wcrzlh Aug 21, 2025
b3334ac
fix get_output_embedding override bug
wcrzlh Aug 21, 2025
833419c
fix __init_subclass__ bug
wcrzlh Aug 21, 2025
c532a9a
suplement checkpoint_conversion_mapping
wcrzlh Aug 22, 2025
1ac2f72
feat(transformers): upgrade beam search to v4.54
wcrzlh Aug 22, 2025
375e6ab
feat(transformers): upgrade candidate_generator to v4.54
wcrzlh Aug 25, 2025
25033c1
feat(transformers): upgrade logits_process/stopping_criteria to v4.54
wcrzlh Aug 25, 2025
252f4aa
pre-commit
wcrzlh Aug 25, 2025
02834b0
pre-commit
wcrzlh Aug 25, 2025
65e8256
update backbone_utils
wtomin Aug 20, 2025
913cd3c
update generic
wtomin Aug 20, 2025
a0c9dc9
remove add_model_info_to_auto_map & update feature_extraction_utils.py
wtomin Aug 20, 2025
2a517d8
remove add_model_info_to_auto_map & update image_processing_base.py
wtomin Aug 20, 2025
8d29722
remove add_model_info_to_auto_map & update processing_utils.py
wtomin Aug 20, 2025
8a90ca6
remove add_model_info_to_auto_map & update video_utils.py
wtomin Aug 20, 2025
dcb98ac
tokenization_utils.py update
wtomin Aug 20, 2025
5120977
add_model_info_to_custom_pipelines
wtomin Aug 20, 2025
9a18655
update tokenization_utils_base.py
wtomin Aug 20, 2025
509a308
update image_transforms.py
wtomin Aug 20, 2025
33ed2be
update video_utils.py and image_utils.py
wtomin Aug 20, 2025
0d8142c
update image_utils.py & image_processing_utils_fast.py
wtomin Aug 20, 2025
991a783
update integration sdpa_attention.py
wtomin Aug 22, 2025
ccb0897
update mask_utils.py
wtomin Aug 22, 2025
00f2ba3
update modeling_flash_attention_utils.py
wtomin Aug 22, 2025
d0b34fb
update modeling_outputs.py
wtomin Aug 22, 2025
f75b06a
fix pre-commit errors
wtomin Aug 22, 2025
f9ea8ce
fix pre-commit errors
wtomin Aug 22, 2025
03635c5
rebase
wcrzlh Aug 25, 2025
7bed7a1
add modeling_layers.py from cui yushi
wtomin Aug 22, 2025
61b4f5c
fix import in transformers
wtomin Aug 22, 2025
9205ca2
Merge branch 'transformers_4.54_base' into transformer-v4.54.1
wtomin Aug 25, 2025
3e3f452
rm tokenization_utils.py and tokenization_utils_base.py
wtomin Aug 25, 2025
91609b9
resize stacked images one by one
wtomin Aug 25, 2025
ffd3377
remove torchvision decoders
wtomin Aug 25, 2025
b38bf63
fix get_default_dtype bug
wcrzlh Aug 25, 2025
f32b7cb
load module dynamically from mindone/transformers
wtomin Aug 25, 2025
2cb578b
not support FA
wtomin Aug 25, 2025
7ad706d
Merge pull request #2 from wtomin/transformer-v4.54.1
wcrzlh Aug 25, 2025
9457ebc
add video_processing_utils
wcrzlh Aug 25, 2025
32031d0
fix import error/add audio_utils/fix processor bug/attn_implementatio…
wtomin Aug 25, 2025
294d153
fix attn_implementation configuration bug
wcrzlh Aug 25, 2025
a44b0f6
Fix attn_implementation
wtomin Aug 26, 2025
ba674bc
fix fa bug/key_renaming_mapping bug
wcrzlh Aug 26, 2025
3ab17b0
pre-commit
wcrzlh Aug 26, 2025
ee91d87
upgrade modeling_utils/save_pretrained to transformersv4.54
wcrzlh Aug 26, 2025
ff82ffb
refactor fa part
wcrzlh Aug 26, 2025
58e07d6
Fix some model's UT
wtomin Aug 27, 2025
ab125b4
revert _support_dynamic_input to _support_jit
wcrzlh Aug 27, 2025
226bd0e
fix class name mismatch in generation/utils
wcrzlh Aug 27, 2025
d156ca6
fix pa error/delete unused fa part
wcrzlh Aug 27, 2025
fe3304b
remove unused part
wcrzlh Aug 27, 2025
934520f
generation/utils ops-->mint
wcrzlh Aug 27, 2025
4aab9fa
copyright/pre-commit
wcrzlh Aug 27, 2025
d104c56
fix bugs
wcrzlh Aug 27, 2025
ba0a8eb
supplement activation api
wcrzlh Aug 28, 2025
9e36ba8
reformat
wcrzlh Aug 28, 2025
738d9bb
remove losskwargs
wtomin Aug 28, 2025
c80e2fd
fix disable_grouping bug in image processing
wcrzlh Aug 28, 2025
10ec00b
fix attn_implementation setting in modeling_utils/from_pretrained
wcrzlh Aug 28, 2025
a813cf9
fix attn_implementation setting in modeling_utils/from_pretrained
wcrzlh Aug 28, 2025
cdebac0
fix modeling_utils/from_config mindspore_dtype setting, generation/ut…
wcrzlh Sep 11, 2025
7a20fe1
feat(transformers): add qwen3_vl/qwen3_vl_moe model
wcrzlh Sep 17, 2025
4079e6f
fix moe precision bug
wcrzlh Sep 18, 2025
c1cde3a
fix qwen3_vl moe memory bugs
wcrzlh Sep 19, 2025
721d0a3
supplement zero3 model weight shard for moe part
wcrzlh Sep 23, 2025
e43f3dd
fix qwen3_vl_moe precision bug
wcrzlh Sep 23, 2025
51515b9
fix qwen3_vl_moe precision bug
wcrzlh Sep 23, 2025
25c8110
fix moe part shard bug
wcrzlh Sep 24, 2025
9650f4f
pre-commit
wcrzlh Sep 24, 2025
3771434
reformat
wcrzlh Sep 24, 2025
2d5f9e7
Merge pull request #1310 from wcrzlh/qwen3_vl
vigo999 Sep 24, 2025
fed7ffc
fix(transformers): fix typos in qwen3_vl docs
wcrzlh Sep 24, 2025
f2b56bf
Merge pull request #1311 from wcrzlh/qwen3_vl
vigo999 Sep 24, 2025
3c81df8
feat(transformers): add processor for qwen3_vl (#1326)
wcrzlh Sep 28, 2025
6e6361a
fix(transformers): supplement condition of taking model as processor
wcrzlh Oct 9, 2025
e72e032
fix(transformers): reformat generation/utils
wcrzlh Oct 14, 2025
dcdec6c
fix(transformers): supplement candidate generator
wcrzlh Oct 14, 2025
47ca032
fix(transformers): supplement logits processor
wcrzlh Oct 15, 2025
c6df7fd
feat(transformers): add assisted_generation/dola_generation/contrasiv…
wcrzlh Oct 16, 2025
11ba44c
rebase
wcrzlh Oct 22, 2025
18d35f6
reformat
wcrzlh Oct 22, 2025
8b75291
fix import bug
wcrzlh Oct 22, 2025
9fd221f
fix ut bug
wcrzlh Oct 24, 2025
46639e0
update pyproject.toml
wcrzlh Oct 24, 2025
aa0e7b0
pre-commit
wcrzlh Oct 24, 2025
b11c421
reformat
wcrzlh Oct 24, 2025
bd76d4c
update loss_type
wcrzlh Oct 25, 2025
83 changes: 83 additions & 0 deletions examples/transformers/qwen3_vl/README.md
@@ -0,0 +1,83 @@
# Qwen3-VL series

## Introduction
[Qwen3-VL](https://huggingface.co/papers/2502.13923) is a multimodal vision-language model series, encompassing both dense and MoE variants as well as Instruct and Thinking versions. Building upon its predecessors, Qwen3-VL delivers significant improvements in visual understanding while maintaining strong pure-text capabilities. Key architectural advancements include: an enhanced MRoPE with an interleaved layout for better spatial-temporal modeling; DeepStack integration to effectively leverage multi-level features from the Vision Transformer (ViT); and improved video understanding through text-based time alignment, evolving from T-RoPE to text timestamp alignment for more precise temporal grounding. These innovations collectively enable Qwen3-VL to achieve superior performance on complex multimodal tasks.
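To make the interleaved layout concrete, here is a minimal sketch contrasting the sectioned MRoPE layout of earlier Qwen-VL models with an interleaved one. The channel count and the 6/5/5 split are illustrative assumptions, not the model's actual configuration:

```python
# Minimal sketch of MRoPE frequency-channel layouts; `dim` and the t/h/w split
# are illustrative assumptions, not Qwen3-VL's actual configuration.
dim = 16  # number of rotary frequency channels (illustrative)

# Sectioned (earlier Qwen-VL): contiguous channel blocks per axis.
sectioned = ["t"] * 6 + ["h"] * 5 + ["w"] * 5

# Interleaved (Qwen3-VL): axes alternate channel by channel.
interleaved = [("t", "h", "w")[i % 3] for i in range(dim)]

print("sectioned:  ", " ".join(sectioned))    # t t t t t t h h h h h w w w w w
print("interleaved:", " ".join(interleaved))  # t h w t h w ...
```

The interleaved layout spreads temporal, height, and width positions across the full frequency spectrum instead of confining each axis to one contiguous band.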

## Get Started

### Requirements
| mindspore | ascend driver | firmware | CANN toolkit/kernel |
|-----------|----------------|----------------|--------------------|
| 2.6.0 | 24.1.RC3.b080 | 7.5.T11.0.B088 | 8.1.RC1 |

### Installation
```bash
git clone https://github.com/mindspore-lab/mindone.git -b hf-transformers-4.54
cd mindone
pip install -e .
cd ..

# install transformers from source because Qwen3-VL support (transformers v4.57.dev.0) has not been released yet
git clone https://github.com/huggingface/transformers.git
cd transformers
git reset --hard d0af4269ec260b9c4aeeda24c346a469e44799e1
pip install -e .
cd ..

cd mindone/examples/transformers/qwen3_vl
```
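
To confirm that the source build is active, you can check the reported version (the exact dev version string below is an assumption and may differ):

```bash
# should print a development version, e.g. 4.57.0.dev0
python -c "import transformers; print(transformers.__version__)"
```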

## Quick Start

Here is a usage example of Qwen3-VL-4B-Instruct. You can run it with the following command:

```bash
# for Qwen3-VL-4B-Instruct inference
python generate_qwen3_vl.py \
    --model_name "Qwen/Qwen3-VL-4B-Instruct" \
    --image "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" \
    --prompt "Describe this image."
```

```bash
# for Qwen3-VL-30B-A3B-Instruct inference
msrun --worker_num=2 --local_worker_num=2 --master_port=8118 \
--log_dir=msrun_log --join=True --cluster_time_out=300 \
generate_qwen3_vl_moe.py \
--model_name "Qwen/Qwen3-VL-30B-A3B-Instruct" \
--image "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" \
--prompt "Describe this image." \
```

Image:
![sample image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg)

Prompt: Describe this image.

Qwen3-VL-4B Outputs:
```
['Of course, here is detailed description of the image provided.\n\n
This is a close-up photograph of a Pallas\'s cat ($Felis$, $manul$),
an endangered wild feline species native to Central Aisa.
...
**Appearance:** It has a stocky and robust build with short legs
and a large head relative to its body size. Its fur is thick and dense,
appearing somewhat fluffy or "matted,", which is characteristic']
```

Qwen3-VL-30B Outputs:
```
['Of course, here is detailed description of the image provided.\n\n
This is a dynamic and charming photograph of a Palla's cat (also known as a manul) in a snowy enviroment.
...
"Appearance:" The cat has a very distinctive apperance, characterized by its stocky, low-slung body and exceptionally
thick, dense fur. This coat is a mix of brownish"]
```

`model_name` and `image` can be replaced with your local paths. Give it a try with various images and prompts🤗🤗.
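
For example, with hypothetical local paths (placeholders only; adjust to your environment):

```bash
# --model_name and --image below are illustrative local paths
python generate_qwen3_vl.py \
    --model_name /path/to/Qwen3-VL-4B-Instruct \
    --image ./samples/cat.jpeg \
    --prompt "Describe this image."
```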

## Inference Speed
| model name | mindspore version | precision | cards | attention type | tokens/s |
|:------------------------------:|:-----------------:|:----------:|:-----:|:--------------:|:----------:|
| Qwen/Qwen3-VL-4B-Instruct | 2.6.0 | bf16 | 1 | flash_attn | 1.35 |
| Qwen/Qwen3-VL-30B-A3B-Instruct | 2.6.0 | bf16 | 2 | flash_attn | 0.5 |
79 changes: 79 additions & 0 deletions examples/transformers/qwen3_vl/generate_qwen3_vl.py
@@ -0,0 +1,79 @@
import argparse

import numpy as np

import mindspore as ms

from mindone.transformers import AutoProcessor, Qwen3VLForConditionalGeneration


def generate(args):
model = Qwen3VLForConditionalGeneration.from_pretrained(
args.model_name,
mindspore_dtype=ms.bfloat16,
attn_implementation=args.attn_implementation,
)

processor = AutoProcessor.from_pretrained(
args.model_name,
use_fast=False,
)

messages = [
{
"role": "user",
"content": [
{
"type": "image",
"url": args.image,
},
{
"type": "text",
"text": args.prompt,
},
],
}
]

inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="np"
)

# convert input to Tensor
for key, value in inputs.items():
if isinstance(value, np.ndarray):
inputs[key] = ms.tensor(value)
elif isinstance(value, list):
inputs[key] = ms.Tensor(value)

generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)


if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Qwen3VL demo.")

parser.add_argument("--prompt", type=str, default="Describe this image.")
parser.add_argument(
"--image",
type=str,
default="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
)
parser.add_argument(
"--model_name", type=str, default="Qwen/Qwen3-VL-4B-Instruct", help="Path to the pre-trained model."
)
parser.add_argument(
"--attn_implementation",
type=str,
default="flash_attention_2",
choices=["flash_attention_2", "eager"],
)

# Parse the arguments
args = parser.parse_args()

generate(args)
91 changes: 91 additions & 0 deletions examples/transformers/qwen3_vl/generate_qwen3_vl_moe.py
@@ -0,0 +1,91 @@
import argparse
from functools import partial

import numpy as np

import mindspore as ms
import mindspore.mint.distributed as dist
from mindspore.communication import GlobalComm

from mindone.trainers.zero import prepare_network
from mindone.transformers import AutoProcessor, Qwen3VLMoeForConditionalGeneration


def generate(args):
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
args.model_name,
mindspore_dtype=ms.bfloat16,
attn_implementation=args.attn_implementation,
)

# use zero3 parallel
shard_fn = partial(prepare_network, zero_stage=3, optimizer_parallel_group=GlobalComm.WORLD_COMM_GROUP)
model = shard_fn(model)

processor = AutoProcessor.from_pretrained(
args.model_name,
use_fast=False,
)

messages = [
{
"role": "user",
"content": [
{
"type": "image",
"url": args.image,
},
{
"type": "text",
"text": args.prompt,
},
],
}
]

inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="np"
)

# convert input to Tensor
for key, value in inputs.items():
if isinstance(value, np.ndarray):
inputs[key] = ms.tensor(value)
elif isinstance(value, list):
inputs[key] = ms.Tensor(value)

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)


if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Qwen3VLMoE demo.")

parser.add_argument("--prompt", type=str, default="Describe this image.")
parser.add_argument(
"--image",
type=str,
default="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
)
parser.add_argument(
"--model_name", type=str, default="Qwen/Qwen3-VL-30B-A3B-Instruct", help="Path to the pre-trained model."
)
parser.add_argument(
"--attn_implementation",
type=str,
default="flash_attention_2",
choices=["flash_attention_2", "eager"],
)

# Parse the arguments
args = parser.parse_args()

# set up card communication
dist.init_process_group(backend="hccl")
ms.set_auto_parallel_context(parallel_mode="data_parallel")

generate(args)
3 changes: 3 additions & 0 deletions mindone/models/modules/parallel/__init__.py
@@ -2,6 +2,7 @@

from .conv import Conv1d, Conv2d, Conv3d, Mint_Conv2d, Mint_Conv3d
from .dense import Dense, Linear
from .moe_text_experts import MoeTextExperts

# {Original MindSpore Cell: New Cell in ZeRO3}
PARALLEL_MODULES = {
@@ -14,4 +15,6 @@
mint.nn.Linear: Linear,
}

SPECIAL_CASE_FOR_PARALLEL_MODULES = {nn.Cell: MoeTextExperts}

__all__ = ["Conv1d", "Conv2d", "Conv3d", "Mint_Conv2d", "Mint_Conv3d", "Dense", "Linear"]
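
For context, here is a hypothetical sketch of how these mappings might be consumed when preparing a network for ZeRO3. This is not mindone's actual `prepare_network` logic; `replace_cells` and the `is_moe_experts` predicate are invented names for illustration:

```python
from mindspore import nn

# Hypothetical traversal (NOT the real prepare_network): wrap known cell types
# via PARALLEL_MODULES, fall back to the loosely-keyed special case, and
# recurse into everything else.
def replace_cells(network: nn.Cell, zero_stage: int, group: str) -> nn.Cell:
    for name, cell in network.name_cells().items():
        wrapper_cls = PARALLEL_MODULES.get(type(cell))
        if wrapper_cls is None and getattr(cell, "is_moe_experts", False):
            # the special-case map is keyed on nn.Cell itself, so a predicate
            # (here an invented attribute) must pick out qualifying cells
            wrapper_cls = SPECIAL_CASE_FOR_PARALLEL_MODULES[nn.Cell]
        if wrapper_cls is not None:
            network.insert_child_to_cell(name, wrapper_cls(cell, zero_stage, group))
        else:
            replace_cells(cell, zero_stage, group)  # recurse into children
    return network
```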
70 changes: 70 additions & 0 deletions mindone/models/modules/parallel/moe_text_experts.py
@@ -0,0 +1,70 @@
from typing import Literal, Optional

from mindspore import Tensor
from mindspore import dtype as mstype
from mindspore import mint, nn
from mindspore.communication import get_group_size, get_rank
from mindspore.communication.management import GlobalComm
from mindspore.context import ParallelMode
from mindspore.parallel._utils import _get_parallel_mode

from .param_wrapper import ZeroParamWrapper


class MoeTextExperts(nn.Cell):
def __init__(
self,
net: nn.Cell,
zero_stage: Literal[0, 1, 2, 3] = 0,
optimizer_parallel_group: str = GlobalComm.WORLD_COMM_GROUP,
cell_type: Optional[mstype.Type] = None,
):
super().__init__(auto_prefix=False)
self.net = net
self.set_param_wrapper(zero_stage, optimizer_parallel_group, cell_type)

def set_param_wrapper(self, zero_stage, optimizer_parallel_group, cell_type=None):
self.param_wrapper_gate_up_proj = nn.Identity()
self.param_wrapper_down_proj = nn.Identity()
if zero_stage == 3:
# Init parallel settings
is_parallel = _get_parallel_mode() == ParallelMode.DATA_PARALLEL
op_group_size = get_group_size(optimizer_parallel_group) if is_parallel else 1
op_rank_id = get_rank(optimizer_parallel_group) if is_parallel else 0
self.op_group_size = op_group_size
self.op_rank_id = op_rank_id
self.param_wrapper_gate_up_proj = ZeroParamWrapper(
self.net.gate_up_proj, zero_stage, optimizer_parallel_group, cell_type
)
if self.param_wrapper_gate_up_proj.need_rewrite:
self.net.gate_up_proj.assign_value(
Tensor.from_numpy(
self.net.gate_up_proj.numpy().reshape(op_group_size, -1, *self.net.gate_up_proj.shape[1:])[
op_rank_id
]
)
)
self.param_wrapper_down_proj = ZeroParamWrapper(
self.net.down_proj, zero_stage, optimizer_parallel_group, cell_type
)
if self.param_wrapper_down_proj.need_rewrite:
self.net.down_proj.assign_value(
Tensor.from_numpy(
self.net.down_proj.numpy().reshape(op_group_size, -1, *self.net.down_proj.shape[1:])[op_rank_id]
)
)

def construct(self, hidden_states, routing_weights, router_indices):
batch_size = hidden_states.shape[0]
hidden_states = hidden_states.reshape(-1, self.net.hidden_size) # (num_tokens, hidden_size)

hidden_states = hidden_states.repeat(self.net.num_experts, 1)
hidden_states = hidden_states.view(self.net.num_experts, -1, self.net.hidden_size)

gate_up = mint.bmm(hidden_states, self.param_wrapper_gate_up_proj(self.net.gate_up_proj))
gate, up = gate_up.chunk(2, dim=-1) # not supported for DTensors
next_states = mint.bmm((up * self.net.act_fn(gate)), self.param_wrapper_down_proj(self.net.down_proj))
next_states = next_states.reshape(self.net.num_experts, batch_size, -1, self.net.hidden_size)
next_states = next_states * routing_weights.swapaxes(0, 1).view(self.net.num_experts, batch_size, -1)[..., None]
next_states = next_states.sum(dim=0)
return next_states
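
To make the shapes in `construct` concrete, here is a numpy sketch of the same dense-MoE computation under assumed sizes (all numbers illustrative; the activation is assumed to be SiLU): every token is run through every expert, and the expert outputs are then mixed by the routing weights.

```python
import numpy as np

# numpy sketch of the MoeTextExperts.construct math; sizes are illustrative
E, B, S, H, I = 4, 2, 3, 8, 16  # experts, batch, seq, hidden, intermediate

x = np.random.randn(B * S, H)            # flattened tokens, (num_tokens, hidden)
gate_up = np.random.randn(E, H, 2 * I)   # per-expert fused gate/up weights
down = np.random.randn(E, I, H)          # per-expert down projection
routing = np.random.rand(E, B * S)       # routing weight of each expert per token

h = np.broadcast_to(x, (E, B * S, H))    # replicate every token for every expert
gu = h @ gate_up                         # (E, num_tokens, 2I), batched matmul
gate, up = gu[..., :I], gu[..., I:]      # the chunk(2, dim=-1) split
act = up * gate / (1.0 + np.exp(-gate))  # up * SiLU(gate), assumed act_fn
out = (act @ down) * routing[..., None]  # (E, num_tokens, H), weighted per expert
y = out.sum(axis=0).reshape(B, S, H)     # mix experts, restore (batch, seq, hidden)
print(y.shape)  # (2, 3, 8)
```

Under ZeRO3 (the `need_rewrite` branch above), each rank keeps only its dim-0 slice of `gate_up_proj` and `down_proj`, and the `ZeroParamWrapper` calls in `construct` reassemble the full weights when they are needed.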