
Conversation

@Fzilan (Collaborator) commented Oct 25, 2025

Based on #1387 (the transformers 4.54.1 base PR).
Add the smollm3 model.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guidelines?
  • Did you make sure to update the documentation with your changes? E.g., record bug fixes or new features in What's New (see the documentation guidelines).
  • Did you build and run the code without any errors?
  • Did you report the running environment (NPU type/MS version) and performance in the doc? (It is best to record this for data loading, model inference, or training tasks.)
  • Did you write any new necessary tests?

wcrzlh and others added 30 commits August 18, 2025 15:22
@Fzilan Fzilan requested a review from vigo999 as a code owner October 25, 2025 04:03
@gemini-code-assist bot commented

Summary of Changes

Hello @Fzilan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for the Qwen3-VL multimodal vision-language model series within the mindone.transformers library. It includes the integration of both dense and Mixture-of-Experts (MoE) model architectures, accompanied by dedicated processing utilities for images and videos. Key architectural advancements like DeepStack integration and specialized parallelization for MoE layers are incorporated. The changes also involve substantial refactoring and updates to core library components, such as activation functions, image processing, and attention masking, to ensure robust functionality and future compatibility.

Highlights

  • New Qwen3-VL Model Integration: Added comprehensive support for the Qwen3-VL multimodal vision-language model series, encompassing both dense and Mixture-of-Experts (MoE) variants. This includes dedicated model architectures and processing utilities.
  • Multimodal Processing Enhancements: Implemented advanced processing for image and video inputs, featuring special token handling, DeepStack integration for leveraging multi-level Vision Transformer features, and improved temporal grounding for video understanding.
  • Parallelization for MoE Models: Introduced MoeTextExperts and updated mindone.trainers.zero to enable efficient Zero-3 parallelization, specifically tailored for Mixture-of-Experts layers within the Qwen3-VL architecture.
  • Core Library Refinements: Performed significant updates across core mindone.transformers components, including activation functions, image processing utilities, attention masking, and generation mechanisms, to enhance MindSpore compatibility and overall flexibility.
  • Updated Transformers Version: The internal mindone.transformers version has been updated to 4.54.1, reflecting broader compatibility and feature updates across the library.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the Qwen3-VL model and includes a massive synchronization with the upstream transformers library (v4.54.1). The changes are extensive, touching many parts of the codebase, including model implementations, utility functions, and examples. The refactoring efforts to adopt newer APIs (like mint and Cache objects) and Python 3.9+ typing are commendable and improve code quality. I've identified a few issues, mainly related to typos in documentation and a potentially fragile implementation detail in the ZeRO parallelism logic. My main concern is the change in mindone/trainers/zero.py, which uses a parameter name check that could be brittle. Overall, this is a significant and valuable update.

Comment on lines +483 to +484

```python
if net.trainable_params():
    if "gate_up_proj" in net.trainable_params()[0].name:
```

Severity: high

The condition if "gate_up_proj" in net.trainable_params()[0].name: is quite fragile as it depends on the name of the first trainable parameter. This could easily break if the model architecture or parameter order changes. A more robust approach would be to check for the attribute directly on the network object, for instance, using hasattr(net, 'gate_up_proj').
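
For illustration, here is a minimal sketch of the attribute-based check (the function names are hypothetical, not the actual mindone/trainers/zero.py code):

```python
# Hypothetical sketch, not the actual mindone implementation: contrasting the
# fragile name-based test with the attribute-based check suggested above.
def is_moe_experts_cell_fragile(net) -> bool:
    # Current approach: breaks if the parameter order changes or the cell
    # exposes no trainable parameters at all.
    params = net.trainable_params()
    return bool(params) and "gate_up_proj" in params[0].name

def is_moe_experts_cell(net) -> bool:
    # Suggested approach: does not depend on parameter naming or ordering.
    return hasattr(net, "gate_up_proj")
```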

```python
# if there are no pad tokens present, then add eos to the end
modified_input_ids[i] = mint.nn.functional.pad(each_input_id, (0, 1), value=eos_token_id)
```

Severity: high

The logic for handling padding tokens when adding an EOS token has been removed. The new implementation unconditionally pads the eos_token_id at the end of the sequence. If input_ids can contain padding tokens, this change might lead to incorrect placement of the EOS token (i.e., after padding). Please confirm if input_ids passed to this function are guaranteed not to have padding tokens. If they can contain padding, the original logic to insert the EOS token before the first padding token should be restored to prevent this potential issue.
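
For reference, a minimal sketch of pad-aware EOS insertion (the function name, signature, and pad-handling details are assumptions, not the original implementation):

```python
import mindspore as ms
from mindspore import mint

# Assumed sketch of the pad-aware behavior described above: if padding is
# present, the EOS token is placed right after the last real token instead
# of being appended after the pad tokens.
def add_eos(each_input_id: ms.Tensor, eos_token_id: int, pad_token_id: int) -> ms.Tensor:
    pad_positions = (each_input_id == pad_token_id).nonzero()
    if pad_positions.shape[0] == 0:
        # No pad tokens: appending EOS at the end is safe, as the new code does.
        return mint.nn.functional.pad(each_input_id, (0, 1), value=eos_token_id)
    # Pad tokens present: overwrite the first pad position with EOS.
    each_input_id[int(pad_positions[0, 0])] = eos_token_id
    return each_input_id
```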

Comment on lines +61 to +74

```
an endangered wild feline species native to Central Aisa.
...
**Appearance:** It has a stocky and robust build with short legs
and a large head relative to its body size. Its fur is thick and dense,
appearing somewhat fluffy or "matted,", which is characteristic']
```

Qwen3-VL-30B Outputs:

```
['Of course, here is detailed description of the image provided.\n\n
This is a dynamic and charming photograph of a Palla's cat (also known as a manul) in a snowy enviroment.
...
"Appearance:" The cat has a very distinctive apperance, characterized by its stocky, low-slung body and exceptionally
thick, dense fur. This coat is a mix of brownish"]
```

Severity: medium

I've noticed a few typos in the model output examples. Correcting them will improve the documentation's clarity and professionalism.

  • Line 61: Aisa should be Asia.
  • Line 71: Palla's should be Pallas's and enviroment should be environment.
  • Line 73: apperance should be appearance.
