[model] Support deepseek-v4 #86
Conversation
There was a problem hiding this comment.
Code Review
This pull request adds documentation for deepseek_v4 support and updates the transformer layer to integrate with newer Megatron-Core offloading interfaces. Feedback indicates that the offloading implementation is incomplete, as it fails to properly initialize the normalization managers, and the local import should be moved for efficiency. Additionally, the implementation for the deepseek_v4 model appears to be missing from the current changes.
| if hasattr(self, '_set_offload_modules'): | ||
| from megatron.core.transformer.transformer_layer import _get_offloading_interface | ||
| self._set_offload_modules() | ||
| self.off_interface = _get_offloading_interface() | ||
| self.mlp_norm_manager = None |
There was a problem hiding this comment.
The initialization of offloading managers for Megatron-Core 0.17+ is incomplete. Setting self.mlp_norm_manager = None without assigning a manager from self.off_interface effectively disables offloading for the MLP layer normalization, even when it is configured in offload_modules. Additionally, self.attn_norm_manager should also be initialized to avoid potential AttributeError in base class methods that expect it to be present in newer versions of Megatron-Core.
Also, the local import of _get_offloading_interface inside __init__ is inefficient as it executes for every layer instantiation; consider moving it to the top of the file if possible.
if hasattr(self, '_set_offload_modules'):
from megatron.core.transformer.transformer_layer import _get_offloading_interface
self._set_offload_modules()
self.off_interface = _get_offloading_interface()
offload_modules = getattr(self.config, 'offload_modules', []) or []
is_offloading = getattr(self.config, 'fine_grained_activation_offloading', False)
self.attn_norm_manager = self.off_interface.get_manager('attn_norm') if is_offloading and 'attn_norm' in offload_modules else None
self.mlp_norm_manager = self.off_interface.get_manager('mlp_norm') if is_offloading and 'mlp_norm' in offload_modules else None| | -------- | ------------------------------------------------------------ | | ||
| | Qwen | qwen2, qwen2_moe<br />qwen3, qwen3_moe, qwen3_next | | ||
| | DeepSeek | deepseek_v3, deepseek_v32 | | ||
| | DeepSeek | deepseek_v3, deepseek_v32, deepseek_v4 | |
There was a problem hiding this comment.
The PR adds deepseek_v4 to the list of supported models, but the actual implementation appears to be missing. The file src/mcore_bridge/model/gpts/deepseek_v4.py is empty in the provided context, and there are no changes to model registration or configuration logic to support this new model type. Please ensure the implementation is included or clarify if it relies on an existing model type.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for the DeepSeek-V4 model, implementing its hybrid attention mechanism, hyper-connections, and specific configuration mappings. Key changes include updates to the state dict conversion logic, model configuration parsing, and the transformer layer's RoPE application. The reviewer identified several critical issues, including a regression in backward compatibility for existing models due to modified default argument values and a potential break in RoPE functionality when hyper-connections are enabled. Feedback also highlighted risks of runtime errors during configuration parsing and tensor concatenation, alongside a recommendation to use deep copies for module instances to prevent shared state issues.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for the DeepSeek-V4 model, incorporating new configuration parameters, state dict conversion logic for hyper-connections and hash layers, and a specialized hybrid attention implementation. It also refactors RoPE application to support inverse operations and interleaving removal. Feedback focused on a critical bug where shallow copying RotaryEmbedding objects causes incorrect positional embeddings due to shared caches, and a logic error in model registration that disables necessary RoPE patching when hyper-connections are enabled. Additionally, improvements were suggested to prevent potential runtime errors during state dict conversion by handling missing attributes and null tensors.
|
TODO: MTP; FP4/FP8 load/save; shell |

huggingface/transformers#45643
NVIDIA/Megatron-LM#4458
modelscope/ms-swift#9386