Help needed with a question about train_ppo.py #676

armtt · 2026-03-01T06:35:09Z


A custom Critic model (value network) that inherits from the base language model MiniMindLM.

Its job is to estimate the value V(s) of the current generation state.

class CriticModel(MiniMindForCausalLM):
    def __init__(self, params):
        super().__init__(params)
        # Use a linear value head in place of the original language-model output head
        # (lm_head, which projects to the vocabulary size).
        # This layer maps each hidden state to a single scalar, the value of that state.
        self.value_head = nn.Linear(params.hidden_size, 1)

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        # 1. Forward pass: run the base Transformer model to get the hidden states of all tokens
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, **kwargs)
        # Take the last layer's hidden states and apply the final normalization layer
        hidden_states = self.model.norm(outputs[0])
        # 2. Value prediction: feed the hidden states to value_head and drop the last dimension, (B, SeqLen, 1) -> (B, SeqLen)
        values = self.value_head(hidden_states).squeeze(-1)
        return values

Why does hidden_states = self.model.norm(outputs[0]) take outputs[0] as its input rather than outputs?

jingyaogong · 2026-03-01T16:36:31Z

Maintainer

#608

This is probably what you meant to ask.
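For readers without access to #608, here is a minimal sketch of one common reason for indexing with [0]. It assumes the backbone's forward returns a tuple whose first element is the hidden-states tensor; the actual return signature of the MiniMind backbone is not shown in this thread. ToyBackbone and ToyCritic are hypothetical stand-ins written for illustration, not MiniMind code, and nn.LayerNorm is used here only as a placeholder for the model's own norm layer.

```python
import torch
import torch.nn as nn


class ToyBackbone(nn.Module):
    """Hypothetical stand-in for the Transformer backbone (not MiniMind's code)."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)  # placeholder for the real norm layer

    def forward(self, input_ids, attention_mask=None):
        hidden_states = self.embed(input_ids)
        # Assumption: forward returns a tuple with the hidden states first,
        # followed by extra items (here a None placeholder for a KV cache).
        return hidden_states, None


class ToyCritic(nn.Module):
    """Hypothetical critic mirroring the structure of the snippet in the question."""

    def __init__(self, vocab_size: int = 32, hidden_size: int = 8):
        super().__init__()
        self.model = ToyBackbone(vocab_size, hidden_size)
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        # outputs is a tuple, not a tensor, so the norm layer cannot consume it
        # directly; outputs[0] is the (B, SeqLen, Hidden) tensor it expects.
        hidden_states = self.model.norm(outputs[0])
        return self.value_head(hidden_states).squeeze(-1)  # (B, SeqLen)


critic = ToyCritic()
input_ids = torch.randint(0, 32, (2, 5))  # batch of 2 sequences, 5 tokens each
outputs = critic.model(input_ids=input_ids)
values = critic(input_ids)
print(type(outputs).__name__, tuple(outputs[0].shape), tuple(values.shape))
# prints: tuple (2, 5, 8) (2, 5)
```

Under this assumption, outputs bundles several return values together, and only its first element has the tensor shape that the norm layer and value_head operate on. Whether this matches the real backbone should be checked against the model's forward method and the discussion in #608.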
