
Conversation

@xiongjyu (Collaborator)

No description provided.

llm_sft_loss = torch.tensor(0.0, device=self._cfg.device)
if self.llm_policy_cfg.enable_llm and self.llm_policy_cfg.enable_rft:
    with self._profile_block(name="train_llm_rft"):
        llm_rft_loss = self.compute_rft_loss(
Collaborator (Author):
As I understand it, this target_value should correspond one-to-one with observation, with no offset, right? E.g., the first value of target_value represents the result for the first state in obs.
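
A minimal sanity check one could add to verify that one-to-one assumption; `check_alignment`, `obs_batch`, and the toy data below are hypothetical placeholders, not names from this PR:

```python
def check_alignment(obs_batch, target_value):
    # Hypothetical sanity check: target_value[i] should describe obs_batch[i],
    # i.e. the two are index-aligned with no one-step offset.
    assert len(obs_batch) == len(target_value), (
        f"misaligned: {len(obs_batch)} observations vs "
        f"{len(target_value)} targets"
    )

# Toy usage with assumed data.
check_alignment(obs_batch=[0, 1, 2], target_value=[0.1, 0.2, 0.3])
```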

sequence_log_probs = token_log_probs.sum(dim=-1) / (mask.sum(dim=-1) + 1e-8)
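
For context, a self-contained sketch of the masked length normalization in the line above; the tensor shapes and toy values are assumptions, and padded positions are assumed to already carry 0.0 log-prob, since the original line sums token_log_probs without re-applying the mask:

```python
import torch

# Toy batch: 2 sequences, 4 token positions (shapes and values assumed).
token_log_probs = torch.tensor([[-0.5, -1.0, -0.2, -0.3],
                                [-0.7, -0.4,  0.0,  0.0]])
mask = torch.tensor([[1.0, 1.0, 1.0, 1.0],
                     [1.0, 1.0, 0.0, 0.0]])  # 1.0 = real token, 0.0 = padding

# Mean log-prob per real token; the 1e-8 avoids division by zero for
# sequences whose mask is all zeros.
sequence_log_probs = token_log_probs.sum(dim=-1) / (mask.sum(dim=-1) + 1e-8)
print(sequence_log_probs)  # tensor([-0.5000, -0.5500])
```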

if self.llm_policy_cfg.rft_reward == 'value':
    rewards_tensor = torch.tensor(batch_values, device=self._cfg.device, dtype=torch.float32)
Collaborator:

How about renaming rewards_tensor to advantage_tensor?

Collaborator (Author):

ok
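
For readers following along, a sketch of how the renamed tensor would typically enter a REINFORCE-style RFT loss, which motivates the "advantage" naming; the loss form and all names other than those quoted above are assumptions, not the PR's actual implementation:

```python
import torch

def rft_loss_sketch(sequence_log_probs: torch.Tensor,
                    advantage_tensor: torch.Tensor) -> torch.Tensor:
    # REINFORCE-style objective: weight each sequence's (length-normalized)
    # log-probability by its advantage and minimize the negated mean.
    # Detaching keeps gradients from flowing through the advantage estimate.
    return -(advantage_tensor.detach() * sequence_log_probs).mean()

# Toy usage with assumed values.
seq_lp = torch.tensor([-0.50, -0.55], requires_grad=True)
adv = torch.tensor([0.8, -0.3])
loss = rft_loss_sketch(seq_lp, adv)
loss.backward()
```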

@xiongjyu xiongjyu deleted the branch opendilab:dev-multitask-balance-clean-rft November 24, 2025 14:28
@xiongjyu xiongjyu closed this Nov 24, 2025
@xiongjyu xiongjyu reopened this Nov 24, 2025
@puyuan1996 puyuan1996 added the research Research work in progress label Nov 28, 2025