
GRPO questions #2608

Open
natolambert opened this issue Jan 22, 2025 · 9 comments
Labels
🏋 GRPO Related to GRPO ❓ question Seeking clarification or more information

Comments

@natolambert
Contributor

Hey friends! I have some questions on the GRPO implementation, happy to discuss.

  1. It looks like you apply the KL distance in the advantages, while the DeepSeekMath paper says “Also note that, instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of Â”
  2. Did any thought go into making this a sum of the loss rather than a mean? We aren't sure about this line:
    loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
  3. I didn’t see the PPO clipping logic in the policy gradient loss, coming soon?
@github-actions github-actions bot added 🏋 PPO Related to PPO ❓ question Seeking clarification or more information labels Jan 22, 2025
@August-murr August-murr added 🏋 GRPO Related to GRPO and removed 🏋 PPO Related to PPO labels Jan 23, 2025
@qgallouedec
Member

qgallouedec commented Jan 23, 2025

Hey @natolambert!!

  1. It looks like you apply the KL distance in the advantages, while the DeepSeekMath paper says “Also note that, instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of Â”

Here's how I understand it:
In PPO, the KL div is subtracted from the reward (adding KL penalty in the reward)

# 4. compute rewards
kl = logprobs - ref_logprobs
non_score_reward = -args.kl_coef * kl
rewards = non_score_reward.clone()
actual_start = torch.arange(rewards.size(0), device=rewards.device)
actual_end = torch.where(sequence_lengths_p1 < rewards.size(1), sequence_lengths_p1, sequence_lengths)
rewards[[actual_start, actual_end]] += scores

Here, in GRPO, the advantage is computed without the KL term. It's just the output of the reward function, normalised per group:

# Compute grouped-wise rewards
mean_grouped_rewards = rewards.view(-1, self.num_generations).mean(dim=1)
std_grouped_rewards = rewards.view(-1, self.num_generations).std(dim=1)
# Normalize the rewards to compute the advantages
mean_grouped_rewards = mean_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
std_grouped_rewards = std_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
advantages = (rewards - mean_grouped_rewards) / (std_grouped_rewards + 1e-4)

and later you subtract the KL term

per_token_loss = -(advantages - self.beta * per_token_kl)

I don't know how complicated it would have been to integrate KL into the reward. It would probably look something like

# Subtract KL
rewards  = rewards - self.beta * per_token_kl  # <- THIS IS ADDED

# Compute grouped-wise rewards
mean_grouped_rewards = rewards.view(-1, self.num_generations).mean(dim=1)
std_grouped_rewards = rewards.view(-1, self.num_generations).std(dim=1)

# Normalize the rewards to compute the advantages
mean_grouped_rewards = mean_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
std_grouped_rewards = std_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
advantages = (rewards - mean_grouped_rewards) / (std_grouped_rewards + 1e-4)

# x - x.detach() allows for preserving gradients from x
advantages = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)
per_token_loss = -advantages  # <- THIS IS MODIFIED

which seems pretty simple. But perhaps they mean that the subsequent equations and calculations would have been more complicated.
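For reference, my reading of the paper's outcome-supervision setting is that the advantage is just the group-normalized reward, and the "KL in the reward" variant sketched above would fold a KL term into $r$ before this normalization:

$$\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}$$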

@qgallouedec
Member

qgallouedec commented Jan 23, 2025

  2. Did any thought go into making this a sum of the loss rather than a mean?

No. What is the underlying intuition? Something like this?

loss = ((per_token_loss * completion_mask).sum(dim=1)).mean() 
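For what it's worth, a tiny toy comparison of the two aggregations (made-up tensors, not trainer code): the current per-sequence mean gives every completion the same weight regardless of length, while the sum variant weights longer completions more.

import torch

# Toy example: 2 completions with lengths 2 and 4, uniform per-token loss of 1.0
per_token_loss = torch.ones(2, 4)
completion_mask = torch.tensor([[1, 1, 0, 0],
                                [1, 1, 1, 1]], dtype=torch.float32)

# Current implementation: per-sequence mean, then batch mean -> 1.0 for both lengths
mean_agg = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()

# Sum variant: longer completions contribute more tokens -> 3.0 here
sum_agg = ((per_token_loss * completion_mask).sum(dim=1)).mean()

print(mean_agg.item(), sum_agg.item())  # 1.0 3.0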

@qgallouedec
Member

qgallouedec commented Jan 23, 2025

  3. I didn’t see the PPO clipping logic in the policy gradient loss, coming soon?

In the current implementation, we just update once after each generation. In fact, we align with this sentence from the paper:

The policy model only has a single update following each exploration stage.

It therefore implies that $\pi_{\theta_{\text{old}}} = \pi_\theta$, and the equation

$$\mathcal{J}_{\text{GRPO}}(\theta) =\frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left[\min \left(\frac{\pi_\theta(o_{i,t} | q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} | q, o_{i,< t})} \hat{A}_{i,t}, \text{clip}\left(\frac{\pi_\theta(o_{i,t} | q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} | q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon\right) \hat{A}_{i,t}\right) - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right]\right]$$

can be simplified to

$$\mathcal{J}_{\text{GRPO}}(\theta) = \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ \frac{\pi_\theta(o_{i,t} | q, o_{i,< t})}{\left[\pi_\theta(o_{i,t} | q, o_{i,< t})\right]_\cancel{\nabla}} \hat{A}_{i,t} - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right].$$

But we could support multiple updates after each generation, and that would require this PPO clipping logic. It would probably allow reusing generations and being more sample-efficient. On the other hand, it would probably require a rather hard-to-read implementation, since the optimization step is performed in the parent trainer class. It would look something like:

def __init__(self, ...):
    ...
    # [prompt0, prompt1] -> [prompt0, prompt0, prompt0, prompt1, prompt1, prompt1]
    self.train_dataset = repeat_interleave(train_dataset, self.num_grpo_iterations)

def compute_loss(self, model, inputs):
    if self.step % self.num_grpo_iterations == 0:  # self.num_grpo_iterations is 𝜇 in the paper
        completions = model.generate(prompts)
        self.old_log_probs = model(cat(prompts, completions))

    log_probs = model(cat(prompts, completions))
    log_ratio = log_probs - self.old_log_probs
    ratio = torch.exp(log_ratio)
    losses = torch.min(ratio * advantages, torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages)
    losses = losses - beta * kl

not sure if it's worth having this extra complexity.
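If it helps, here is a minimal, self-contained sketch of just the clipped token loss (hypothetical tensor names and a free function, not the trainer's actual API):

import torch

def clipped_grpo_token_loss(per_token_logps, old_per_token_logps, per_token_kl, advantages, beta=0.04, epsilon=0.2):
    # Probability ratio between the current policy and the sampling (old) policy, per token
    ratio = torch.exp(per_token_logps - old_per_token_logps)
    # Broadcast the per-sequence advantage over the token dimension
    adv = advantages.unsqueeze(1)
    # PPO-style clipped surrogate
    surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * adv)
    # KL regularizer added directly to the loss, as in the single-update case
    return -(surrogate - beta * per_token_kl)

The masking and per-sequence averaging would presumably stay the same as in the current compute_loss.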

@SeunghyunSEO

SeunghyunSEO commented Jan 23, 2025

[Quotes @qgallouedec's first reply above in full.]

I think there is a miscommunication. @natolambert pointed out that the KL term should not be folded into the advantage term, following the original GRPO formulation, and the trl implementation already follows the paper well. But the variable name makes readers confused.

# Normalize the rewards to compute the advantages
mean_grouped_rewards = mean_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
std_grouped_rewards = std_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
advantages = (rewards - mean_grouped_rewards) / (std_grouped_rewards + 1e-4)

# x - x.detach() allows for preserving gradients from x
advantages = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)
per_token_loss = -(advantages - self.beta * per_token_kl)
loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()

Here, per_token_loss in the last line means per_token_prob * A - kld, not A - kld, so the current implementation is right.
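In equation form (my notation, matching the simplified objective above), that last line computes

$$\ell_{i,t} = -\left( \frac{\pi_\theta(o_{i,t} | q, o_{i,< t})}{\left[\pi_\theta(o_{i,t} | q, o_{i,< t})\right]_{\cancel{\nabla}}} \hat{A}_{i,t} - \beta\, \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right),$$

with the first factor equal to 1 in value but carrying the policy gradient.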

And for the second question, I think we should take the mean over both the group dimension and the token dimension, following the paper, so the current implementation LGTM again.

[image: the GRPO objective from the paper]

P.S. The absence of clipping also makes sense: the current implementation does not iteratively update the policy on trajectories sampled from an old policy, so the probability ratio is always 1 (the log ratio is always 0).
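A tiny standalone check of that last point (toy values, nothing from the trainer): the ratio against a detached copy of the same log-probs is exactly 1 in the forward pass, but it still carries the policy gradient.

import torch

per_token_logps = torch.tensor([-1.2, -0.7, -2.3], requires_grad=True)
advantages = torch.tensor([0.5, 0.5, 0.5])

# Ratio against a frozen copy of the current policy: exactly 1 in the forward pass
ratio = torch.exp(per_token_logps - per_token_logps.detach())
print(ratio)  # tensor([1., 1., 1.], grad_fn=...)

# ...but the backward pass still sees d(ratio * A)/d(logp) = A
loss = -(ratio * advantages).sum()
loss.backward()
print(per_token_logps.grad)  # tensor([-0.5000, -0.5000, -0.5000])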

@qgallouedec
Member

qgallouedec commented Jan 23, 2025

So this instead (just a renaming)?:

# x - x.detach() allows for preserving gradients from x
per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)
per_token_loss = -(per_token_loss - self.beta * per_token_kl)
loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()

Maybe more aligned with the formulation

directly adding the KL divergence between the trained policy and the reference policy to the loss

@SeunghyunSEO

[Quotes @qgallouedec's suggested renaming above.]

lgtm :)

@qgallouedec
Member

#2616

@natolambert
Contributor Author

natolambert commented Jan 23, 2025

Yeah, I'm mostly aligned now (I haven't fully checked the math), but it seems like, practically, it's because you don't do minibatches that you can get away with it? Do you have a line-by-line derivation (lol, I will ask Claude)? 👀

Thanks @SeunghyunSEO for the snipe on the variable names, that's what caught me up. Looks fine now and thanks for updating it.

EDIT: okay, I see. If they are the same, then the min and clip become redundant (the ratio is clipped to 1, and the min is over two identical quantities), so it simplifies to what you have.
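Spelling out that substitution in the paper's notation: with a single update per generation the ratio is exactly 1, so

$$\min\left(1 \cdot \hat{A}_{i,t},\ \text{clip}\left(1, 1 - \epsilon, 1 + \epsilon\right) \hat{A}_{i,t}\right) = \min\left(\hat{A}_{i,t}, \hat{A}_{i,t}\right) = \hat{A}_{i,t},$$

which is why only the straight-through ratio and the KL term remain in the simplified objective above.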

@natolambert
Contributor Author

For example, you can also see our implementation in Open Instruct, which we just added: allenai/open-instruct#523
