GRPO loss fix #535

vwxyzjn · 2025-01-30T15:27:19Z

CC @gauravpandeyamu @hamishivi for a second look.

I think pg_loss = pg_loss + (args.beta * kl).mean() is incorrect: the mean should be taken only after each token's loss has been added to its corresponding KL term, and it should skip padding tokens.

The correct implementation should be:

pg_loss = masked_mean(pg_loss_max + (args.beta * kl), ~padding_mask[micro_batch_inds])
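To make the difference concrete, here is a minimal pure-Python sketch with toy numbers. It assumes the previous pg_loss was already a masked mean of pg_loss_max, which is not shown in this thread. The masked_mean helper below is a hypothetical list-based stand-in for the real tensor helper, response_mask stands in for ~padding_mask[micro_batch_inds], and beta stands in for args.beta.

```python
def masked_mean(values, mask):
    """Mean of values over positions where mask is True (list-of-lists stand-in)."""
    total, count = 0.0, 0
    for row_vals, row_mask in zip(values, mask):
        for v, m in zip(row_vals, row_mask):
            if m:
                total += v
                count += 1
    return total / count


beta = 0.05  # hypothetical stand-in for args.beta

# Two responses padded to length 4; False marks a padding position.
response_mask = [
    [True, True, True, True],
    [True, True, False, False],
]
pg_loss_max = [
    [0.2, 0.4, 0.1, 0.3],
    [0.5, 0.3, 0.0, 0.0],
]
kl = [
    [0.1, 0.2, 0.1, 0.2],
    [0.3, 0.1, 0.0, 0.0],
]

# Old form: KL is averaged over all 8 positions, padding included.
num_positions = sum(len(row) for row in kl)
unmasked_kl_mean = sum(beta * k for row in kl for k in row) / num_positions
old_loss = masked_mean(pg_loss_max, response_mask) + unmasked_kl_mean

# Proposed form: add the KL penalty per token, then average over the 6 real tokens.
per_token = [
    [p + beta * k for p, k in zip(p_row, k_row)]
    for p_row, k_row in zip(pg_loss_max, kl)
]
new_loss = masked_mean(per_token, response_mask)

print(f"old: {old_loss:.6f}  new: {new_loss:.6f}")
```

Even with zero KL at the padding positions, the old form divides the KL sum by the padded length, so the penalty shrinks as padding grows.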

Let me know what you think.

gauravpandeyamu · 2025-01-30T16:13:54Z

Yes, masked_mean is indeed correct to use here.

Also check my comment at #534 (comment)

The current kl1 estimator will completely ignore the ref_logprobs during training. kl2 and kl3 are fine.
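A small sketch of why this happens. The estimator definitions below are an assumption (the usual k1/k2/k3 forms), since the code is not quoted in this thread. With those forms, the derivative of kl1 with respect to the policy log-prob is constant, so ref_logprobs drops out of the gradient when the estimator is added directly to the loss. grad_wrt_logp is a hypothetical finite-difference helper for illustration only.

```python
import math


# Assumed estimator forms; not copied from the PR.
def kl1(logp, ref_logp):
    return logp - ref_logp


def kl2(logp, ref_logp):
    return (logp - ref_logp) ** 2 / 2


def kl3(logp, ref_logp):
    diff = logp - ref_logp
    return math.expm1(-diff) + diff


def grad_wrt_logp(estimator, logp, ref_logp, eps=1e-6):
    """Central finite-difference derivative with respect to the policy log-prob."""
    upper = estimator(logp + eps, ref_logp)
    lower = estimator(logp - eps, ref_logp)
    return (upper - lower) / (2 * eps)


logp = -1.0
for ref_logp in (-0.5, -2.0):
    grads = {f.__name__: grad_wrt_logp(f, logp, ref_logp) for f in (kl1, kl2, kl3)}
    print(ref_logp, grads)
```

Under these assumed definitions, the kl1 derivative stays the same for both reference values, while the kl2 and kl3 derivatives change with ref_logp.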

vwxyzjn · 2025-01-31T17:40:44Z

hamishivi

Looks good to me. Nice work on the metrics improvement!

hamishivi · 2025-01-31T17:41:41Z

open_instruct/grpo_vllm_thread_ray_gtrl.py

@@ -487,6 +487,34 @@ def remove_padding(sequences, pad_token_id):
    return [[inneritem for inneritem in item if inneritem != pad_token_id] for item in sequences]


+class MetricsTracker:



GRPO loss fix

9985ef8

vwxyzjn requested review from natolambert and hamishivi January 30, 2025 15:27

vwxyzjn added 2 commits January 31, 2025 09:39

Make kl3 estimator work

d174c59

quick change

c4163a7

hamishivi approved these changes Jan 31, 2025

add a shared chatgpt link

8481f2a

vwxyzjn merged commit 79ff8fb into main Jan 31, 2025
3 checks passed
