Skip to content

Conversation

@lchu6
Copy link
Contributor

@lchu6 lchu6 commented Jun 18, 2025

No description provided.

@lchu6 lchu6 requested review from daviswer and shiqiangw June 18, 2025 15:05
@lchu6
Copy link
Contributor Author

lchu6 commented Jun 18, 2025

@daviswer I think the reporting should be correct but help me double check as I typed really quick.

ddp_stats[1] stores gnorm and we add only every grad_accum_steps.
ddp_stats[2] stores "denominator" and we only count "real steps".
ddp_stats[0] stores loss and we count all steps (so grad_accum_steps times more value than above 1 and 2), but since our loss was anyway loss = loss / grad_accum_steps, so this value should also be correct with above denominator.

So if we have a total steps of 15 with grad accum step = 3:
we have 5 values for gnorm and denominator=5, while 15 values for loss. but those 15 values of loss was already original_loss/3, so it pair with denominator=5.

Copy link
Collaborator

@daviswer daviswer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That logic checks out, looks good!

@shiqiangw shiqiangw merged commit 3f959ee into mamba-tiktoken Jun 18, 2025
0 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants