fix(offpolicy): unify single/multi-GPU learner log interval #656
Merged
Make DoubleBufferOffPolicyRunner log training metrics every iteration instead of every 10 iterations. This aligns its TensorBoard scalar density with MultiGPUOffPolicyRunner so single-GPU and multi-GPU runs are directly comparable.
Summary
Make `DoubleBufferOffPolicyRunner` log training metrics every iteration instead of every 10 iterations, aligning its TensorBoard scalar density with `MultiGPUOffPolicyRunner`.

Motivation
Single-GPU FlashSAC runs were logging 10x fewer scalar points than multi-GPU runs because `DoubleBufferOffPolicyRunner.LEARNER_LOG_INTERVAL = 10` only called `logger.log_step` every 10 iterations. This made direct reward-curve comparison across GPU counts misleading.

Changes
- Remove `LEARNER_LOG_INTERVAL` from `DoubleBufferOffPolicyRunner`.
- Call `logger.log_step` on every training iteration.
- […] `log_interval=1`.

Validation
- `uv run pytest tests/algos/test_offpolicy_double_buffer_runner.py -q` passes (46 passed).
- `make test-all` fails locally due to an unrelated MuJoCo `batch_env` native abort in `tests/base/test_mujoco_batch_env_jacobian.py`, not caused by this change.
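
For reviewers who want to see the 10x density gap from Motivation concretely, here is a minimal sketch of an interval-gated logging loop. It is not the repository's code: `RecordingLogger`, `run_learner`, and the `log_step(iteration, metrics)` signature are assumptions made for illustration. Only the idea of gating `log_step` on an interval comes from this PR.

```python
# Hypothetical sketch. RecordingLogger, run_learner, and the log_step
# signature are illustrative assumptions, not the project's actual API.


class RecordingLogger:
    """Stand-in logger that records the iterations at which log_step is called."""

    def __init__(self):
        self.steps = []

    def log_step(self, iteration, metrics):
        self.steps.append(iteration)


def run_learner(logger, num_iterations, log_interval):
    """Run a dummy training loop that logs once every `log_interval` iterations."""
    for iteration in range(1, num_iterations + 1):
        metrics = {"loss": 1.0 / iteration}  # placeholder metric
        if iteration % log_interval == 0:
            logger.log_step(iteration, metrics)


# Old single-GPU behavior: one scalar point per 10 iterations.
before = RecordingLogger()
run_learner(before, num_iterations=100, log_interval=10)

# New behavior: one scalar point per iteration.
after = RecordingLogger()
run_learner(after, num_iterations=100, log_interval=1)

print(len(before.steps), len(after.steps))  # 10 100
```

Over the same 100 iterations the gated run records 10 points and the ungated run records 100, which is the mismatch that made single-GPU and multi-GPU curves hard to compare.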