Thank you for sharing this great work!
In the paper, the hyper-policy is trained with a two-stage reward, and it is stated that “the hyper-policy is continually updated.” Based on this description, I understood that Stage 2 training continues from the parameters learned in Stage 1, i.e., as a form of fine-tuning.
However, in the released code, it appears that:
- Stage 1 and Stage 2 training are executed separately
- In the Stage 2 training code, I could not find where the hyper-policy checkpoint from Stage 1 is loaded or resumed
- As a result, Stage 2 seems to start again from the same initial parameters as Stage 1
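Is it possible that I missed the checkpoint load/resume code? If so, could you point me to where the Stage 1 hyper-policy is loaded before Stage 2 training begins?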
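To make the question concrete, here is a minimal, framework-agnostic sketch of the behavior I expected from the paper's description versus what I believe the released code does. Everything in it is hypothetical and not taken from the repository: `init_params`, `train`, `save_checkpoint`, `load_checkpoint`, the file name, and the toy update rule are stand-ins for the real hyper-policy and training loop. In a PyTorch codebase the save/load steps would presumably be `torch.save(model.state_dict(), path)` and `model.load_state_dict(torch.load(path))`.

```python
import os
import pickle
import random
import tempfile


def init_params(seed):
    """Hypothetical stand-in for the hyper-policy's initial parameters."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(4)]


def train(params, target, steps=10, lr=0.1):
    """Toy update loop (assumption): nudges each parameter toward `target`."""
    params = list(params)
    for _ in range(steps):
        params = [p + lr * (target - p) for p in params]
    return params


def save_checkpoint(params, path):
    """Write the hyper-policy parameters to disk."""
    with open(path, "wb") as f:
        pickle.dump({"hyper_policy": params}, f)


def load_checkpoint(path):
    """Read the hyper-policy parameters back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)["hyper_policy"]


ckpt_path = os.path.join(tempfile.mkdtemp(), "stage1_hyper_policy.pkl")

# Stage 1: train from the initial parameters and save a checkpoint.
stage1_params = train(init_params(seed=0), target=1.0)
save_checkpoint(stage1_params, ckpt_path)

# Expected Stage 2 (fine-tuning): start from the Stage 1 checkpoint.
stage2_start = load_checkpoint(ckpt_path)
stage2_params = train(stage2_start, target=2.0)

# What the released code appears to do: start Stage 2 from the same
# initial parameters as Stage 1, ignoring the Stage 1 result.
restart_start = init_params(seed=0)
```

In this sketch, `stage2_start` equals `stage1_params`, whereas `restart_start` equals the untrained initial parameters. My question is which of these two starting points the Stage 2 script is meant to use.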