This is excellent work, but I have some questions that I hope can be answered.
- According to the README, Stage 1 training should be run first with its configuration file, followed by Stage 2 with the corresponding config. In the Stage 2 configuration file, however, the gram loss is only enabled after 1M iterations, whereas the paper indicates that the gram loss should be applied immediately once Stage 1 is complete and the next phase begins. With the current procedure, Stage 2 effectively re-trains for 1M iterations from scratch before the gram loss is activated. Why is this the case? (A sketch of the gating behavior I am describing follows this list.)
- The paper shows that the dense features at 200K iterations (early in training) are better than those at 1M iterations, yet the current practice uses the checkpoint from the end of Stage 1 (1M iterations) as the gram teacher for Stage 2. This seems to contradict the findings presented in the paper.
- If that is the case, can I assume that Stage 1 (as defined in the configuration file) only serves the purpose of producing a Gram Teacher? And does Stage 2, as launched with the current configuration file, actually cover both the pre-training stage (Stage 1 in the paper) and the Gram Anchor stage (Stage 2 in the paper) at the same time?
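
To make the first question concrete, here is a minimal Python sketch of the schedule I am describing, not the repository's actual code: the gram term is simply zeroed out until the iteration counter passes a configured threshold. The names `GRAM_START_ITER`, `gram_loss_weight`, and `total_loss` are hypothetical and only illustrate my reading of the Stage 2 config.

```python
# Hypothetical sketch of an iteration-gated gram loss (illustrative names only).

GRAM_START_ITER = 1_000_000  # assumed value taken from the Stage 2 config


def gram_loss_weight(iteration: int, base_weight: float = 1.0) -> float:
    """Return the gram loss weight at a given iteration (0 before the start)."""
    return base_weight if iteration >= GRAM_START_ITER else 0.0


def total_loss(main_loss: float, gram_loss: float, iteration: int) -> float:
    """Combine the main objective with the iteration-gated gram loss term."""
    return main_loss + gram_loss_weight(iteration) * gram_loss


if __name__ == "__main__":
    # Before 1M iterations the gram term is ignored; after that it is applied.
    print(total_loss(main_loss=2.0, gram_loss=0.5, iteration=200_000))    # 2.0
    print(total_loss(main_loss=2.0, gram_loss=0.5, iteration=1_200_000))  # 2.5
```

If this matches the intended behavior, then the first 1M iterations of Stage 2 run without any gram supervision, which is the part I find surprising given the paper.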
I look forward to your reply.