Error during saving checkpoint with TensorRT-enabled PPO actor training #281
Comments
Update regarding the save checkpoint issue with TensorRT-enabled PPO training. We tested that falling back to [...]

Here is the error log for the setup after switching the [...]

Got a similar problem, while the [...]

BTW, the [...]
Describe the bug
During a PPO actor training run with TensorRT enabled, an error was encountered during the validation checkpointing process. The training used the TensorRT-LLM setup suggested in the TRT-LLM Accelerated-RLHF documentation, and the latest NeMo docker image was used for the experiment.
The issue occurred specifically when the training job attempted to save checkpoints while TensorRT was in use. When the PPO actor training ran without TensorRT, validation checkpointing succeeded and the checkpoints were saved without errors.
This is the error message:
Here is the list of files saved for the checkpoint when the PPO actor training runs without TensorRT enabled:
Additionally, when the PPO actor training ran with TensorRT enabled but with validation checkpointing disabled, no errors were encountered. Here is the log of the PPO actor training run with validation checkpointing disabled:
In summary, the error is observed only during validation checkpointing when the TensorRT-enabled setup is used. Training completes successfully either without TensorRT or with validation checkpointing disabled.
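For reference, below is a minimal sketch of the kind of launch command and Hydra overrides being toggled between these runs. The script path and override names (trainer.ppo.trt_llm.enable, trainer.ppo.val_check_interval) are assumptions based on typical NeMo-Aligner PPO configs, not the exact command used for this report.

```bash
# Sketch only: the script path and override names below are assumptions based on
# typical NeMo-Aligner PPO configs, not the exact command from this report.
# - trainer.ppo.trt_llm.enable     -> toggles TensorRT-LLM accelerated generation
# - trainer.ppo.val_check_interval -> controls how often validation (and its
#                                     checkpointing) runs; the error appears there
python -u examples/nlp/gpt/train_gpt_ppo_actor.py \
    trainer.ppo.trt_llm.enable=True \
    trainer.ppo.val_check_interval=10
```

Toggling TensorRT-LLM off, or leaving validation checkpointing inactive, corresponds to the two configurations that completed without errors.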
Steps/Code to reproduce bug
To reproduce the bug on a p4de instance with 8 A100 GPUs:

1. Pull the latest NeMo docker image and launch the container (see the sketch after this list).
2. Run the PPO critic server inside the container.
3. Run the PPO actor training inside the container.
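A hedged sketch of step 1; the image tag and mount path below are placeholders rather than the exact values used for this report.

```bash
# Sketch of step 1 only: the image tag and mount path are placeholders,
# not the exact values used for this report.
docker pull nvcr.io/nvidia/nemo:24.01.framework

docker run --gpus all -it --rm \
    --shm-size=16g \
    --ulimit memlock=-1 \
    -v /path/to/workspace:/workspace \
    nvcr.io/nvidia/nemo:24.01.framework bash
```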
Expected behavior
The PPO actor training run with TensorRT enabled should complete successfully and should be able to save checkpoints during both the training and validation checkpointing stages without any issues.
Environment overview (please complete the following information)
`docker pull` & `docker run` commands used

Environment details

If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:
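(Not needed here, since the NVIDIA NeMo docker image was used.) For reference, a sketch of standard commands for collecting these details; the commands are generic and not taken from this report.

```bash
# Generic commands for collecting environment details (for reference only).
head -n 2 /etc/os-release                                                # OS version
nvidia-smi --query-gpu=name,driver_version --format=csv                  # GPU model and driver
python -c "import torch; print(torch.__version__, torch.version.cuda)"   # PyTorch / CUDA versions
python -c "import sys; print(sys.version)"                               # Python version
```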
Additional context
Using 8× NVIDIA A100-SXM4-80GB GPUs.