Question/discussion on the boundary of metrics and Lightning itself: by default, we suggest that people store metrics as fields on the LightningModule, which means each metric is moved to whatever device the model is trained or executed on. This is great because it makes the metrics easy to use, but how should we approach keeping them on the CPU instead? The motivation is to minimize GPU RAM usage, leaving more of it for the model itself, and to allow running more metrics and more expensive metrics. Doing this today is relatively non-trivial: we need to move the metric to CPU, set up a separate Gloo-based distributed group (to exchange metric states), and manually handle moving data back and forth. Any thoughts on better ways to solve this?
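A minimal sketch of the manual handling described above, assuming torchmetrics' `Metric` still forwards a `process_group` argument to its sync logic and that the default process group for the model is already initialized; the metric choice and the dict-based storage are illustrative:

```python
import torch
import torch.distributed as dist
from torchmetrics import MeanSquaredError

# Assumes dist.init_process_group(...) was already called for the model
# (e.g. with NCCL); a second Gloo group can all-gather CPU tensors when
# metric states are synced.
gloo_group = dist.new_group(backend="gloo")

# Keeping the metric in a plain dict rather than as a LightningModule
# attribute stops Lightning from moving it to the GPU with the model.
metrics = {"mse": MeanSquaredError(process_group=gloo_group)}

def update_metrics(preds: torch.Tensor, target: torch.Tensor) -> None:
    # Manually move model outputs to CPU before every metric update.
    metrics["mse"].update(preds.detach().cpu(), target.detach().cpu())
```

The dict trick works because only `nn.Module` attributes of the LightningModule get moved with it; the cost is exactly the manual bookkeeping described above.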
There were thoughts of using PyTorch's TCPStore for this: by wrapping it, we could probably have one store for all the states across metrics, and states could be moved to CPU upon receipt and back to the GPU upon request (maybe). Thoughts @maximsch2 @SkafteNicki?
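A rough sketch of what wrapping TCPStore could look like; the class name, key scheme, and serialization via `torch.save` are assumptions for illustration, not an agreed design:

```python
import io
import torch
from torch.distributed import TCPStore

class MetricStateStore:
    """Hypothetical wrapper: one TCPStore shared by all metrics, with
    states held on CPU and moved to a device only when requested."""

    def __init__(self, host: str, port: int, world_size: int, is_master: bool):
        # One rank creates the store (is_master=True); the rest connect.
        self._store = TCPStore(host, port, world_size, is_master)

    def put(self, key: str, state: torch.Tensor) -> None:
        # Move to CPU on receipt so stored states hold no GPU memory.
        buf = io.BytesIO()
        torch.save(state.detach().cpu(), buf)
        self._store.set(key, buf.getvalue())

    def get(self, key: str, device: str = "cpu") -> torch.Tensor:
        # Deserialize on CPU, then move to the requested device on demand.
        raw = self._store.get(key)
        return torch.load(io.BytesIO(raw), map_location="cpu").to(device)
```

Usage would be something like `store.put("rank0/mse/sum_squared_error", state)` on update and `store.get(..., device="cuda:0")` only at compute time; a real version would still need per-rank, per-metric key namespacing and a reduction step across ranks.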