Question/discussion on the boundary of metrics and Lightning itself: by default, we suggest that people store metrics as fields on the LightningModule, which means each metric is moved to whatever device the model is trained or executed on. This is great because it makes the metrics easy to use, but how should we approach keeping them on the CPU instead? The motivation is to minimize GPU RAM usage, leaving more of it for the model itself, and to allow running more metrics and more expensive metrics. Doing this today is relatively non-trivial: we need to move the metric to CPU, set up a separate Gloo-based distributed group (to exchange metric states), and manually handle moving data back and forth. Any thoughts on better ways to solve this?
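A minimal sketch of the manual handling described above, assuming torchmetrics' `Metric` still forwards a `process_group` argument to its sync logic and that the default process group for the model is already initialized; the metric choice and the dict-based storage are illustrative:

```python
import torch
import torch.distributed as dist
from torchmetrics import MeanSquaredError

# Assumes dist.init_process_group(...) was already called for the model
# (e.g. with NCCL); a second Gloo group can all-gather CPU tensors when
# metric states are synced.
gloo_group = dist.new_group(backend="gloo")

# Keeping the metric in a plain dict rather than as a LightningModule
# attribute stops Lightning from moving it to the GPU with the model.
metrics = {"mse": MeanSquaredError(process_group=gloo_group)}

def update_metrics(preds: torch.Tensor, target: torch.Tensor) -> None:
    # Manually move model outputs to CPU before every metric update.
    metrics["mse"].update(preds.detach().cpu(), target.detach().cpu())
```

The dict trick works because only `nn.Module` attributes of the LightningModule get moved with it; the cost is exactly the manual bookkeeping described above.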
There were thoughts of using PyTorch's TCPStore for this: by wrapping it, we could probably have one store for all the states across metrics, and states could be moved to CPU upon receipt and back to the GPU upon request (maybe). Thoughts @maximsch2 @SkafteNicki?
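A rough sketch of what wrapping TCPStore could look like; the class name, key scheme, and serialization via `torch.save` are assumptions for illustration, not an agreed design:

```python
import io
import torch
from torch.distributed import TCPStore

class MetricStateStore:
    """Hypothetical wrapper: one TCPStore shared by all metrics, with
    states held on CPU and moved to a device only when requested."""

    def __init__(self, host: str, port: int, world_size: int, is_master: bool):
        # One rank creates the store (is_master=True); the rest connect.
        self._store = TCPStore(host, port, world_size, is_master)

    def put(self, key: str, state: torch.Tensor) -> None:
        # Move to CPU on receipt so stored states hold no GPU memory.
        buf = io.BytesIO()
        torch.save(state.detach().cpu(), buf)
        self._store.set(key, buf.getvalue())

    def get(self, key: str, device: str = "cpu") -> torch.Tensor:
        # Deserialize on CPU, then move to the requested device on demand.
        raw = self._store.get(key)
        return torch.load(io.BytesIO(raw), map_location="cpu").to(device)
```

Usage would be something like `store.put("rank0/mse/sum_squared_error", state)` on update and `store.get(..., device="cuda:0")` only at compute time; a real version would still need per-rank, per-metric key namespacing and a reduction step across ranks.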