[detailed] memory snapshot and footprint for non-blocking copy #3485
Open: TroyGarden wants to merge 1 commit into meta-pytorch:main from TroyGarden:export-D85508674
+89 −16
Conversation
@TroyGarden has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85508674.
Summary:

# context
* How to do a timely, efficient copy of a tensor from host to device has been well explained in [A guide on good usage of non_blocking and pin_memory() in PyTorch](https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html).
* Two requirements to enable non-blocking data transfer on both the host and the device sides are:
  * **use pin_memory on the host side**
  * **assign a side cuda stream for the data transfer**
* A more realistic example that uses the data on the device after the transfer:
```
# data often come from a dataloader on the host side
host_tensor = dataloader().pin_memory()
main_stream = torch.cuda.current_stream()
side_stream = torch.cuda.Stream()
# use a side stream for the non-blocking data transfer;
# without it the transfer runs on the current (main) stream
with torch.cuda.stream(side_stream):
    device_tensor = host_tensor.to(device="cuda", non_blocking=True)
    # record the main stream as a consumer so the allocator doesn't reclaim
    # device_tensor prematurely (there is no follow-up usage on the side stream)
    device_tensor.record_stream(main_stream)
# the device can do some irrelevant compute in the meantime
some_function()
# the main stream needs the data, so it has to wait for the transfer to complete
main_stream.wait_stream(side_stream)
use_the_data(device_tensor)
```
* A small example also confirms the behavior (see the reproduce section later): with non-blocking data transfer, both the cpu and gpu executions are non-blocking, and the data transfer starts immediately with the cpu execution. Without an extra side stream for the data transfer, the host-to-device copy runs on the main cuda stream and only starts when that stream becomes available (delayed). {F1983001559}

# how about memory efficiency
* However, memory consumption (footprint) is not considered in the discussion above.
* As explained in this blog, [A guide to PyTorch's CUDA Caching Allocator](https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html), a tensor created on a side stream won't be shared with other streams.
* In other words, although the tensor copied from the host is primarily used on the main stream, it is created on the side stream, so the caching allocator collects it and returns it to the side stream. The freed memory goes to the "reserved memory" associated with the side stream and cannot be (re-)used by other operations on the main stream. The cuda memory footprint comparison is shown in the diagram below. {F1983001783}
* The memory snapshots show similar maximum memory usage in both scenarios in the **active memory timeline**, but the **active cached segment timeline** reveals different overall memory usage (footprint). {F1983001762} {F1983001764}

# is it the price we must pay
* Not necessarily. We can work around this by pre-allocating the memory on the main stream and using an in-place copy to do the data transfer on the side stream. Just make sure the main stream waits on the side stream before using the transferred data; once the data is freed, its memory can be reused by the main stream again. Diagram shown below, and a sketch of the pattern follows this list. {F1983001807}
* This idea is also verified in a small example (also included in the reproduce section). The trace comparison shows that the executions are non-blocking on both the cpu and gpu sides. {F1983001257}
* The **active memory timeline** shows a similar pattern between the "pre-allocated non-blocking copy" and the "non-blocking copy", but in the **active cached segment timeline**, the "pre-allocated non-blocking copy" has less memory usage (same as the "blocking copy") than the "non-blocking copy". {F1983001817} {F1983001823}
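As a minimal, self-contained sketch of the pre-allocation idea (not the exact code in `benchmark_comms.py`), the placeholders from the snippet above are replaced with a concrete tensor so it runs as-is; the buffer size is assumed to be known up front, which is the usability cost noted in the discussions section:
```
import torch

# requires a CUDA device; the tensor size stands in for the real input batch
host_tensor = torch.randn(1024, 1024).pin_memory()

main_stream = torch.cuda.current_stream()
side_stream = torch.cuda.Stream()

# allocate the destination on the main stream so the caching allocator
# associates the block with the main stream, not the side stream
device_buffer = torch.empty_like(host_tensor, device="cuda")

with torch.cuda.stream(side_stream):
    # make sure any pending main-stream work that may touch this block is done
    side_stream.wait_stream(main_stream)
    # the copy kernel runs on the side stream but writes into main-stream memory
    device_buffer.copy_(host_tensor, non_blocking=True)

# the main stream can run unrelated compute here, overlapping with the copy

# the main stream waits for the transfer to complete before consuming the data
main_stream.wait_stream(side_stream)
result = device_buffer.sum()
```
Because `device_buffer` is allocated on the main stream, the allocator returns its block to the main stream's pool when it is freed, which is what keeps the reserved-memory footprint flat in the comparison above.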
# discussions
* Are all problems solved? Unfortunately, not yet. We noticed that the host-to-device data transfer speed is much slower with the in-place copy than with the regular copy (5.8 GB/s vs 11.8 GB/s), regardless of blocking or non-blocking data transfer (shown in the first screenshot).
* The development experience is rough for the pre-allocation approach: the user has to explicitly figure out the input data size, and in a common use case the inputs are wrapped in the `ModelInput` class with a complex data structure, which often only implements the `.to(...)` method, not an in-place copy.
* How much headroom is there? Most production models use (some kind of) TorchRec's train pipeline, which shares the same `copy_batch_to_gpu` method. Estimating from a few production models we have been working with, the input KJT is about 1~3 GB on each rank. Also worth mentioning: judging from the traces, in most cases the `copy_batch_to_gpu` step isn't very long (it can afford a ~50% slowdown), and in some suboptimal use cases the input data does not use `pin_memory()`, causing cpu-side blocking.
* My wish: could PyTorch make an async host-to-device transfer execute on a side stream while the transferred tensor belongs to the main stream? Something like:
```
host_tensor = dataloader().pin_memory()
main_stream = torch.cuda.current_stream()
side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    # run the data transfer on the side stream, but allocate the result on
    # the main stream (target_stream is a hypothetical parameter)
    tensor = host_tensor.to("cuda", non_blocking=True, target_stream=main_stream)
...
# use the transferred data
main_stream.wait_stream(side_stream)
do_something(tensor)
```

# benchmark
|name|GPU Peak Memory alloc|GPU Peak Memory reserved|CPU Peak RSS|
|--|--|--|--|
|blocking_copy|0.13 GB|0.13 GB|1.49 GB|
|non_blocking_copy|0.13 GB|**0.15 GB**|1.49 GB|
|preallocated_non_blocking_copy|0.13 GB|0.13 GB|1.49 GB|

# reproduce
* blocking copy
```
python -m torchrec.distributed.benchmark.benchmark_comms -- \
    a2a_single --memory_snapshot=1 --num_mul=20 \
    --name=blocking_copy
```
* non-blocking copy
```
python -m torchrec.distributed.benchmark.benchmark_comms -- \
    a2a_single --memory_snapshot=1 --num_mul=20 \
    --name=non_blocking_copy
```
* non-blocking copy with preallocated memory
```
python -m torchrec.distributed.benchmark.benchmark_comms -- \
    a2a_single --memory_snapshot=1 --num_mul=20 \
    --name=preallocated_non_blocking_copy
```

Differential Revision: D85508674
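As a side note to the benchmark and reproduce sections above: peak-memory numbers of this kind, along with the snapshots behind the active memory / active cached segment timelines, can be collected with PyTorch's public memory APIs. The sketch below is illustrative and not the instrumentation used by `benchmark_comms.py`; `run_copy_variant` is a hypothetical stand-in for one of the three copy strategies.
```
import torch

def run_copy_variant():
    # hypothetical stand-in for one of the three copy strategies benchmarked above
    host_tensor = torch.randn(1024, 1024).pin_memory()
    return host_tensor.to("cuda", non_blocking=True)

# record allocator events so a snapshot can be dumped for the memory_viz tool
torch.cuda.memory._record_memory_history()

device_tensor = run_copy_variant()
torch.cuda.synchronize()

print(f"GPU peak memory alloc:    {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(f"GPU peak memory reserved: {torch.cuda.max_memory_reserved() / 1e9:.2f} GB")

# the dumped pickle can be inspected at https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("copy_variant_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```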
Labels: CLA Signed, fb-exported, meta-exported