[detailed] memory snapshot and footprint for non-blocking copy #3485
Open: TroyGarden wants to merge 1 commit into meta-pytorch:main from TroyGarden:export-D85508674
+89 −16
Conversation
@TroyGarden has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85508674.
Summary:

# context
* How to do a timely, efficient copy of a tensor from host to device has been well explained in [A guide on good usage of non_blocking and pin_memory() in PyTorch](https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html).
* Two requirements to enable non-blocking data transfer on both the host and the device sides are:
  * **use pin_memory on the host side**
  * **assign a side cuda stream for the data transfer**
* A more realistic example that uses the data on the device after the transfer:
```
# data often come from a dataloader on the host side
host_tensor = dataloader().pin_memory()
main_stream = torch.cuda.current_stream()
side_stream = torch.cuda.Stream()
# use a side stream for the non-blocking data transfer;
# without it the transfer runs on the current (main) stream
with torch.cuda.stream(side_stream):
    device_tensor = host_tensor.to(device="cuda", non_blocking=True)
    # record the main stream as a consumer so the allocator doesn't reclaim
    # device_tensor prematurely (there is no follow-up usage on the side stream)
    device_tensor.record_stream(main_stream)
# the device can do some irrelevant compute in the meantime
some_function()
# the main stream needs the data, so it has to wait for the transfer to complete
main_stream.wait_stream(side_stream)
use_the_data(device_tensor)
```
* A small example also confirms the behavior (see the reproduce section later): with non-blocking data transfer, both the cpu and gpu executions are non-blocking, and the data transfer starts immediately with the cpu execution. Without an extra side stream for the data transfer, the host-to-device copy runs on the main cuda stream and only starts when that stream becomes available (delayed). {F1983001559}

# how about memory efficiency
* However, memory consumption (footprint) is not considered in the discussion above.
* As explained in this blog, [A guide to PyTorch's CUDA Caching Allocator](https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html), a tensor created on a side stream won't be shared with other streams.
* In other words, although the tensor copied from the host is primarily used on the main stream, it is created on the side stream, so the caching allocator collects it and returns it to the side stream. The freed memory goes to the "reserved memory" associated with the side stream and cannot be (re-)used by other operations on the main stream. The cuda memory footprint comparison is shown in the diagram below. {F1983001783}
* The memory snapshots show similar maximum memory usage in both scenarios in the **active memory timeline**, but the **active cached segment timeline** reveals different overall memory usage (footprint). {F1983001762} {F1983001764}

# is it the price we must pay
* Not necessarily. We can work around this by pre-allocating the memory on the main stream and using an in-place copy to do the data transfer on the side stream. Just make sure the main stream waits on the side stream before using the transferred data; once the data is freed, its memory can be reused by the main stream again. Diagram shown below, and a sketch of the pattern follows this list. {F1983001807}
* This idea is also verified in a small example (also included in the reproduce section). The trace comparison shows that the executions are non-blocking on both the cpu and gpu sides. {F1983001257}
* The **active memory timeline** shows a similar pattern between the "pre-allocated non-blocking copy" and the "non-blocking copy", but in the **active cached segment timeline**, the "pre-allocated non-blocking copy" has less memory usage (same as the "blocking copy") than the "non-blocking copy". {F1983001817} {F1983001823}
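As a minimal, self-contained sketch of the pre-allocation idea (not the exact code in `benchmark_comms.py`), the placeholders from the snippet above are replaced with a concrete tensor so it runs as-is; the buffer size is assumed to be known up front, which is the usability cost noted in the discussions section:
```
import torch

# requires a CUDA device; the tensor size stands in for the real input batch
host_tensor = torch.randn(1024, 1024).pin_memory()

main_stream = torch.cuda.current_stream()
side_stream = torch.cuda.Stream()

# allocate the destination on the main stream so the caching allocator
# associates the block with the main stream, not the side stream
device_buffer = torch.empty_like(host_tensor, device="cuda")

with torch.cuda.stream(side_stream):
    # make sure any pending main-stream work that may touch this block is done
    side_stream.wait_stream(main_stream)
    # the copy kernel runs on the side stream but writes into main-stream memory
    device_buffer.copy_(host_tensor, non_blocking=True)

# the main stream can run unrelated compute here, overlapping with the copy

# the main stream waits for the transfer to complete before consuming the data
main_stream.wait_stream(side_stream)
result = device_buffer.sum()
```
Because `device_buffer` is allocated on the main stream, the allocator returns its block to the main stream's pool when it is freed, which is what keeps the reserved-memory footprint flat in the comparison above.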
# discussions
* Are all problems solved? Unfortunately, not yet. We noticed that the host-to-device data transfer speed is much slower with the in-place copy than with the regular copy (5.8 GB/s vs 11.8 GB/s), regardless of blocking or non-blocking data transfer (shown in the first screenshot).
* The development experience is rough for the pre-allocation approach: the user has to explicitly figure out the input data size, and in a common use case the inputs are wrapped in the `ModelInput` class with a complex data structure, which often only implements the `.to(...)` method, not an in-place copy.
* How much headroom is there? Most production models use (some kind of) TorchRec's train pipeline, which shares the same `copy_batch_to_gpu` method. Estimating from a few production models we have been working with, the input KJT is about 1~3 GB on each rank. Also worth mentioning: judging from the traces, in most cases the `copy_batch_to_gpu` step isn't very long (it can afford a ~50% slowdown), and in some suboptimal use cases the input data does not use `pin_memory()`, causing cpu-side blocking.
* My wish: could PyTorch make an async host-to-device transfer execute on a side stream while the transferred tensor belongs to the main stream? Something like:
```
host_tensor = dataloader().pin_memory()
main_stream = torch.cuda.current_stream()
side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    # run the data transfer on the side stream, but allocate the result on
    # the main stream (target_stream is a hypothetical parameter)
    tensor = host_tensor.to("cuda", non_blocking=True, target_stream=main_stream)
...
# use the transferred data
main_stream.wait_stream(side_stream)
do_something(tensor)
```

# benchmark
|name|GPU Peak Memory alloc|GPU Peak Memory reserved|CPU Peak RSS|
|--|--|--|--|
|blocking_copy|0.13 GB|0.13 GB|1.49 GB|
|non_blocking_copy|0.13 GB|**0.15 GB**|1.49 GB|
|preallocated_non_blocking_copy|0.13 GB|0.13 GB|1.49 GB|

# reproduce
* blocking copy
```
python -m torchrec.distributed.benchmark.benchmark_comms -- \
    a2a_single --memory_snapshot=1 --num_mul=20 \
    --name=blocking_copy
```
* non-blocking copy
```
python -m torchrec.distributed.benchmark.benchmark_comms -- \
    a2a_single --memory_snapshot=1 --num_mul=20 \
    --name=non_blocking_copy
```
* non-blocking copy with preallocated memory
```
python -m torchrec.distributed.benchmark.benchmark_comms -- \
    a2a_single --memory_snapshot=1 --num_mul=20 \
    --name=preallocated_non_blocking_copy
```

Differential Revision: D85508674
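As a side note to the benchmark and reproduce sections above: peak-memory numbers of this kind, along with the snapshots behind the active memory / active cached segment timelines, can be collected with PyTorch's public memory APIs. The sketch below is illustrative and not the instrumentation used by `benchmark_comms.py`; `run_copy_variant` is a hypothetical stand-in for one of the three copy strategies.
```
import torch

def run_copy_variant():
    # hypothetical stand-in for one of the three copy strategies benchmarked above
    host_tensor = torch.randn(1024, 1024).pin_memory()
    return host_tensor.to("cuda", non_blocking=True)

# record allocator events so a snapshot can be dumped for the memory_viz tool
torch.cuda.memory._record_memory_history()

device_tensor = run_copy_variant()
torch.cuda.synchronize()

print(f"GPU peak memory alloc:    {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(f"GPU peak memory reserved: {torch.cuda.max_memory_reserved() / 1e9:.2f} GB")

# the dumped pickle can be inspected at https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("copy_variant_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```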
Labels: CLA Signed, fb-exported, meta-exported