End-to-end benchmarking for deep learning training pipelines
Here I want to map out some ideas I have for benchmarking deep learning training pipelines. By "pipeline" I mean everything that happens when training a model for some number of epochs (I break this down below). The aim is to provide an approachable workflow for tracking down bottlenecks in pipelines and enabling users to fix them.
I'll lay down ideas for investigating and improving the performance of everything but the models themselves; there are already lots of people working on that. Instead I'll focus on the other parts of the pipeline that can slow training down by reducing hardware utilization. Given an implemented training pipeline, I want to make it easy to find out whether you're utilizing your hardware (e.g. GPU(s)) and, if not, to have clear directions for locating the bottlenecks so you know what to fix. If your GPU sits at an average 80% utilization and you increase that to 97%, you've just increased throughput by roughly 1.2x. So while improving the performance of the models themselves (e.g. through GPU kernels and memory management) is important for raising the upper performance bound, you also need to make sure the surrounding data pipeline doesn't slow down your training. For large datasets with complex data processing logic, as in computer vision pipelines, the latter can easily happen; when it does, it is currently not straightforward to find the cause. Making this benchmarking easy will give clear priorities on where functionality needs to be optimized in existing packages (e.g. DataAugmentation.jl, FastAI.jl) and also allow users to better debug the performance of their custom training pipelines.
First I break down where bottlenecks occur. Afterwards, I suggest a workflow for detecting bottlenecks and narrowing their source down, and finally I discuss a possible implementation in FluxTraining.jl and FastAI.jl.
Bottlenecks
There are different causes for underutilization of hardware:
1. The model cannot utilize the GPU fully, e.g. due to a lack of parallelization. We won't delve into this here.
2. The GPU sits idle waiting for the next batch of data.
3. Additional training functionality like logging and metrics takes too long.
Problem 3 is the less common bottleneck, but it can occur when you're synchronously logging metrics and hyperparameters to external services, generating visualizations during training, moving data to or from hardware, or calculating expensive metrics. Many of these problems can be fixed by doing things asynchronously where possible, so that the GPU can keep working.
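For the synchronous-logging case, here is a minimal sketch of the asynchronous pattern in plain Julia. It is independent of FluxTraining.jl, and log_to_service is a hypothetical stand-in for whatever slow call (HTTP request, disk write, ...) would otherwise block the training loop:

```julia
# A minimal sketch of asynchronous logging: the training loop pushes log
# entries into a buffered channel and returns immediately, while a
# background task drains the channel and does the slow work.
# `log_to_service` is a hypothetical placeholder, not a real API.

log_channel = Channel{Any}(1000)          # buffered so `put!` rarely blocks

logger_task = @async for entry in log_channel
    log_to_service(entry)                 # hypothetical slow logging call
end

# Inside the training loop: this returns almost immediately.
put!(log_channel, (step = 42, loss = 0.37))

# After training: close the channel and wait for pending entries to flush.
close(log_channel)
wait(logger_task)
```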
Problem 2 is much more complex because there is a lot more going on. Here's roughly what needs to happen before a batch from an out-of-memory dataset is ready:
For every observation in the batch:
1. Load it from disk
2. Augment and preprocess it (let's call this "encoding")
Then, for the whole batch:
3. Batch the encoded observations together
4. Move the batch to the hardware
The most effective measure here is to do all of this in parallel on background threads (e.g. using DataLoaders.jl) so that each batch is ready the moment it is needed. But in cases where this is not enough (i.e. the background threads are fully utilized and batches still can't be prepared quickly enough), you have to optimize the above steps. How to optimize them depends on what they're doing, and which step to optimize depends on whether that step is the bottleneck and how much you can do about it. One example of a simple strategy that can greatly speed up pipelines involving image loading is presizing the images on disk before training.
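To make this concrete, here is a rough sketch of the batch-preparation steps and of a background prefetcher built on a buffered Channel. All function names here (loadobs, encodeobs, collate, todevice, train_step!) are placeholders for your pipeline's own functions, not the DataLoaders.jl API; DataLoaders.jl packages a more sophisticated, multi-threaded version of this pattern for you.

```julia
# Sketch only: loadobs/encodeobs/collate/todevice are placeholders for your
# pipeline's own functions, not a real API.
function prepare_batch(indices)
    samples = map(indices) do idx
        sample = loadobs(idx)        # 1. load the observation from disk
        encodeobs(sample)            # 2. augment / preprocess ("encode")
    end
    batch = collate(samples)         # 3. stack encoded observations into a batch
    return todevice(batch)           # 4. move the batch to the GPU
end

# Background prefetching: prepare batches on a separate task and hand them
# to the training loop through a buffered channel, so that (ideally) a batch
# is always ready when the loop asks for one.
function prefetched_batches(batch_indices; buffersize = 4)
    return Channel{Any}(buffersize; spawn = true) do ch
        for indices in batch_indices
            put!(ch, prepare_batch(indices))
        end
    end
end

# for batch in prefetched_batches(batch_indices)
#     train_step!(learner, batch)    # placeholder training step
# end
```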
Workflow
Now let's say you have a training pipeline and want to make sure it's running as fast as it can, i.e. fully utilizing your hardware. First you need to run the training loop and measure the utilization of your hardware (a sketch for doing that follows the list below). If it's close to 100%, you're good. Otherwise, you need to find the bottleneck:
If a good chunk of time is spent waiting for the next batch, you need to improve the throughput of your data loading pipeline. Assuming you can't parallelize it any further, find which of the above steps is the slowest and try to speed it up.
If training extras like logging are reducing utilization, narrow down which one and either run it asynchronously if possible or optimize it.
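As for measuring utilization in the first place, here is a quick sketch that samples GPU utilization while training runs, assuming a single NVIDIA GPU with nvidia-smi on the PATH (with several GPUs, nvidia-smi prints one line per device, which you would parse separately):

```julia
# Sample GPU utilization while training runs elsewhere. Assumes a single
# NVIDIA GPU and `nvidia-smi` available on the PATH.
function sample_gpu_utilization(; interval = 1.0, duration = 60.0)
    samples = Float64[]
    stop = time() + duration
    while time() < stop
        out = read(`nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`, String)
        push!(samples, parse(Float64, strip(out)))
        sleep(interval)
    end
    return samples
end

# util = sample_gpu_utilization(duration = 120.0)
# println("mean GPU utilization: ", sum(util) / length(util), " %")
```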
Implementation
Here's how the above workflow could be integrated into FluxTraining.jl and FastAI.jl in a way that allows fine-grained benchmarking of any pipeline without requiring any customization.
Training loop profiler
The first step is to run a FluxTraining.jl training loop and use the Events as hooks for measurement. We can measure how much time passes between two events and how long it takes each callback to process each event. With this data it is possible to answer the following questions:
What is the average GPU utilization during a training step?
What percentage of time is spent waiting for the next batch of data?
What percentage of time is spent running callbacks instead of utilizing the GPU?
If callbacks become a bottleneck, we can narrow it down to find out exactly which callback takes a long time to handle which event.
The idea for this profiler was first proposed in this issue, which also describes the implementation. I have a prototype version working and will update this post with some measurements soon. Usage of the profiler is unobtrusive and requires just one additional argument to Learner:
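The original snippet here isn't preserved, so the following is a purely hypothetical illustration of what this could look like; the profiler type, the report function, and the exact Learner signature are placeholders, not the prototype's actual API:

```julia
using FluxTraining

# Hypothetical illustration only: `ProfilerCallback` and `report` are
# placeholder names, and the `Learner` call below is schematic.
profiler = ProfilerCallback()
learner = Learner(model, data, optimizer, lossfn, profiler)

fit!(learner, 10)        # train as usual

report(profiler)         # afterwards, inspect time per event and per callback
```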
The timings can then be inspected to create reports that answer the above questions.
Data pipeline benchmarking
If data loading is the bottleneck despite already loading batches in parallel, we need to optimize the loading itself. In the FastAI.jl model, we can break data loading into two steps:
1. loading a sample from disk: getobs(data, idx)
2. encoding the sample (augmentation and preprocessing): encode(encodings, _, blocks, sample)
Each step can be benchmarked separately to find out where a bottleneck is.
If loading from disk is slow, preprocessing the dataset files, e.g. by presizing images as mentioned above, can be a good strategy.
If the encodings are the bottleneck, we can benchmark each encoding step separately and further narrow down which one needs to be sped up.
Implementation-wise this requires some simple functions to benchmark data containers and encodings.
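As a sketch of what such helpers might look like (the function names are placeholders, not an existing FastAI.jl API; getobs and encode are assumed to be in scope, and context stands for the argument elided as _ above):

```julia
using Statistics: mean

# Sketch only: `benchmark_loading` and `benchmark_encoding` are placeholder
# names, not an existing FastAI.jl API.

# Time loading observations directly from the data container.
function benchmark_loading(data, indices)
    times = [@elapsed getobs(data, idx) for idx in indices]
    return (mean = mean(times), max = maximum(times))
end

# Time the full encoding step for already-loaded samples. `context` stands
# for the training/validation context argument elided as `_` above.
function benchmark_encoding(encodings, context, blocks, samples)
    times = [@elapsed encode(encodings, context, blocks, sample) for sample in samples]
    return (mean = mean(times), max = maximum(times))
end
```

Comparing the two timings shows whether disk I/O or encoding dominates; applying the same pattern to a single encoding narrows it down further.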
While the exact strategies for how to fix each bottleneck differ, providing profiling and benchmarking tools to detect and narrow down the source of bottlenecks is the first step toward resolving them. The architecture of FluxTraining.jl and FastAI.jl makes the implementation of these tools relatively straightforward.