
Commit 756221a

Merge pull request #933 from weecology/bw4sz-patch-1
Update 07_scaling.md
2 parents 00bcbf5 + 90de8ab commit 756221a

File tree: 1 file changed (+31 -5 lines)


docs/user_guide/07_scaling.md

Lines changed: 31 additions & 5 deletions
@@ -1,16 +1,42 @@
# Scaling DeepForest using PyTorch Lightning

-Often we have a large number of tiles we want to predict. DeepForest uses [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/) to scale inference. This gives us access to powerful tools for scaling without any changes to user code. DeepForest automatically detects whether you are running on GPU or CPU. Within a single GPU node, you can scale training without needing to specify any additional arguments, since we use the ['auto' devices](https://lightning.ai/docs/pytorch/stable/common/trainer.html#devices) detection within PyTorch Lightning. For advanced users, DeepForest can [run across multiple SLURM nodes](https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html), each with multiple GPUs.
-
-## Increase batch size
+### Increase batch size

It is more efficient to run a larger batch size on a single GPU. This is because the overhead of loading data and moving data between the CPU and GPU is relatively large. By running a larger batch size, we can reduce the overhead of these operations.

```
m.config["batch_size"] = 16
```

-## Scaling inference across multiple GPUs
+## Training
+
+DeepForest's create_trainer method passes any keyword argument through to the PyTorch Lightning trainer, which means we can use PyTorch Lightning's distributed training options directly. There is a large amount of documentation, but we find the most helpful section is
+
+https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html
+
+For example, on a SLURM cluster we use the following line to request 5 GPUs on a single node.
+```
+m.create_trainer(logger=comet_logger, accelerator="gpu", strategy="ddp", num_nodes=1, devices=devices)
+```
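
One way to fill in the `devices` variable above is to count the GPUs visible to the job. This is a minimal sketch rather than the guide's own code, and it assumes PyTorch is importable in the same environment as DeepForest:

```
import torch

# Count the GPUs visible to this job and hand the count to create_trainer().
# Falls back to CPU-friendly settings when no GPU is available.
if torch.cuda.is_available():
    accelerator = "gpu"
    devices = torch.cuda.device_count()
else:
    accelerator = "cpu"
    devices = "auto"
```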
+
+While we rarely train across multiple GPU nodes, PyTorch Lightning has functionality at very large scales. We welcome users to share what configurations worked best.
+
+A few notes that can trip up those less used to multi-GPU training. These apply to the default configuration and may vary on a specific system; we use a large university SLURM cluster with 'ddp' distributed data parallel.
+
+1. Batch sizes are expressed _per GPU_. If you tell DeepForest you want 2 images per batch and request 5 GPUs, you are computing 10 images per forward pass across all GPUs. This is crucial when profiling: make sure to scale any tests by the effective batch size (see the sketch after this list).
+
+2. Each device gets its own portion of the dataset, so the devices do not interact during forward passes.
+
+3. Make sure to use srun when combining with SLURM! This is easy to miss and will cause training to hang without error. It is documented here:
+
+https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#troubleshooting
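
To make note 1 concrete, here is a small sketch of the arithmetic, using the example numbers from the note above:

```
# With 'ddp', m.config["batch_size"] counts images per GPU, not per optimizer step.
per_gpu_batch_size = 2          # m.config["batch_size"]
devices = 5                     # GPUs requested on the node
effective_batch_size = per_gpu_batch_size * devices
print(effective_batch_size)     # 10 images per forward pass across all GPUs
```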
+
+
+## Prediction
+
+Often we have a large number of tiles we want to predict. DeepForest uses [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/) to scale inference. This gives us access to powerful tools for scaling without any changes to user code. DeepForest automatically detects whether you are running on GPU or CPU. The parallelization strategy is to run each tile on a separate GPU; we cannot parallelize crops from within the same tile across GPUs inside main.predict_tile(). If you set m.create_trainer(accelerator="gpu", devices=4) and run predict_tile, you will still only use 1 GPU per tile, because we need access to all crops to create a mosaic of the predictions.
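
As a minimal sketch of single-tile prediction on one GPU (the tile path is a placeholder, and patch_size/patch_overlap are simply example values for predict_tile):

```
m.create_trainer(accelerator="gpu", devices=1)
predictions = m.predict_tile("path/to/tile.tif", patch_size=400, patch_overlap=0.05)
```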
+
+### Scaling prediction across multiple GPUs

There are a few situations in which it is useful to replicate the DeepForest module across many separate Python processes. This is especially helpful when we have a series of non-interacting tasks, often called 'embarrassingly parallel' processes. In these cases, no DeepForest instance needs to communicate with another instance. Rather than coordinating GPUs with the associated annoyance of overhead and backend errors, we can just launch separate jobs and let them finish on their own. One helpful tool in Python is [Dask](https://www.dask.org/). Dask is a wonderful open-source tool for coordinating large-scale jobs. Dask can be run locally, across multiple machines, and with an arbitrary set of resources.
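
As a rough sketch of this pattern (the function, file names, and cluster setup here are placeholders, not the guide's own code), each worker builds its own DeepForest model and each tile is submitted as its own future; the `futures` list is then consumed as shown in the loop below:

```
from dask.distributed import Client, LocalCluster

def predict_one_tile(tile_path):
    # Each worker builds its own DeepForest instance, so processes never interact.
    from deepforest import main
    m = main.deepforest()
    # ...load model weights here, as shown earlier in the user guide...
    boxes = m.predict_tile(tile_path, patch_size=400, patch_overlap=0.05)
    completed_filename = tile_path + ".csv"
    boxes.to_csv(completed_filename)
    return completed_filename

# A local cluster is shown for simplicity; Dask can also be backed by a SLURM scheduler.
client = Client(LocalCluster(n_workers=2))
futures = [client.submit(predict_one_tile, f) for f in ["tile1.tif", "tile2.tif"]]
```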

@@ -69,4 +95,4 @@ We can wait to see the futures as they complete! Dask also has a beautiful visua
for x in futures:
    completed_filename = x.result()
    print(completed_filename)
-```
+```
