Description
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [-] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
- I am reporting the issue to the correct repository. (Model Garden official or research directory)
- I checked to make sure that this issue has not already been filed.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/tree/master/research/...
2. Describe the bug
32 CPU cores seem insufficient to keep 6x K80 GPUs busy in some scenarios, while in other scenarios there is no problem. In the good case, all 6 GPUs show nearly 100% utilization. In the bad case, the GPUs take turns reaching 100% utilization instead of all being busy at the same time. However, CPU utilization across the 32 cores is only ~50%.
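To make the good and bad cases comparable, per-GPU utilization can be sampled over time. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH and reports a plain integer percentage for every GPU; the helper names are hypothetical and not part of the Object Detection API.

```python
import subprocess
import time


def parse_gpu_utilization(csv_text):
    """Parse nvidia-smi CSV output with one integer percentage per line."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def query_gpu_utilization():
    """Return the current utilization (in percent) of every visible GPU."""
    # Assumption: every GPU reports a number (no "[N/A]" entries).
    output = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_gpu_utilization(output)


def sample_gpu_utilization(num_samples=10, interval_s=1.0):
    """Collect several snapshots so alternating utilization becomes visible."""
    samples = []
    for _ in range(num_samples):
        samples.append(query_gpu_utilization())
        time.sleep(interval_s)
    return samples
```

Calling `sample_gpu_utilization()` during training returns one list of percentages per snapshot, which should show whether the GPUs are busy simultaneously or one after another.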
3. Steps to reproduce
I am using the legacy training script because of its multi-GPU support for TF1:
object_detection/legacy/train.py --num_workers=6 --ps_tasks=1
I use the 'mask_rcnn_inception_v2_coco.config' and pretrained model.
batch_size = 12 (2 per GPU)
My images and masks all have the same width and height, so no resizing is needed. Everything is stored in 10 TFRecord shards.
I have trained models successfully with this setup.
However, the main difference I can see between the scenarios is whether the number of instance masks per image is particularly high.
In the bad case, there are more than 20 and up to 60 instance masks per image sample. That seems to be the reason GPU utilization drops by about 4x.
No other augmentations or resizing are needed. Everything is precomputed and stored across the 10 TFRecord shards.
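A rough back-of-the-envelope estimate of how much mask data the input pipeline has to decode and move per batch may help explain the difference. This is only a sketch: it assumes each instance mask is decoded into a full-resolution float32 array of shape [height, width], and the 1024 x 1024 image size in the usage note is a made-up placeholder, since the real size is not stated here.

```python
def decoded_mask_bytes(num_instances, height, width, bytes_per_value=4):
    """Bytes needed for one image's instance masks after decoding.

    Assumption: one dense [height, width] array per instance,
    stored as float32 (4 bytes per value).
    """
    return num_instances * height * width * bytes_per_value


def batch_mask_mib(num_instances, height, width, images_per_batch):
    """Decoded mask volume for a whole batch, in MiB."""
    total = decoded_mask_bytes(num_instances, height, width) * images_per_batch
    return total / 2 ** 20
```

Under these assumptions, `batch_mask_mib(60, 1024, 1024, 12)` gives the decoded mask volume for one batch of 12 images with 60 masks each, which is three times the volume of the same batch with 20 masks per image.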
4. Expected behavior
I expected 100% utilization on all GPUs and did not expect the CPU to be the bottleneck.
5. Additional context
I set the following in the .config file:
model {
  image_resizer {
    identity_resizer {
    }
  }
}
train_input_reader {
  batch_queue_capacity: 256
  num_batch_queue_threads: 32
  prefetch_queue_capacity: 256
}
These settings had no effect at all.
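One way to check whether the input pipeline is the limiting factor, independent of the queue settings above, is to time how fast batches can be produced without running the model. The following is a minimal sketch in plain Python; `measure_throughput` is a hypothetical helper, and it assumes the batches are available as an ordinary Python iterable.

```python
import time


def measure_throughput(batches, num_batches, warmup=2):
    """Return batches per second pulled from an iterable.

    The first `warmup` batches are consumed but not timed, so that
    one-off startup costs (file opening, buffer filling) are excluded.
    """
    iterator = iter(batches)
    for _ in range(warmup):
        next(iterator)
    start = time.perf_counter()
    for _ in range(num_batches):
        next(iterator)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from the timer on very fast iterables.
    return num_batches / elapsed if elapsed > 0 else float("inf")
```

If the measured rate for the samples with many masks is below the rate at which the 6 GPUs consume batches in the good case, the input side is the likely bottleneck.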
6. System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04 LTS
- Mobile device name if the issue happens on a mobile device:
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 1.15.3
- Python version: 3.6
- Bazel version (if compiling from source): N/A
- GCC/Compiler version (if compiling from source): N/A
- CUDA/cuDNN version: 10.0
- GPU model and memory: 6 x K80 11GB