Description
Prerequisites
I am using TF 1.15.2 because the TF2 version of the research/object_detection models has not yet been released.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/blob/master/research/object_detection/model_main.py
Invocation:
python /home/ec2-user/SageMaker/models/research/object_detection/model_main.py --pipeline_config_path=/home/ec2-user/SageMaker/fuego-train/faster_rcnn_resnet50.config --model_dir=/home/ec2-user/SageMaker/data/model --num_train_steps=5800 --alsologtostderr
I am training the Faster R-CNN ResNet-50 model from the faster_rcnn_resnet50_coco_2018_01_28.tar.gz checkpoint on a 48-CPU machine with 192GB of memory (no GPU).
Model config uploaded (as .txt file)...
faster_rcnn_resnet50.config.txt
2. Describe the bug
My training runs never reach a stable memory state.
Memory usage keeps increasing until it approaches 192GB, and then the job fails.
I get through 360 steps (batch size 1) in 10 min, then a checkpoint and eval, and then some more steps. Training appears to be progressing OK, e.g. loss is decreasing. However, memory usage keeps growing until the run eventually fails as it nears 192GB (runs sometimes fail on an allocation and sometimes just crash).
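To put a number on the growth described above, something like the following could be used to sample the resident memory of the training process and estimate how much it grows per step. This is only a minimal sketch, not part of model_main.py: the helper names (parse_vmrss_kb, read_rss_kb, growth_per_step) are hypothetical, and it assumes a Linux host where /proc/<pid>/status exposes a VmRSS line.

```python
import re


def parse_vmrss_kb(status_text):
    """Return the VmRSS value (in kB) found in /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid="self"):
    """Read the current resident set size of a process (Linux only, assumption)."""
    with open("/proc/{}/status".format(pid)) as f:
        return parse_vmrss_kb(f.read())


def growth_per_step(samples):
    """Least-squares slope of RSS versus step.

    samples is a list of (step, rss_kb) pairs; the result is in kB per step.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_step = sum(step for step, _ in samples) / n
    mean_rss = sum(rss for _, rss in samples) / n
    var_step = sum((step - mean_step) ** 2 for step, _ in samples)
    if var_step == 0:
        raise ValueError("samples must cover more than one distinct step")
    cov = sum((step - mean_step) * (rss - mean_rss) for step, rss in samples)
    return cov / var_step
```

Calling read_rss_kb() with the PID of the training process at regular intervals, and passing the collected (step, rss_kb) pairs to growth_per_step, would give a rough kB-per-step figure. A slope that stays clearly positive after the first checkpoint and eval cycle would be consistent with a leak rather than warm-up allocation.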
3. Steps to reproduce
Memory leaks consistently on every run.
4. Expected behaviour
I expect memory usage to stabilize during training.
5. Additional context
log attached (lots of TF warnings)...
train.log
The TFRecord dataset is 2GB; please advise if you want me to provide it.
The rcnn resnet50 weights can be downloaded from the object_detection model zoo...
Note: The images are quite large (1500x2000). For example (the Box in each filename is the single object bounding box within that image):
-rw-rw-r-- 1 ec2-user ec2-user 458461 Jun 2 12:49 69bravo-e-mobo-c__2019-08-13T14_21_44_Box_160x1092x313x1218.jpg
-rw-rw-r-- 1 ec2-user ec2-user 460998 Jun 2 12:49 69bravo-e-mobo-c__2019-08-13T14_22_44_Box_156x1082x328x1205.jpg
-rw-rw-r-- 1 ec2-user ec2-user 451599 Jun 2 12:49 69bravo-e-mobo-c__2019-08-13T14_26_44_Box_175x1044x344x1220.jpg
-rw-rw-r-- 1 ec2-user ec2-user 465679 Jun 2 12:49 69bravo-e-mobo-c__2019-08-13T14_27_44_Box_137x1051x382x1245.jpg
...
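For a sense of scale, here is a back-of-the-envelope estimate of how much memory one decoded image of this size would occupy. It is only a sketch under stated assumptions: 3 color channels, float32 values, and no resizing (whatever image resizer the pipeline config applies is not taken into account). The function name decoded_image_bytes is hypothetical.

```python
def decoded_image_bytes(height, width, channels=3, bytes_per_value=4):
    """Size in bytes of one dense decoded image tensor.

    Assumes 3 channels and float32 (4 bytes per value) by default.
    """
    return height * width * channels * bytes_per_value


# One 1500x2000 RGB image held as float32, without any resizing.
size_bytes = decoded_image_bytes(1500, 2000)
print(size_bytes / 1e6)  # 36.0 (MB)
```

At about 36 MB per decoded image under these assumptions, and with a batch size of 1, the input images alone would not be expected to account for memory growing towards 192GB unless many decoded copies were being retained.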
These images have been prepared as sharded TFRecords:
-rw-rw-r-- 1 ec2-user ec2-user 35 Jun 4 01:40 smoke_label_map.pbtxt
-rw-rw-r-- 1 ec2-user ec2-user 46713957 Jun 4 01:39 smoke_train.record-00000-of-00050
-rw-rw-r-- 1 ec2-user ec2-user 42074074 Jun 4 01:39 smoke_train.record-00001-of-00050
-rw-rw-r-- 1 ec2-user ec2-user 42461823 Jun 4 01:39 smoke_train.record-00002-of-00050
-rw-rw-r-- 1 ec2-user ec2-user 41608315 Jun 4 01:39 smoke_train.record-00003-of-00050
-rw-rw-r-- 1 ec2-user ec2-user 41683135 Jun 4 01:39 smoke_train.record-00004-of-00050
...
6. System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux-4.14.171-105.231.amzn1.x86_64-x86_64-with-glibc2.9
- Mobile device name if the issue happens on a mobile device:
- TensorFlow installed from (source or binary):
- TensorFlow version (use command below): v1.15.2-2-gbcc274e 1.15.2
- Python version: 3.6.6
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: cuda-10.0
- GPU model and memory: None