Memory Leak Training Faster-RCNN (Resnet 50)

# Prerequisites

I am using TF 1.15.2 because the TF2 version of the research/object_detection models has not been released yet.

## 1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/model_main.py

Invocation:
python /home/ec2-user/SageMaker/models/research/object_detection/model_main.py --pipeline_config_path=/home/ec2-user/SageMaker/fuego-train/faster_rcnn_resnet50.config --model_dir=/home/ec2-user/SageMaker/data/model --num_train_steps=5800 --alsologtostderr

I am training the Faster R-CNN ResNet-50 model from the faster_rcnn_resnet50_coco_2018_01_28.tar.gz checkpoint on a 48-CPU machine with 192GB of memory (no GPU).

Model config uploaded (as .txt file)...
[faster_rcnn_resnet50.config.txt](https://github.com/tensorflow/models/files/4727149/faster_rcnn_resnet50.config.txt)

## 2. Describe the bug

My training runs never reach a stable memory state.
Memory usage keeps increasing until it reaches 192GB, and then the job fails.

I get through 360 steps (batch size 1) in 10 minutes, then a checkpoint and eval, and then some more steps. Training appears to be progressing normally, e.g. the loss is decreasing. However, memory usage keeps growing until the run eventually fails as it nears 192GB (runs either fail on an allocation or simply crash).
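One way to quantify the growth described above is to sample the process's resident memory at intervals and check whether it ever plateaus. The sketch below is a minimal, standard-library-only illustration. `peak_rss_mib` and `is_steadily_growing` are hypothetical helper names, not part of TensorFlow or the object_detection code, and the 1 MiB threshold is an arbitrary assumption.

```python
import resource


def peak_rss_mib():
    """Peak resident set size of the current process, in MiB.

    Assumes Linux, where ru_maxrss is reported in KiB
    (macOS reports bytes, so the divisor would differ there).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def is_steadily_growing(samples, min_increase_mib=1.0):
    """True if each sample exceeds the previous one by at least min_increase_mib.

    Peak RSS never decreases, so a process that has reached a stable
    state produces (nearly) equal consecutive samples and returns False.
    """
    if len(samples) < 2:
        return False
    return all(
        later - earlier >= min_increase_mib
        for earlier, later in zip(samples, samples[1:])
    )
```

Calling `peak_rss_mib()` every few hundred steps and passing the collected values to `is_steadily_growing` would distinguish a warm-up phase that levels off from the unbounded growth reported here.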

## 3. Steps to reproduce

The memory leak occurs consistently on every run.

## 4. Expected behaviour

I expect memory usage to reach a stable level during training.

## 5. Additional context

log attached (lots of TF warnings)...
[train.log](https://github.com/tensorflow/models/files/4727180/train.log)

The tfrecord dataset is 2GB; please advise if you want me to provide it.

The rcnn resnet50 weights can be downloaded from the object_detection model zoo...


Note: The images are quite large (1500x2000). Examples are listed below; the Box part of each filename is the bounding box of the single object in that image.
-rw-rw-r-- 1 ec2-user ec2-user  458461 Jun  2 12:49 69bravo-e-mobo-c__2019-08-13T14_21_44_Box_160x1092x313x1218.jpg
-rw-rw-r-- 1 ec2-user ec2-user  460998 Jun  2 12:49 69bravo-e-mobo-c__2019-08-13T14_22_44_Box_156x1082x328x1205.jpg
-rw-rw-r-- 1 ec2-user ec2-user  451599 Jun  2 12:49 69bravo-e-mobo-c__2019-08-13T14_26_44_Box_175x1044x344x1220.jpg
-rw-rw-r-- 1 ec2-user ec2-user  465679 Jun  2 12:49 69bravo-e-mobo-c__2019-08-13T14_27_44_Box_137x1051x382x1245.jpg
...
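For anyone reproducing the dataset, the box embedded in each filename can be recovered with a regular expression. This is a minimal sketch: `parse_box` is a hypothetical helper, and since the order of the four numbers is not stated here, it returns them as a raw tuple rather than naming the coordinates.

```python
import re

# Matches the trailing "_Box_AxBxCxD.jpg" part of the filenames listed above.
BOX_PATTERN = re.compile(r"_Box_(\d+)x(\d+)x(\d+)x(\d+)\.jpg$")


def parse_box(filename):
    """Return the four integers from the Box suffix, in filename order."""
    match = BOX_PATTERN.search(filename)
    if match is None:
        raise ValueError("no Box suffix found in %r" % filename)
    return tuple(int(value) for value in match.groups())
```

For example, `parse_box("69bravo-e-mobo-c__2019-08-13T14_21_44_Box_160x1092x313x1218.jpg")` returns `(160, 1092, 313, 1218)`.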

These images have been prepared as sharded tfrecords:
-rw-rw-r-- 1 ec2-user ec2-user       35 Jun  4 01:40 smoke_label_map.pbtxt
-rw-rw-r-- 1 ec2-user ec2-user 46713957 Jun  4 01:39 smoke_train.record-00000-of-00050
-rw-rw-r-- 1 ec2-user ec2-user 42074074 Jun  4 01:39 smoke_train.record-00001-of-00050
-rw-rw-r-- 1 ec2-user ec2-user 42461823 Jun  4 01:39 smoke_train.record-00002-of-00050
-rw-rw-r-- 1 ec2-user ec2-user 41608315 Jun  4 01:39 smoke_train.record-00003-of-00050
-rw-rw-r-- 1 ec2-user ec2-user 41683135 Jun  4 01:39 smoke_train.record-00004-of-00050
...
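The shard names above follow a `-NNNNN-of-NNNNN` suffix scheme. The sketch below reproduces that naming and checks it against a wildcard pattern. `shard_names` is a hypothetical helper, and the `?????` pattern is only an assumption about how the shards might be matched, not a copy of the attached config.

```python
import fnmatch


def shard_names(base, num_shards):
    """Build names like 'smoke_train.record-00000-of-00050' for every shard."""
    return [
        "%s-%05d-of-%05d" % (base, index, num_shards)
        for index in range(num_shards)
    ]


names = shard_names("smoke_train.record", 50)

# Assumed wildcard pattern covering all 50 shards.
pattern = "smoke_train.record-?????-of-00050"
all_match = all(fnmatch.fnmatch(name, pattern) for name in names)
```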


## 6. System information

- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):  
       Linux-4.14.171-105.231.amzn1.x86_64-x86_64-with-glibc2.9
- Mobile device name if the issue happens on a mobile device:
- TensorFlow installed from (source or binary):
- TensorFlow version (use command below): v1.15.2-2-gbcc274e 1.15.2
- Python version: 3.6.6
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: cuda-10.0
- GPU model and memory: None


Memory Leak Training Faster-RCNN (Resnet 50) #8621
