Error out of GPU memory during model training #9345

Open
@NAEE09

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [N] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • [Y] I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • [Y] I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py

2. Describe the bug

I want to train the Mask R-CNN Inception ResNet V2 1024x1024 model. My dataset is converted to a .record file, the model pipeline is configured, and the GPU works when training other models. I tried limiting the GPU memory (which also works when training other models), but the error still appears.

Error:

2020-10-06 12:10:44.322216: E tensorflow/stream_executor/cuda/cuda_driver.cc:825] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-10-06 12:10:44.322569: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 17179869184
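
For scale, the size in the log can be converted to a more readable unit. This is only a quick arithmetic sketch, and the helper name `bytes_to_gib` is made up for illustration. Note that both log lines refer to host (pinned) memory, not to memory on the GPU itself.

```python
# Size of the failed allocation, copied from the cuda_driver.cc log line.
FAILED_ALLOC_BYTES = 17179869184


def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


# Hypothetical helper, used here only to make the log value readable.
print(f"{bytes_to_gib(FAILED_ALLOC_BYTES):.1f} GiB")  # prints "16.0 GiB"
```

So the process asked for 16 GiB of pinned host memory in a single allocation.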

3. Steps to reproduce

# From ~/models/research
python object_detection/model_main_tf2.py --pipeline_config_path=/home/robotronics/Projects/blm_Mask_RCNN/model_MaskRCNN/mask_rcnn_inception_resnet_v2_1024x1024_coco17_gpu-8/model.config --model_dir=/home/robotronics/Projects/blm_Mask_RCNN/blm/models/model --num_train_steps=5000 --sample_1_of_n_eval_examples=10 --alsologtostderr

4. Expected behavior

The model trains to completion.

5. Additional context

I tried to limit the GPU memory in both model_main_tf2.py and model_lib_v2.py by adding the following:

import tensorflow as tfl

gpus = tfl.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Cap the first GPU at 1024 MB by creating a single virtual device on it
    tfl.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tfl.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tfl.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

I also ran the examples from the documentation at https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/auto_examples/plot_object_detection_checkpoint.html, and they work as well.

6. System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Mobile device name if the issue happens on a mobile device:
  • TensorFlow installed from (source or binary): 2.2.0
  • TensorFlow version (use command below): 2.2.0
  • Python version: 3.6.9
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 10.1/7.6.5
  • GPU model and memory: GeForce GTX 1080
