
resource exhausted: oom when allocating tensor with shape[16,112,112,147] and type float on /job:localhost/replica:0/task:0/device:gpu:0 by allocator gpu_0_bfc #880


Description


On an NCv6 DSVM with the capacity below, training fails on the GPU with the error shown below.

My device config is:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 7812116703253316336
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 17054173950524152281
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_GPU:1"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 4943319543177070316
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 13240489552444645707
physical_device_desc: "device: XLA_CPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 272760832
locality {
bus_id: 1
links {
}
}
incarnation: 7492174393969061769
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 8438:00:00.0, compute capability: 3.7"
, name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 12026858701
locality {
bus_id: 1
links {
}
}
incarnation: 15937933815363559834
physical_device_desc: "device: 1, name: Tesla K80, pci bus id: a251:00:00.0, compute capability: 3.7"
]
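Note that the listing above already hints at the cause: /device:GPU:0 reports a memory_limit of only 272,760,832 bytes (~260 MiB), far below the ~12 GiB that /device:GPU:1 reports for the same Tesla K80 model, so most of GPU 0's memory appears to be held elsewhere. A quick back-of-envelope check (plain Python, no TensorFlow needed; the numbers are taken from the error and the listing) shows the failing activation alone eats a large share of that budget:

```python
# Size of the tensor the allocator failed on: shape [16, 112, 112, 147], float32.
batch, h, w, c = 16, 112, 112, 147
bytes_per_float = 4

tensor_bytes = batch * h * w * c * bytes_per_float
gpu0_limit = 272_760_832  # memory_limit reported for /device:GPU:0

print(f"tensor: {tensor_bytes / 2**20:.1f} MiB")            # ~112.5 MiB
print(f"GPU:0 budget: {gpu0_limit / 2**20:.1f} MiB")        # ~260.1 MiB
print(f"share of budget: {tensor_bytes / gpu0_limit:.0%}")  # ~43%
```

Since training also needs weights, gradients, and every other activation at the same time, a single ~112 MiB tensor against a ~260 MiB allocator budget makes the OOM unsurprising.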

The failure occurs at this step:
6. Execute steps
You can run through the Transfer Learning section, then skip to Create AccelContainerImage. By default, because the custom weights section takes much longer for training twice, it is not saved as executable cells. You can copy the code or change cell type to 'Code'.

6.a. Training using Transfer Learning

I get the error below:

WARNING: Logging before flag parsing goes to stderr.
W0321 14:33:10.677065 140373013649152 deprecation.py:323] From /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/utils.py:198: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0321 14:33:10.690853 140373013649152 deprecation.py:323] From /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/utils.py:205: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0321 14:33:10.695419 140373013649152 deprecation_wrapper.py:119] From /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/utils.py:110: The name tf.image.resize_bilinear is deprecated. Please use tf.compat.v1.image.resize_bilinear instead.

W0321 14:33:10.772550 140373013649152 deprecation_wrapper.py:119] From /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/accel_model.py:93: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0321 14:33:10.773201 140373013649152 deprecation_wrapper.py:119] From /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/accel_model.py:96: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

W0321 14:33:10.775439 140373013649152 deprecation_wrapper.py:119] From /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/accel_model.py:109: The name tf.train.import_meta_graph is deprecated. Please use tf.compat.v1.train.import_meta_graph instead.

W0321 14:33:16.247627 140373013649152 deprecation.py:506] From /anaconda/envs/py36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob.
W0321 14:33:16.261933 140373013649152 deprecation_wrapper.py:119] From /anaconda/envs/py36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0321 14:33:18.387907 140373013649152 deprecation.py:323] From /anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
W0321 14:33:19.522062 140373013649152 deprecation_wrapper.py:119] From /anaconda/envs/py36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3376: The name tf.log is deprecated. Please use tf.math.log instead.

W0321 14:33:19.527127 140373013649152 deprecation.py:323] From /anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py:180: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
0it [00:00, ?it/s]

ResourceExhaustedError Traceback (most recent call last)
/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1355 try:
-> 1356 return fn(*args)
1357 except errors.OpError as e:

/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
1340 return self._call_tf_sessionrun(
-> 1341 options, feed_dict, fetch_list, target_list, run_metadata)
1342

/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1428 self._session, options, feed_dict, fetch_list, target_list,
-> 1429 run_metadata)
1430

ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[16,112,112,147] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node resnet_v1_50/conv1/ExtractImagePatches}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Mean_1/_1047]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[16,112,112,147] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node resnet_v1_50/conv1/ExtractImagePatches}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

ResourceExhaustedError Traceback (most recent call last)
in
6 with sess.as_default():
7 in_images, image_tensors, features, preds, featurizer = construct_model(quantized=True)
----> 8 train_model(preds, in_images, img_train, label_train, is_retrain=False, train_epoch=10, learning_rate=0.01)
9 accuracy = test_model(preds, in_images, img_test, label_test)
10 print("Accuracy:", accuracy)

in train_model(preds, in_images, img_train, label_train, is_retrain, train_epoch, learning_rate)
31 feed_dict={in_images: contents,
32 in_labels: label_chunk,
---> 33 K.learning_phase(): 1})
34 avg_loss += loss / chunk_num
35 print("Epoch:", (epoch + 1), "loss = ", "{:.3f}".format(avg_loss))

/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
948 try:
949 result = self._run(None, fetches, feed_dict, options_ptr,
--> 950 run_metadata_ptr)
951 if run_metadata:
952 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1171 if final_fetches or final_targets or (handle and feed_dict_tensor):
1172 results = self._do_run(handle, final_targets, final_fetches,
-> 1173 feed_dict_tensor, options, run_metadata)
1174 else:
1175 results = []

/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1348 if handle is None:
1349 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1350 run_metadata)
1351 else:
1352 return self._do_call(_prun_fn, handle, feeds, fetches)

/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1368 pass
1369 message = error_interpolation.interpolate(message, self._graph)
-> 1370 raise type(e)(node_def, op, message)
1371
1372 def _extend_graph(self):

ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[16,112,112,147] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node resnet_v1_50/conv1/ExtractImagePatches (defined at /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/accel_model.py:109) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Mean_1/_1047]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[16,112,112,147] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node resnet_v1_50/conv1/ExtractImagePatches (defined at /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/accel_model.py:109) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Original stack trace for 'resnet_v1_50/conv1/ExtractImagePatches':
File "/anaconda/envs/py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/anaconda/envs/py36/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel/main.py", line 3, in
app.launch_new_instance()
File "/anaconda/envs/py36/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
File "/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 505, in start
self.io_loop.start()
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 148, in start
self.asyncio_loop.run_forever()
File "/anaconda/envs/py36/lib/python3.6/asyncio/base_events.py", line 438, in run_forever
self._run_once()
File "/anaconda/envs/py36/lib/python3.6/asyncio/base_events.py", line 1451, in _run_once
handle._run()
File "/anaconda/envs/py36/lib/python3.6/asyncio/events.py", line 145, in _run
self._callback(*self._args)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/ioloop.py", line 690, in
lambda f: self._run_callback(functools.partial(callback, future))
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/gen.py", line 787, in inner
self.run()
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/gen.py", line 748, in run
yielded = self.gen.send(value)
File "/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 365, in process_one
yield gen.maybe_future(dispatch(*args))
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/gen.py", line 209, in wrapper
yielded = next(result)
File "/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 272, in dispatch_shell
yield gen.maybe_future(handler(stream, idents, msg))
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/gen.py", line 209, in wrapper
yielded = next(result)
File "/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 542, in execute_request
user_expressions, allow_stdin,
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/gen.py", line 209, in wrapper
yielded = next(result)
File "/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel/ipkernel.py", line 294, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel/zmqshell.py", line 536, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2854, in run_cell
raw_cell, store_history, silent, shell_futures)
File "/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2880, in _run_cell
return runner(coro)
File "/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/async_helpers.py", line 68, in pseudo_sync_runner
coro.send(None)
File "/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3057, in run_cell_async
interactivity=interactivity, compiler=compiler, result=result)
File "/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3248, in run_ast_nodes
if (await self.run_code(code, result, async_=asy)):
File "/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3325, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 7, in
in_images, image_tensors, features, preds, featurizer = construct_model(quantized=True)
File "", line 13, in construct_model
features = featurizer.import_graph_def(input_tensor=image_tensors)
File "/anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/accel_model.py", line 109, in import_graph_def
self.__saver = tf.train.import_meta_graph(self.__metagraph_location, input_map=input_map, clear_devices=True)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1449, in import_meta_graph
**kwargs)[0]
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1473, in _import_meta_graph_with_return_elements
**kwargs))
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/meta_graph.py", line 857, in import_scoped_meta_graph_with_return_elements
return_elements=return_elements)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 443, in import_graph_def
_ProcessNewOps(graph)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 236, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3751, in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3751, in
for c_op in c_api_util.new_tf_operations(self)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3641, in _create_op_from_tf_operation
ret = Operation(c_op, self)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in init
self._traceback = tf_stack.extract_stack()
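Since the OOM tensor scales linearly with the batch size fed into train_model, the most direct workaround (my assumption, not confirmed by the notebook authors) would be to lower the training batch size, or to pin the process to the healthy GPU with CUDA_VISIBLE_DEVICES=1 before launching Jupyter. A quick sketch of how the allocation shrinks with batch size:

```python
# The conv1 ExtractImagePatches activation is [batch, 112, 112, 147] float32;
# its footprint scales linearly with the batch size.
def patch_tensor_mib(batch_size: int) -> float:
    return batch_size * 112 * 112 * 147 * 4 / 2**20

for bs in (16, 8, 4):
    print(f"batch {bs:2d}: {patch_tensor_mib(bs):6.1f} MiB")
```

Dropping from batch 16 (~112.5 MiB for this one tensor) to batch 4 (~28.1 MiB) may be enough to fit within GPU:0's reduced ~260 MiB budget, though finding what is holding the rest of GPU 0's memory (e.g. via nvidia-smi) would address the root cause.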
