
resource exhausted: oom when allocating tensor with shape[16,112,112,147] and type float on /job:localhost/replica:0/task:0/device:gpu:0 by allocator gpu_0_bfc #880


Description


On an NCv6 DSVM with the capacity below, training fails on the GPU with the error shown below.

My device config is:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 7812116703253316336
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 17054173950524152281
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_GPU:1"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 4943319543177070316
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 13240489552444645707
physical_device_desc: "device: XLA_CPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 272760832
locality {
bus_id: 1
links {
}
}
incarnation: 7492174393969061769
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 8438:00:00.0, compute capability: 3.7"
, name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 12026858701
locality {
bus_id: 1
links {
}
}
incarnation: 15937933815363559834
physical_device_desc: "device: 1, name: Tesla K80, pci bus id: a251:00:00.0, compute capability: 3.7"
]
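Note that the listing above already hints at the cause: /device:GPU:0 reports a memory_limit of only 272,760,832 bytes (~260 MiB), far below the ~12 GiB that /device:GPU:1 reports for the same Tesla K80 model, so most of GPU 0's memory appears to be held elsewhere. A quick back-of-envelope check (plain Python, no TensorFlow needed; the numbers are taken from the error and the listing) shows the failing activation alone eats a large share of that budget:

```python
# Size of the tensor the allocator failed on: shape [16, 112, 112, 147], float32.
batch, h, w, c = 16, 112, 112, 147
bytes_per_float = 4

tensor_bytes = batch * h * w * c * bytes_per_float
gpu0_limit = 272_760_832  # memory_limit reported for /device:GPU:0

print(f"tensor: {tensor_bytes / 2**20:.1f} MiB")            # ~112.5 MiB
print(f"GPU:0 budget: {gpu0_limit / 2**20:.1f} MiB")        # ~260.1 MiB
print(f"share of budget: {tensor_bytes / gpu0_limit:.0%}")  # ~43%
```

Since training also needs weights, gradients, and every other activation at the same time, a single ~112 MiB tensor against a ~260 MiB allocator budget makes the OOM unsurprising.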

The failure occurs at this step:
6. Execute steps
You can run through the Transfer Learning section, then skip to Create AccelContainerImage. By default, because the custom weights section takes much longer for training twice, it is not saved as executable cells. You can copy the code or change cell type to 'Code'.

6.a. Training using Transfer Learning

I get the error below:

WARNING: Logging before flag parsing goes to stderr.
W0321 14:33:10.677065 140373013649152 deprecation.py:323] From /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/utils.py:198: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0321 14:33:10.690853 140373013649152 deprecation.py:323] From /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/utils.py:205: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0321 14:33:10.695419 140373013649152 deprecation_wrapper.py:119] From /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/utils.py:110: The name tf.image.resize_bilinear is deprecated. Please use tf.compat.v1.image.resize_bilinear instead.

W0321 14:33:10.772550 140373013649152 deprecation_wrapper.py:119] From /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/accel_model.py:93: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0321 14:33:10.773201 140373013649152 deprecation_wrapper.py:119] From /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/accel_model.py:96: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

W0321 14:33:10.775439 140373013649152 deprecation_wrapper.py:119] From /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/accel_model.py:109: The name tf.train.import_meta_graph is deprecated. Please use tf.compat.v1.train.import_meta_graph instead.

W0321 14:33:16.247627 140373013649152 deprecation.py:506] From /anaconda/envs/py36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob.
W0321 14:33:16.261933 140373013649152 deprecation_wrapper.py:119] From /anaconda/envs/py36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0321 14:33:18.387907 140373013649152 deprecation.py:323] From /anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
W0321 14:33:19.522062 140373013649152 deprecation_wrapper.py:119] From /anaconda/envs/py36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3376: The name tf.log is deprecated. Please use tf.math.log instead.

W0321 14:33:19.527127 140373013649152 deprecation.py:323] From /anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py:180: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
0it [00:00, ?it/s]

ResourceExhaustedError Traceback (most recent call last)
/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1355 try:
-> 1356 return fn(*args)
1357 except errors.OpError as e:

/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
1340 return self._call_tf_sessionrun(
-> 1341 options, feed_dict, fetch_list, target_list, run_metadata)
1342

/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1428 self._session, options, feed_dict, fetch_list, target_list,
-> 1429 run_metadata)
1430

ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[16,112,112,147] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node resnet_v1_50/conv1/ExtractImagePatches}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Mean_1/_1047]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[16,112,112,147] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node resnet_v1_50/conv1/ExtractImagePatches}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

ResourceExhaustedError Traceback (most recent call last)
in
6 with sess.as_default():
7 in_images, image_tensors, features, preds, featurizer = construct_model(quantized=True)
----> 8 train_model(preds, in_images, img_train, label_train, is_retrain=False, train_epoch=10, learning_rate=0.01)
9 accuracy = test_model(preds, in_images, img_test, label_test)
10 print("Accuracy:", accuracy)

in train_model(preds, in_images, img_train, label_train, is_retrain, train_epoch, learning_rate)
31 feed_dict={in_images: contents,
32 in_labels: label_chunk,
---> 33 K.learning_phase(): 1})
34 avg_loss += loss / chunk_num
35 print("Epoch:", (epoch + 1), "loss = ", "{:.3f}".format(avg_loss))

/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
948 try:
949 result = self._run(None, fetches, feed_dict, options_ptr,
--> 950 run_metadata_ptr)
951 if run_metadata:
952 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1171 if final_fetches or final_targets or (handle and feed_dict_tensor):
1172 results = self._do_run(handle, final_targets, final_fetches,
-> 1173 feed_dict_tensor, options, run_metadata)
1174 else:
1175 results = []

/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1348 if handle is None:
1349 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1350 run_metadata)
1351 else:
1352 return self._do_call(_prun_fn, handle, feeds, fetches)

/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1368 pass
1369 message = error_interpolation.interpolate(message, self._graph)
-> 1370 raise type(e)(node_def, op, message)
1371
1372 def _extend_graph(self):

ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[16,112,112,147] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node resnet_v1_50/conv1/ExtractImagePatches (defined at /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/accel_model.py:109) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Mean_1/_1047]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[16,112,112,147] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node resnet_v1_50/conv1/ExtractImagePatches (defined at /anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/accel_model.py:109) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Original stack trace for 'resnet_v1_50/conv1/ExtractImagePatches':
File "/anaconda/envs/py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/anaconda/envs/py36/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel/main.py", line 3, in
app.launch_new_instance()
File "/anaconda/envs/py36/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
File "/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 505, in start
self.io_loop.start()
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 148, in start
self.asyncio_loop.run_forever()
File "/anaconda/envs/py36/lib/python3.6/asyncio/base_events.py", line 438, in run_forever
self._run_once()
File "/anaconda/envs/py36/lib/python3.6/asyncio/base_events.py", line 1451, in _run_once
handle._run()
File "/anaconda/envs/py36/lib/python3.6/asyncio/events.py", line 145, in _run
self._callback(*self._args)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/ioloop.py", line 690, in
lambda f: self._run_callback(functools.partial(callback, future))
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/gen.py", line 787, in inner
self.run()
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/gen.py", line 748, in run
yielded = self.gen.send(value)
File "/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 365, in process_one
yield gen.maybe_future(dispatch(*args))
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/gen.py", line 209, in wrapper
yielded = next(result)
File "/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 272, in dispatch_shell
yield gen.maybe_future(handler(stream, idents, msg))
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/gen.py", line 209, in wrapper
yielded = next(result)
File "/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 542, in execute_request
user_expressions, allow_stdin,
File "/anaconda/envs/py36/lib/python3.6/site-packages/tornado/gen.py", line 209, in wrapper
yielded = next(result)
File "/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel/ipkernel.py", line 294, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel/zmqshell.py", line 536, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2854, in run_cell
raw_cell, store_history, silent, shell_futures)
File "/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2880, in _run_cell
return runner(coro)
File "/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/async_helpers.py", line 68, in pseudo_sync_runner
coro.send(None)
File "/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3057, in run_cell_async
interactivity=interactivity, compiler=compiler, result=result)
File "/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3248, in run_ast_nodes
if (await self.run_code(code, result, async_=asy)):
File "/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3325, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 7, in
in_images, image_tensors, features, preds, featurizer = construct_model(quantized=True)
File "", line 13, in construct_model
features = featurizer.import_graph_def(input_tensor=image_tensors)
File "/anaconda/envs/py36/lib/python3.6/site-packages/azureml/accel/models/accel_model.py", line 109, in import_graph_def
self.__saver = tf.train.import_meta_graph(self.__metagraph_location, input_map=input_map, clear_devices=True)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1449, in import_meta_graph
**kwargs)[0]
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1473, in _import_meta_graph_with_return_elements
**kwargs))
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/meta_graph.py", line 857, in import_scoped_meta_graph_with_return_elements
return_elements=return_elements)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 443, in import_graph_def
_ProcessNewOps(graph)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 236, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3751, in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3751, in
for c_op in c_api_util.new_tf_operations(self)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3641, in _create_op_from_tf_operation
ret = Operation(c_op, self)
File "/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in init
self._traceback = tf_stack.extract_stack()
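Since the OOM tensor scales linearly with the batch size fed into train_model, the most direct workaround (my assumption, not confirmed by the notebook authors) would be to lower the training batch size, or to pin the process to the healthy GPU with CUDA_VISIBLE_DEVICES=1 before launching Jupyter. A quick sketch of how the allocation shrinks with batch size:

```python
# The conv1 ExtractImagePatches activation is [batch, 112, 112, 147] float32;
# its footprint scales linearly with the batch size.
def patch_tensor_mib(batch_size: int) -> float:
    return batch_size * 112 * 112 * 147 * 4 / 2**20

for bs in (16, 8, 4):
    print(f"batch {bs:2d}: {patch_tensor_mib(bs):6.1f} MiB")
```

Dropping from batch 16 (~112.5 MiB for this one tensor) to batch 4 (~28.1 MiB) may be enough to fit within GPU:0's reduced ~260 MiB budget, though finding what is holding the rest of GPU 0's memory (e.g. via nvidia-smi) would address the root cause.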
