Description
In ElasticDL, the master creates evaluation tasks and dispatches them to workers. After finishing an evaluation task, each worker reports the model outputs and labels to the master, and the master then updates `tf.keras.metrics` objects with the outputs and labels received over gRPC. However, the gRPC server is multithreaded, so it can receive outputs from multiple workers in parallel and update the `tf.keras.metrics` objects concurrently in each handler thread.
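For context, here is a minimal sketch of the pattern described above (the class and method names are illustrative, not the actual ElasticDL code): a shared metric object that gRPC handler threads update concurrently.

```python
import tensorflow as tf


class EvaluationMetricsHolder:
    """Holds a shared metric that gRPC handler threads update concurrently."""

    def __init__(self):
        self._auc = tf.keras.metrics.AUC()

    def report_evaluation_metrics(self, outputs, labels):
        # grpc.server executes each RPC on a thread from the thread pool it
        # was created with, so this method can run in many threads at once.
        self._auc.update_state(labels, outputs)
```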
The memory leak is very obvious when we set `futures.ThreadPoolExecutor(max_workers=64)` and disappears with `futures.ThreadPoolExecutor(max_workers=1)`. We suspect the memory leak occurs when executing `tf.keras.metrics.Metric.update_state` from multiple threads. So we wrote a unit test that reproduces the memory leak with multithreading and submitted it as TensorFlow issue 35044.
See `elasticdl/python/master/evaluation_service.py`, lines 76 to 77 at commit `f8a8dbb`.
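Below is a minimal sketch of such a reproduction test (the exact test is attached to the TensorFlow issue; the metric, tensor sizes, and iteration count here are illustrative). Watching the process RSS while it runs with `max_workers=64` versus `max_workers=1` shows the difference we observed.

```python
import unittest
from concurrent import futures

import numpy as np
import tensorflow as tf


class MetricUpdateMemoryTest(unittest.TestCase):
    def test_update_state_multithreaded(self):
        # Repeatedly call update_state() on one shared metric from many
        # threads; memory usage grows steadily with max_workers=64 but
        # stays flat with max_workers=1.
        metric = tf.keras.metrics.AUC()
        labels = np.random.randint(0, 2, size=(1024,)).astype(np.float32)
        outputs = np.random.uniform(size=(1024,)).astype(np.float32)

        def update():
            metric.update_state(labels, outputs)

        with futures.ThreadPoolExecutor(max_workers=64) as executor:
            tasks = [executor.submit(update) for _ in range(1000)]
            for t in tasks:
                t.result()


if __name__ == "__main__":
    unittest.main()
```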