Description
In ElasticDL, the master creates evaluation tasks and dispatches them to workers. After finishing an evaluation task, each worker reports the model outputs and labels to the master, and the master then updates `tf.keras.metrics` objects with the outputs and labels received over gRPC. However, the gRPC server is multithreaded, so it can receive outputs from multiple workers in parallel and update the `tf.keras.metrics` objects concurrently in each handler thread.
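For context, here is a minimal sketch of the pattern described above (the class and method names are illustrative, not the actual ElasticDL code): a shared metric object that gRPC handler threads update concurrently.

```python
import tensorflow as tf


class EvaluationMetricsHolder:
    """Holds a shared metric that gRPC handler threads update concurrently."""

    def __init__(self):
        self._auc = tf.keras.metrics.AUC()

    def report_evaluation_metrics(self, outputs, labels):
        # grpc.server executes each RPC on a thread from the thread pool it
        # was created with, so this method can run in many threads at once.
        self._auc.update_state(labels, outputs)
```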
The memory leak is very obvious when we set `futures.ThreadPoolExecutor(max_workers=64)` and disappears with `futures.ThreadPoolExecutor(max_workers=1)`. We suspect the memory leak occurs when executing `tf.keras.metrics.Metric.update_state` from multiple threads. So we wrote a unit test that reproduces the memory leak with multithreading and submitted it as TensorFlow issue 35044.
See `elasticdl/python/master/evaluation_service.py`, lines 76 to 77 at commit `f8a8dbb`.
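Below is a minimal sketch of such a reproduction test (the exact test is attached to the TensorFlow issue; the metric, tensor sizes, and iteration count here are illustrative). Watching the process RSS while it runs with `max_workers=64` versus `max_workers=1` shows the difference we observed.

```python
import unittest
from concurrent import futures

import numpy as np
import tensorflow as tf


class MetricUpdateMemoryTest(unittest.TestCase):
    def test_update_state_multithreaded(self):
        # Repeatedly call update_state() on one shared metric from many
        # threads; memory usage grows steadily with max_workers=64 but
        # stays flat with max_workers=1.
        metric = tf.keras.metrics.AUC()
        labels = np.random.randint(0, 2, size=(1024,)).astype(np.float32)
        outputs = np.random.uniform(size=(1024,)).astype(np.float32)

        def update():
            metric.update_state(labels, outputs)

        with futures.ThreadPoolExecutor(max_workers=64) as executor:
            tasks = [executor.submit(update) for _ in range(1000)]
            for t in tasks:
                t.result()


if __name__ == "__main__":
    unittest.main()
```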