Unable to load models from saved checkpoints #9924

Description

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/colab_tutorials/eager_few_shot_od_training_tf2_colab.ipynb

and

https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py

2. Describe the bug

I am able to train the model, and performance of the model in memory is good.

However, when I reload the model from a checkpoint, performance is very poor.

3. Steps to reproduce

Execute this Colab notebook to reproduce the problem:

https://colab.research.google.com/drive/1Izik5m3G8mWyP7y1UUJ96lAm2vOVvNgq?usp=sharing#scrollTo=dwkN1XOw99Ge

I have adapted the example eager training notebook to save checkpoints during training.

After training, the notebook performs inference on the training data using the in-memory model, which is in training mode. It then overlays the detected boxes and plots the images. These results look good; ducks are reliably detected in the images.

It then creates a new model and loads weights from the most recent checkpoint (taken after the last training step).

Inference is performed again, and this reloaded model, now in inference mode, does not perform well.

(Note: occasionally the reloaded model's inference results can actually look good. There seems to be some variation that I can only attribute to randomness, where the model sometimes 'accidentally' ends up in a state in which it performs well. If this occurs, re-running the notebook will likely reproduce the problem.)
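For reference, below is a minimal sketch of the save/reload flow the notebook follows. The pipeline config and checkpoint paths are placeholders, and it assumes the standard `object_detection` `model_builder` API plus `tf.train.Checkpoint`, as used in the linked notebook.

```python
import tensorflow as tf
from object_detection.utils import config_util
from object_detection.builders import model_builder

# Placeholder paths -- substitute your own pipeline config and checkpoint directory.
pipeline_config = 'ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/pipeline.config'
checkpoint_dir = 'my_checkpoints'

# Build the detection model from the pipeline config (training mode while training).
configs = config_util.get_configs_from_pipeline_file(pipeline_config)
detection_model = model_builder.build(
    model_config=configs['model'], is_training=True)

# ... training loop runs here; checkpoints are saved periodically ...
ckpt = tf.train.Checkpoint(model=detection_model)
manager = tf.train.CheckpointManager(ckpt, checkpoint_dir, max_to_keep=5)
manager.save()

# Later: build a fresh model in inference mode and restore the latest checkpoint.
inference_model = model_builder.build(
    model_config=configs['model'], is_training=False)
restore_ckpt = tf.train.Checkpoint(model=inference_model)
restore_ckpt.restore(tf.train.latest_checkpoint(checkpoint_dir)).expect_partial()

# Forward pass through the restored model for evaluation.
def detect(input_tensor):
  image, shapes = inference_model.preprocess(input_tensor)
  prediction_dict = inference_model.predict(image, shapes)
  return inference_model.postprocess(prediction_dict, shapes)
```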

4. Expected behavior

I expect a model reloaded from a checkpoint to perform very similarly to the model that was saved.

5. Additional context

The example notebook is also modified to train all layers (not just fine-tune the heads) and to train with batchnorm enabled. This is to exaggerate the issue. The duck example is a fairly easy one: when only the heads are fine-tuned, training reaches a stable enough state that the reloaded model typically works well. This is not representative of a real dataset.

My investigations keep leading me back to something odd happening with batchnorm. When I train with batchnorm frozen, performance is typically a little better and the disparity between the trained and reloaded models is smaller. In the linked duck example notebook, training with batchnorm frozen will typically produce good results on both inference passes. However, when training on a real dataset, performance is terrible.
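To illustrate why batchnorm is a suspect, here is a small self-contained Keras sketch (not the OD API code): in training mode a BatchNormalization layer normalizes with the current batch statistics and only slowly updates its moving averages, while in inference mode it uses the stored moving averages, so the two modes can produce very different outputs after a short training run.

```python
import numpy as np
import tensorflow as tf

# A lone BatchNormalization layer, fed data whose statistics differ from the
# layer's initial moving_mean=0 / moving_variance=1.
bn = tf.keras.layers.BatchNormalization(momentum=0.99)
x = np.random.normal(loc=5.0, scale=3.0, size=(32, 8)).astype(np.float32)

# training=True normalizes with the *batch* statistics and only nudges the
# moving averages, so a handful of steps leaves them far from the data.
for _ in range(10):
  bn(x, training=True)

print('moving_mean after 10 training-mode calls:', bn.moving_mean.numpy()[:3])

# training=False (inference) uses the stored moving averages instead, so the
# outputs of the two modes can differ substantially until the averages converge.
train_mode_out = bn(x, training=True)
infer_mode_out = bn(x, training=False)
print('max difference between modes:',
      float(tf.reduce_max(tf.abs(train_mode_out - infer_mode_out))))
```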

I also experience the same issue when training with model_main_tf2.py. I can't run inference on the in-memory model with this method, but the training loss gets very low, and when I load the checkpoint and try to run inference, nothing is detected.
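As a possible sanity check on the checkpoints produced by model_main_tf2.py, something like the following could confirm whether the model's variables are actually matched during restore. This is a sketch with placeholder paths; it assumes the checkpoint stores the detection model under the `model` attribute, as in the notebook's training loop.

```python
import tensorflow as tf
from object_detection.utils import config_util
from object_detection.builders import model_builder

# Placeholder paths -- model_dir is whatever was passed to model_main_tf2.py.
pipeline_config = 'my_model/pipeline.config'
model_dir = 'my_model/checkpoints'

configs = config_util.get_configs_from_pipeline_file(pipeline_config)
detection_model = model_builder.build(
    model_config=configs['model'], is_training=False)

ckpt = tf.train.Checkpoint(model=detection_model)
status = ckpt.restore(tf.train.latest_checkpoint(model_dir))

# Create the model's variables with a dummy forward pass, then check that the
# restore actually matched them (raises if the checkpoint restored nothing).
dummy = tf.zeros([1, 640, 640, 3], dtype=tf.float32)
image, shapes = detection_model.preprocess(dummy)
detection_model.postprocess(detection_model.predict(image, shapes), shapes)
status.assert_existing_objects_matched()
```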

6. System information

Configuration 1:

  • Windows 10
  • RTX2080TI 11GB
  • CUDA 10.1 / TF2.3 and CUDA 11.2 / TF2.4
  • Python 3.7.6

Configuration 2:

  • Colab notebook linked.
