Description
I am running an Azure ML pipeline for machine learning training on a low-priority compute cluster, so the VM is occasionally preempted and restarted at a later time. When that happens, I want to resume training from where it stopped by loading the last model I saved in the outputs directory.
This use case is also mentioned in the docs:
In general, we recommend using Low-Priority VMs for Batch workloads. You should also use them where interruptions are recoverable either through resubmits (for Batch Inferencing) or through restarts (for deep learning training with checkpointing).
However, although the saved models in the outputs directory are still shown in the Azure ML web interface after the VM restarts, my training script cannot find them in that directory. Are these files not downloaded again before the script is restarted? Which directory can I use instead to store these files?
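
To make the intended behaviour concrete, here is a minimal sketch of the resume logic described above. It is illustrative only: the `model_epoch_<N>.pt` file naming and the helper names `find_latest_checkpoint` and `resume_epoch` are hypothetical, and the sketch uses only the Python standard library, so it says nothing about how Azure ML itself restores files.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme for saved models: model_epoch_<N>.pt
CHECKPOINT_PATTERN = re.compile(r"^model_epoch_(\d+)\.pt$")


def find_latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest epoch number, or None if none exist."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    best_epoch, best_path = -1, None
    for path in directory.iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        # Compare epochs numerically so that epoch 10 ranks above epoch 2.
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path


def resume_epoch(checkpoint_dir: str) -> int:
    """Return the epoch to start from: 0 on a fresh run, last saved epoch + 1 otherwise."""
    latest = find_latest_checkpoint(checkpoint_dir)
    if latest is None:
        return 0
    return int(CHECKPOINT_PATTERN.match(latest.name).group(1)) + 1
```

Calling `resume_epoch("outputs")` at the start of the training script would be expected to return the next epoch after a restart. With the behaviour described above, where the script does not see the earlier files, it would return 0 and training would start from scratch.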