How to resume training on low priority VMs? #1575

Open
@johan12345

I am running an Azure ML pipeline for machine learning training on a low priority compute cluster. Occasionally, the VM is preempted and restarted at a later time. When that happens, I want to resume training from where it stopped by loading the last model I saved in the outputs directory.

This use case is also mentioned in the docs:

In general, we recommend using Low-Priority VMs for Batch workloads. You should also use them where interruptions are recoverable either through resubmits (for Batch Inferencing) or through restarts (for deep learning training with checkpointing).

However, although the saved models in the outputs directory are still shown in the Azure ML web interface after the VM restarts, my training script cannot find them in that directory. Are these files not downloaded before the script is restarted? If not, which directory can I use instead to store these files?
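
For reference, the resume logic in question can be sketched roughly as follows. This is a minimal illustration, not the actual training script: the `outputs` path, the `checkpoint_<epoch>.pt` file naming, and the helpers `find_latest_checkpoint` and `starting_epoch` are all assumptions made for the example.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme for saved models: checkpoint_<epoch>.pt
CHECKPOINT_PATTERN = re.compile(r"checkpoint_(\d+)\.pt$")


def find_latest_checkpoint(directory: str) -> Optional[Path]:
    """Return the checkpoint with the highest epoch number, or None if there is none."""
    root = Path(directory)
    if not root.is_dir():
        return None
    best_epoch, best_path = -1, None
    for path in root.iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path


def starting_epoch(directory: str = "outputs") -> int:
    """Return the epoch to resume from: 0 for a fresh start, else last saved epoch + 1."""
    latest = find_latest_checkpoint(directory)
    if latest is None:
        return 0
    return int(CHECKPOINT_PATTERN.match(latest.name).group(1)) + 1
```

With this logic, `starting_epoch()` returns 0 whenever the directory is empty or missing at startup, so the script would start from scratch if the earlier files are not present on the restarted VM.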
