[Feature]: abort conditions - there are several reasons a simulation can fail

### Contact Details

_No response_

### Is your proposal related to a problem?

Simulations can fail for several reasons:

- NaN: when NaN values occur, the simulation has effectively crashed. It should be aborted and the crash reported. The NaN reporter of #248 does that.
- Time constraints: your maximum simulation wall time may be limited (for example, when you are using an HPC cluster with a maximum job time of 24 h). When the scheduler kills your job, all data could be lost.
- Other reasons: whatever you can think of

It would be nice if the program had several abort conditions that report the reason, time, and circumstances of the simulation's termination. Termination can also be intentional (NaN encountered, or time limit nearly reached). In that case, the data up to this point could be saved (simulation state checkpoint and data export to files).

### Describe the solution you'd like

It would be nice to have a couple of simulation abort conditions that the simulation class or a reporter checks from time to time. If a condition is met, the simulation is aborted and the remaining data and simulation state are handled as needed.

Example conditions:
0 = fine, 1 = time limit reached, 2 = NaN found, 3 = Ma > 0.3 (with a choice of whether the simulation should stop or just report it), ...
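
To make the example conditions concrete, here is a minimal sketch of how such status codes and a combined check could look. The names `AbortCode` and `check_abort`, the `safety_margin_s` parameter, and the order in which the conditions are tested are assumptions for illustration only; none of this exists in lettuce.

```python
import math
import time
from enum import IntEnum


class AbortCode(IntEnum):
    """Hypothetical status codes mirroring the example conditions."""
    FINE = 0
    TIME_LIMIT = 1
    NAN_FOUND = 2
    MACH_TOO_HIGH = 3


def check_abort(values, max_mach, start_time, wall_limit_s,
                safety_margin_s=0.0, mach_limit=0.3, now=None):
    """Return the first abort code that applies, or AbortCode.FINE.

    values: iterable of floats sampled from the simulation state.
    start_time / now: seconds from a monotonic clock (now defaults to the
    current time; passing it explicitly makes the check testable).
    """
    now = time.monotonic() if now is None else now
    # NaN is tested first (an assumption): once NaN appears, the other
    # quantities derived from the state are no longer meaningful.
    if any(math.isnan(v) for v in values):
        return AbortCode.NAN_FOUND
    # Stop slightly before the scheduler's limit so data can still be saved.
    if now - start_time >= wall_limit_s - safety_margin_s:
        return AbortCode.TIME_LIMIT
    if max_mach > mach_limit:
        return AbortCode.MACH_TOO_HIGH
    return AbortCode.FINE
```

Whether a given code should stop the run or only be reported could be stored alongside the code, for example in a set of "stopping" codes that the caller consults.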

### Describe alternatives you've considered

The implementation depends on who handles what: the simulation, a reporter, or something else.
1. Set the simulation status for an abort condition (reporter?)
2. Check whether an abort condition is set (simulation?)
3. Process data: write a checkpoint and/or export data as needed (reporter or simulation?)
4. End the simulation and report the abort condition and circumstances
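
The four steps could be wired together roughly as follows. This is only a sketch of one possible split of responsibilities (reporters set the status, the loop checks it): `ToySimulation`, `nan_at_step_5`, and `run` are hypothetical stand-ins and do not reflect lettuce's actual `Simulation` or reporter classes.

```python
from dataclasses import dataclass, field


@dataclass
class ToySimulation:
    """Hypothetical stand-in for a simulation object."""
    step_count: int = 0
    status: int = 0  # 0 = fine, nonzero = abort condition
    log: list = field(default_factory=list)

    def step(self):
        self.step_count += 1

    def write_checkpoint(self):
        self.log.append(f"checkpoint at step {self.step_count}")


def nan_at_step_5(sim):
    """Hypothetical reporter: pretends a NaN appears at step 5 (code 2)."""
    return 2 if sim.step_count >= 5 else 0


def run(sim, num_steps, reporters, check_interval=1):
    """Advance the simulation and abort as soon as a reporter flags a condition."""
    for _ in range(num_steps):
        sim.step()
        if sim.step_count % check_interval != 0:
            continue
        # 1. Reporters set the simulation status.
        for reporter in reporters:
            code = reporter(sim)
            if code != 0:
                sim.status = code
                break
        # 2. The simulation checks the status.
        if sim.status != 0:
            # 3. Save whatever should be kept.
            sim.write_checkpoint()
            # 4. End the run and report the circumstances.
            sim.log.append(
                f"aborted with code {sim.status} at step {sim.step_count}"
            )
            break
    return sim.status
```

For a NaN abort, the state written in step 3 is already contaminated, so a real implementation might export diagnostics only, or keep the last checkpoint written before the NaN appeared.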

### Additional context

One could think about more features for the checkpointing functionality of the simulation -> new issue...;
Torch doesn't like it when the device from which the imported data was saved (e.g. "cuda:1") is not the same as the device lettuce is currently working on (e.g. "cuda:0").

Example: simulation 1 runs on the CPU and writes a checkpoint...; later, simulation 2, which runs on a GPU, should import that checkpoint as an initial condition. This would create a problem.
Another example: depending on the device association on multi-GPU computers, a similar issue might arise.

Always transferring data to the CPU before writing a checkpoint, and then transferring it to the device currently in use after reading the checkpoint, can circumvent this.
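
A minimal sketch of this CPU round trip, assuming the checkpoint is a single tensor. The helper names `save_checkpoint` and `load_checkpoint`, the file name, and the tensor shape are made up for illustration; `torch.save`, `torch.load` with `map_location`, and `Tensor.to` are standard PyTorch calls.

```python
import os
import tempfile

import torch


def save_checkpoint(f, path):
    """Move the tensor to the CPU before saving, so the file is not tied to a GPU."""
    torch.save(f.detach().cpu(), path)


def load_checkpoint(path, device):
    """Load onto the CPU first, then move to the device currently in use."""
    f = torch.load(path, map_location="cpu")
    return f.to(device)


# Use a GPU if one is available; otherwise the round trip stays on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
f = torch.rand(9, 4, 4)  # placeholder data, not a real lettuce state

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.pt")
    save_checkpoint(f.to(device), path)
    restored = load_checkpoint(path, device)
```

Passing the target device directly as `map_location` to `torch.load` might achieve the same on the reading side; saving from the CPU additionally keeps the file loadable on machines without any GPU.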
