[Feature]: abort conditions - there are several reasons a simulation can fail #272

@MaxBille

Description

Contact Details

No response

Is your proposal related to a problem?

Simulations can fail due to several reasons:

  • NaN: when NaN values occur, the simulation crashes. It should instead be aborted cleanly and the crash reported. The NaNReporter of #248 (Added NaNReporter and HighMaReporter to LettuceCFD) already does this.
  • time constraints: your maximum simulation wall time may be limited (for example, on an HPC cluster with a maximum job time of 24 h). When the scheduler kills your job, all data could be lost.
  • other reasons: whatever you can think of

It would be nice if the program had several abort conditions that emit messages with the reason, time, and circumstances of simulation termination. Termination can also be intentional (NaN encountered, or close to the time limit); the data gathered up to that point could then be saved (simulation state checkpoint and data export to files).

Describe the solution you'd like

It would be nice to have a set of simulation abort conditions that the Simulation class or a reporter checks from time to time; if a condition is met, the simulation is aborted and the remaining data and simulation state are handled as needed.

Example conditions:
0 = fine, 1 = time limit reached, 2 = NaN found, 3 = Ma > 0.3 (with a choice of whether the simulation should stop or just report), ...
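The example status codes above could be collected in a small enum, so that reporters and the Simulation class share one vocabulary. This is only a sketch; the name `AbortReason` and its members are hypothetical, not part of lettuce:

```python
from enum import IntEnum

class AbortReason(IntEnum):
    """Hypothetical status codes mirroring the example conditions above."""
    FINE = 0        # simulation healthy, keep running
    TIME_LIMIT = 1  # wall-time limit reached
    NAN_FOUND = 2   # NaN encountered in the flow field
    HIGH_MA = 3     # Mach number exceeded 0.3
```

An `IntEnum` keeps the plain integer codes (0, 1, 2, 3) usable as exit statuses while giving them readable names.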

Describe alternatives you've considered

The implementation depends on who handles what: the simulation, a reporter, or something else. A possible flow:

  1. Set the simulation status when an abort condition occurs (reporter?)
  2. Check for abort conditions (simulation?)
  3. Process remaining data: write a checkpoint and/or export data as needed (reporter or simulation?)
  4. End the simulation and report the abort condition and its circumstances
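The four steps above could be sketched as a main loop that periodically checks the conditions, writes a checkpoint on abort, and returns a status code. Everything here is an assumption for illustration: `run_with_abort_checks`, the identity-style `collision`/`streaming` callables, and the hard-coded checkpoint path are hypothetical, not lettuce API:

```python
import time
import torch

def run_with_abort_checks(f, collision, streaming, num_steps,
                          wall_time_limit_s, check_interval=100):
    """Hypothetical main loop: step the lattice, periodically check
    abort conditions, and return (status, step reached)."""
    start = time.time()
    for step in range(num_steps):
        f = streaming(collision(f))
        if step % check_interval == 0:
            # step 2: check abort conditions
            if torch.isnan(f).any():
                status = 2  # NaN found
            elif time.time() - start > wall_time_limit_s:
                status = 1  # wall-time limit reached
            else:
                continue
            # step 3: process remaining data -> write a checkpoint
            torch.save(f.cpu(), "checkpoint.pt")
            # step 4: end the simulation and report the condition
            print(f"aborting at step {step}: status {status}")
            return status, step
    return 0, num_steps  # 0 = fine, ran to completion
```

Checking only every `check_interval` steps keeps the overhead of `torch.isnan(...).any()` (which synchronizes with the GPU) away from the hot loop.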

Additional context

One could think about more features for the checkpointing functionality of Simulation -> new issue...
Torch does not like it when the device the saved data came from (e.g. "cuda:1") is not the device lettuce is currently working on (e.g. "cuda:0").

Example: simulation 1 runs on the CPU and writes a checkpoint; later, simulation 2, running on a GPU, imports that checkpoint as an initial condition. This causes a problem.
Another example: depending on the device association on multi-GPU machines, a similar issue can arise.

Always transferring the data to the CPU before writing a checkpoint, and transferring it back to the device currently in use after reading a checkpoint, circumvents this.
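That workaround could look like the following sketch. The helper names are hypothetical; the only real API used is `torch.save`/`torch.load`, where `map_location="cpu"` makes loading independent of the device the tensor was saved from:

```python
import torch

def write_checkpoint(f, path):
    """Move the population tensor to CPU before saving, so the
    checkpoint file carries no device association (e.g. "cuda:1")."""
    torch.save(f.detach().cpu(), path)

def read_checkpoint(path, device):
    """Load on CPU regardless of the saving device, then move the
    tensor to whatever device the current simulation uses."""
    f = torch.load(path, map_location="cpu")
    return f.to(device)
```

With this pattern, a checkpoint written on "cuda:1" (or the CPU) can be restored on "cuda:0" without a device mismatch error.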

Labels: enhancement (New feature or request)