Description
Contact Details
No response
Is your proposal related to a problem?
Simulations can fail for several reasons:
- NaN: once NaN values occur, the simulation has effectively crashed. It should be aborted and the crash reported. The NaN reporter from #248 ("Added NaNReporter and HighMaReporter to LettuceCFD") does that.
- time constraints: the maximum wall time of a simulation may be limited (for example, on an HPC cluster with a maximum job time of 24 h). When the scheduler kills the job, all data could be lost.
- other reasons: whatever you can think of
It would be nice if the program had several abort conditions and reported the reason, time and circumstances of a simulation's termination. Termination can also be intentional (NaN encountered, or close to the time limit). In that case, the data up to this point could be saved (simulation state checkpoint and data export to files).
Describe the solution you'd like
It would be nice to have a couple of simulation abort conditions that the simulation class or a reporter checks at regular intervals. If a condition is met, the simulation is aborted and the remaining data and simulation state are handled as needed.
Example conditions:
0 = fine, 1 = time limit reached, 2 = NaN found, 3 = Ma > 0.3 (with a choice between stopping the simulation and only reporting it), ...
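The status codes above could be sketched as an enum plus one check function. This is only an illustration: `AbortStatus`, `check_status` and all parameter names are hypothetical and not part of lettuce, and the field values are passed as plain floats (a real implementation would more likely test a tensor, for example with `torch.isnan`).

```python
import math
import time
from enum import IntEnum


class AbortStatus(IntEnum):
    """Hypothetical status codes mirroring the example conditions."""
    FINE = 0
    TIME_LIMIT = 1
    NAN_FOUND = 2
    HIGH_MACH = 3


def check_status(values, mach_max, start_time, wall_limit_s,
                 now=None, mach_limit=0.3):
    """Return the first abort status that applies, or FINE.

    values       -- iterable of floats sampled from the flow field (assumption)
    mach_max     -- largest Mach number currently in the domain (assumption)
    start_time   -- time.monotonic() value taken at simulation start
    wall_limit_s -- allowed wall time in seconds
    now          -- current time; defaults to time.monotonic()
    """
    if now is None:
        now = time.monotonic()
    # NaN is checked first: a crashed simulation should not be reported
    # as a mere time-limit stop.
    if any(math.isnan(v) for v in values):
        return AbortStatus.NAN_FOUND
    if now - start_time >= wall_limit_s:
        return AbortStatus.TIME_LIMIT
    if mach_max > mach_limit:
        return AbortStatus.HIGH_MACH
    return AbortStatus.FINE
```

Whether `HIGH_MACH` stops the run or is only reported would then be a decision of the caller, not of the check itself.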
Describe alternatives you've considered
The implementation depends on who handles what: the simulation, a reporter, or something else.
- set the simulation status when an abort condition is met (reporter?)
- check whether an abort condition is set (simulation?)
- process data: write a checkpoint and/or export data as needed (reporter or simulation?)
- end the simulation and report the abort condition and circumstances
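The four steps above can be sketched as a single loop, here with the simulation doing the checking and calling a save hook. Everything in this sketch is hypothetical (`run_with_abort_checks`, `AbortReport`, the callback signatures); it only shows one possible split of responsibilities, not how lettuce is structured.

```python
import time
from dataclasses import dataclass


@dataclass
class AbortReport:
    """Reason and circumstances of an aborted run (hypothetical)."""
    reason: str
    step: int
    wall_time_s: float


def run_with_abort_checks(step_fn, conditions, save_fn, max_steps,
                          check_interval=10):
    """Advance the simulation and abort when any condition reports a reason.

    step_fn    -- callable(step) that performs one simulation step
    conditions -- callables(step) returning a reason string or None
    save_fn    -- callable(step) that writes a checkpoint / exports data
    Returns an AbortReport on abort, or None if max_steps was reached.
    """
    start = time.monotonic()
    for step in range(1, max_steps + 1):
        step_fn(step)
        # Conditions are only evaluated every check_interval steps.
        if step % check_interval != 0:
            continue
        for condition in conditions:
            reason = condition(step)
            if reason is not None:
                save_fn(step)  # process remaining data before stopping
                return AbortReport(reason, step, time.monotonic() - start)
    return None
```

For a NaN abort, saving the current state is of limited use, so a real `save_fn` might decide per reason what to write.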
Additional context
The checkpointing functionality of the simulation could also use more features (possibly a topic for a new issue):
Torch doesn't like it when the device from which the imported data was saved (e.g. "cuda:1") is not the same as the device lettuce is currently working on (e.g. "cuda:0"). By default, torch.load tries to restore tensors to the device they were saved from.
Example: simulation 1 runs on the CPU and writes a checkpoint; later, simulation 2, which runs on a GPU, should import that checkpoint as its initial condition. This would create a problem.
Another example: depending on the device association on multi-GPU machines, a similar issue might arise.
Always transferring the data to the CPU before writing a checkpoint, and transferring it to the device currently in use after reading a checkpoint, can circumvent this.
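A minimal sketch of that workaround, assuming the checkpoint consists of a single tensor `f`: `write_checkpoint` and `read_checkpoint` are hypothetical helper names, not lettuce functions. `torch.save`, `torch.load` with `map_location`, and `Tensor.to` are the actual PyTorch calls.

```python
import io

import torch


def write_checkpoint(f, file):
    """Save the tensor from the CPU so the file is not tied to a GPU."""
    torch.save({"f": f.detach().cpu()}, file)


def read_checkpoint(file, device):
    """Load onto the CPU first, then move to the device currently in use."""
    state = torch.load(file, map_location="cpu")
    return state["f"].to(device)


# Usage: an in-memory buffer stands in for a checkpoint file.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
f_original = torch.arange(6, dtype=torch.float64).reshape(2, 3).to(device)

buffer = io.BytesIO()
write_checkpoint(f_original, buffer)
buffer.seek(0)  # rewind before reading
f_restored = read_checkpoint(buffer, device)
```

Passing `map_location` on load already covers checkpoints written from another device; moving to the CPU before saving additionally keeps the files themselves device-independent.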