
Training stops at the end when the model is being saved #782

Open

@K0pasz

Description

I use this Gaussian splatting tool in Google Colab because I do not have enough VRAM (6 GB) on my PC (when I ran it locally, it always stopped with an error indicating that I do not have enough VRAM). The problem appears when I set the number of iterations "too" high (e.g. 7000): the training process stops automatically when it tries to save the model and the created splat. Furthermore, I see a "^C" in the output, so it looks like the command is being terminated somehow.
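
On Colab, a bare "^C" in the cell output often means the runtime itself killed the process, typically when the session runs out of host RAM; that is only a suspicion here, but it is easy to check. A minimal diagnostic sketch (psutil and torch are preinstalled on Colab) that prints the remaining host RAM and GPU memory, to run in a separate cell shortly before the failing save:

# Diagnostic sketch: report free host RAM and GPU memory in the current session
import psutil
import torch

ram = psutil.virtual_memory()
print(f"Host RAM: {ram.available / 1e9:.1f} GB free of {ram.total / 1e9:.1f} GB")

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"GPU VRAM: {free_bytes / 1e9:.1f} GB free of {total_bytes / 1e9:.1f} GB")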

My Colab notebook looks like this:

# Clone the repository with its submodules and install the plyfile dependency
%cd /content
!git clone --recursive https://github.com/graphdeco-inria/gaussian-splatting
!pip install -q plyfile

# Build and install the CUDA rasterizer and simple-knn submodules
%cd /content/gaussian-splatting
!pip install -q /content/gaussian-splatting/submodules/diff-gaussian-rasterization
!pip install -q /content/gaussian-splatting/submodules/simple-knn

# Mount Google Drive, which holds the COLMAP reconstruction and input images
from google.colab import drive
drive.mount('/content/drive')

# Train for 10000 iterations, writing the model to /content/output
!python train.py -s /content/drive/MyDrive/for_nerf_by_sai_cli/colmap -i /content/drive/MyDrive/for_nerf_by_sai_cli/images -m /content/output --iterations 10000

The output:

2024-04-30 10:03:24.108816: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-30 10:03:24.108868: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-30 10:03:24.116418: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-30 10:03:24.135188: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-30 10:03:25.983543: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Optimizing /content/output
Output folder: /content/output [30/04 10:03:29]
Reading camera 150/150 [30/04 10:03:29]
Loading Training Cameras [30/04 10:03:29]
Loading Test Cameras [30/04 10:03:33]
Number of points at initialisation :  34739 [30/04 10:03:33]
Training progress:  70% 7000/10000 [09:23<07:06,  7.04it/s, Loss=0.0648267]
[ITER 7000] Evaluating test: L1 0.08124514473112006 PSNR 19.291277433696546 [30/04 10:12:59]

[ITER 7000] Evaluating train: L1 0.044808738678693776 PSNR 22.959835433959963 [30/04 10:13:01]

[ITER 7000] Saving Gaussians [30/04 10:13:01]
^C

I tried saving the output into the connected environment's folder instead, but the issue still remains.
If I run 5000 or fewer iterations, the output is saved correctly.
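
A workaround sketch, assuming train.py still accepts the --save_iterations argument from the upstream repository (worth verifying against your checkout): requesting extra, earlier checkpoints means at least one saved model survives even if the session is killed during the final save.

# Hypothetical re-run; --save_iterations is assumed from the upstream train.py argparse
!python train.py -s /content/drive/MyDrive/for_nerf_by_sai_cli/colmap \
    -i /content/drive/MyDrive/for_nerf_by_sai_cli/images \
    -m /content/output --iterations 10000 \
    --save_iterations 3000 5000 7000 10000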
