
Training stops at the end when the model is being saved #782

Open

@K0pasz

Description

I use this Gaussian splatting tool in Google Colab because I do not have enough VRAM (6 GB) on my PC (when I ran it locally, it always stopped with an error indicating that I do not have enough VRAM). The problem appears when I set the number of iterations "too" high (e.g. 7000): the training process stops automatically when it tries to save the model and the created splat. Furthermore, I see a "^C" in the output, so it looks like the command is being terminated somehow.
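
On Colab, a bare "^C" in the cell output often means the runtime itself killed the process, typically when the session runs out of host RAM; that is only a suspicion here, but it is easy to check. A minimal diagnostic sketch (psutil and torch are preinstalled on Colab) that prints the remaining host RAM and GPU memory, to run in a separate cell shortly before the failing save:

# Diagnostic sketch: report free host RAM and GPU memory in the current session
import psutil
import torch

ram = psutil.virtual_memory()
print(f"Host RAM: {ram.available / 1e9:.1f} GB free of {ram.total / 1e9:.1f} GB")

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"GPU VRAM: {free_bytes / 1e9:.1f} GB free of {total_bytes / 1e9:.1f} GB")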

My Colab notebook looks like this:

# Clone the repository with its submodules and install the plyfile dependency
%cd /content
!git clone --recursive https://github.com/graphdeco-inria/gaussian-splatting
!pip install -q plyfile

# Build and install the CUDA rasterizer and simple-knn submodules
%cd /content/gaussian-splatting
!pip install -q /content/gaussian-splatting/submodules/diff-gaussian-rasterization
!pip install -q /content/gaussian-splatting/submodules/simple-knn

# Mount Google Drive, which holds the COLMAP reconstruction and input images
from google.colab import drive
drive.mount('/content/drive')

# Train for 10000 iterations, writing the model to /content/output
!python train.py -s /content/drive/MyDrive/for_nerf_by_sai_cli/colmap -i /content/drive/MyDrive/for_nerf_by_sai_cli/images -m /content/output --iterations 10000

The output:

2024-04-30 10:03:24.108816: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-30 10:03:24.108868: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-30 10:03:24.116418: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-30 10:03:24.135188: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-30 10:03:25.983543: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Optimizing /content/output
Output folder: /content/output [30/04 10:03:29]
Reading camera 150/150 [30/04 10:03:29]
Loading Training Cameras [30/04 10:03:29]
Loading Test Cameras [30/04 10:03:33]
Number of points at initialisation :  34739 [30/04 10:03:33]
Training progress:  70% 7000/10000 [09:23<07:06,  7.04it/s, Loss=0.0648267]
[ITER 7000] Evaluating test: L1 0.08124514473112006 PSNR 19.291277433696546 [30/04 10:12:59]

[ITER 7000] Evaluating train: L1 0.044808738678693776 PSNR 22.959835433959963 [30/04 10:13:01]

[ITER 7000] Saving Gaussians [30/04 10:13:01]
^C

I tried saving the output into the connected environment's folder instead, but the issue still remains.
If I run 5000 or fewer iterations, the output is saved correctly.
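
A workaround sketch, assuming train.py still accepts the --save_iterations argument from the upstream repository (worth verifying against your checkout): requesting extra, earlier checkpoints means at least one saved model survives even if the session is killed during the final save.

# Hypothetical re-run; --save_iterations is assumed from the upstream train.py argparse
!python train.py -s /content/drive/MyDrive/for_nerf_by_sai_cli/colmap \
    -i /content/drive/MyDrive/for_nerf_by_sai_cli/images \
    -m /content/output --iterations 10000 \
    --save_iterations 3000 5000 7000 10000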
