
resuming from checkpoint takes much memory #462

Open
allexeyj opened this issue Mar 10, 2023 · 20 comments

@allexeyj

I trained CoCa caption for one epoch and a 7 GB checkpoint was saved, although the original model that was being fine-tuned was only 2.5 GB.

Then I decided to continue training by writing the following command:
!python -m training.main \
    --dataset-type "csv" \
    --train-data "/kaggle/input/coca-train/train_data.csv" \
    --warmup 1000 \
    --batch-size 6 \
    --lr 1e-5 \
    --wd 0.1 \
    --epochs 1 \
    --workers 8 \
    --model "coca_ViT-L-14" \
    --resume "/kaggle/input/coca-finetune/epoch_1_1.pt" \
    --report-to "wandb" \
    --coca-contrastive-loss-weight 0 \
    --coca-caption-loss-weight 1 \
    --log-every-n-steps 500

But I ran out of RAM. However, if I pass the very same checkpoint via --pretrained instead of --resume, no error occurs and there is enough memory.

How can I continue training while staying within the allocated memory?

@gpucce
Contributor

gpucce commented Mar 10, 2023

Hi @alexcode4u, the size difference between the checkpoint and the model is because the checkpoint stores more than the state_dict alone (I think the optimizer state is the largest part), while the pretrained or fine-tuned models only contain the model's state_dict.
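
(Not part of the original reply: a minimal sketch of how to check what the checkpoint contains and strip it down to the weights only. The key names assume the dict that open_clip's training/main.py saved at the time, and the file names are just placeholders.)

```python
import torch

# Load the full training checkpoint on the CPU so it does not touch GPU memory.
ckpt = torch.load("epoch_1_1.pt", map_location="cpu")

# The training checkpoint holds more than the weights (epoch counter,
# optimizer state, ...), which is where the extra gigabytes come from.
print(list(ckpt.keys()))

# Keep only the model weights; the resulting file is roughly the size of the
# original pretrained model.
torch.save(ckpt["state_dict"], "epoch_1_1_weights_only.pt")
```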

I'm not sure about the out-of-memory error when resuming; can you share some more info, or maybe try --resume "latest"?

A random guess: maybe you were using gradient checkpointing in the first run and did not add it back in the resume run?

Some questions to understand this better: how was the checkpoint epoch_1_1.pt obtained, is it the output of a previous run that also used --epochs 1? And does the OOM occur immediately when you resume, or after some steps?

@allexeyj
Author

@gpucce Hi. https://pastebin.com/9psDYWhY These are all the logs I see when I use resume. The OOM happens immediately after the last line.

https://www.kaggle.com/code/leonidkulyk/openclip-coca-fine-tuning-w-b-optimized

This is how the checkpoint was obtained. I downloaded it from output/logs/2023_03_09-13_49_47-model_coca_ViT-L-14-lr_1e-05-b_6-j_2-p_amp/checkpoints.

@gpucce
Contributor

gpucce commented Mar 10, 2023

@alexcode4u if I understand correctly you are running this on Kaggle. I am not too familiar with it, so please bear with me a little while I try to help. The next question is: did you run fine-tuning and resume in the same notebook one after the other, or did you restart the notebook?

I think this might be at least in part related to Kaggle. Do you mind if we discuss it there and then post a solution here if we find one?

@allexeyj
Author

@gpucce

  1. First I ran the notebook (https://www.kaggle.com/code/leonidkulyk/openclip-coca-fine-tuning-w-b-optimized) without resume to train the first epoch. The only thing I changed was to make the dataset twice as large (added 100k pics). It trained for 10 hours; after that I downloaded the checkpoint and restarted the original (first) notebook with resume to train the second epoch. I restarted because the maximum session limit is 12 hours. The settings of the restarted notebook are completely the same (GPU, environment, ...). The only thing I changed was "pretrained" to "resume", passing the new checkpoint.

  2. Yes, we can solve this problem on Kaggle and then attach the solution here.

@allexeyj
Author

I have uploaded the 7 GB checkpoint as a dataset to Kaggle. Can I give you access to it?

@allexeyj
Author

allexeyj commented Mar 10, 2023

@gpucce Is there a way to generate several outputs using top_k/top_p sampling for CoCa? Maybe there is some parameter like "num_return_sequences" in Hugging Face? I have not found anything like that in the generate function. https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/coca_model.py#L167

Thanks for helping me and sorry for disturbing you.

@gpucce
Contributor

gpucce commented Mar 10, 2023

Hey @alexcode4u no worries at all, I'm happy to help. There isn't an option to do it automatically; a simple way could be to replicate the input several times.

I will answer better once I am at a computer!

@allexeyj
Author

@gpucce The main problem is generating several results. Could you help me do it?

@gpucce
Contributor

gpucce commented Mar 20, 2023

@alexcode4u I will try to help tomorrow. What do you mean by "the main problem is generating results"?

@allexeyj
Author

@gpucce

I solved the memory overflow problem when resuming training. Most of the difficulties arose with generating several results with CoCa. As far as I know, it is necessary to rewrite batch generation, but I have never done that myself. That's why it's the main problem.

@allexeyj
Author

@gpucce If you are busy, can you please at least briefly say where to start? I have never done this myself.

@gpucce
Contributor

gpucce commented Mar 21, 2023

@alexcode4u sorry, I am really busy. However, which generation_type are you using?

@allexeyj
Author

@gpucce beam search

@gpucce
Contributor

gpucce commented Mar 21, 2023

@alexcode4u unfortunately speeding up beam search is a very long piece of work, and I don't think it will be done in a short while. You could try top_p as the generation type; it should be faster.
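
(Not part of the original reply: a minimal sketch of switching to top-p sampling, assuming a loaded CoCa model and a batch of preprocessed image tensors, and assuming the generate signature in coca_model.py exposes generation_type and temperature arguments as it did around the time of this issue.)

```python
import torch

with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(
        images_batch,
        generation_type="top_p",  # sample instead of beam search
        temperature=1.0,          # optionally also pass top_p here; check the
                                  # default value in coca_model.py
    )
```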

@allexeyj
Author

@gpucce ok

@allexeyj
Author

@gpucce Should we expect that in the near future it will be possible to generate multiple results with CoCa in open_clip? I think this would be useful to many people.

@gpucce
Contributor

gpucce commented Mar 22, 2023

@alexcode4u if I understand your problem correctly, making beam_search faster needs a large rewrite and will take a lot of time.

However, can I ask you to explain how you are currently using generation, just in case there are simpler things that can be done?

@allexeyj
Author

@gpucce My problem is not about making beam search faster. Currently I generate one caption for each image in a batch like this:

# images_batch is a list of preprocessed image tensors
images_batch = torch.cat(images_batch, 0).to(device)
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(images_batch)

It generates one caption for each image in the batch, but I want to generate several captions for each image. That's my problem. As I understand it, model.generate uses beam search by default, which is why I said I was using beam search. But if generating several captions per image with beam search is a problem, I can easily switch from beam search to top-p sampling. That's not a problem.

@allexeyj
Author

@gpucce Now that I have explained the essence of the problem better, I will ask again: do you have any ready-made code for solving it? Or can you help in some other way without spending a lot of time? If it would take a long time, then don't; I understand that you are busy. Thank you for at least answering.

@gpucce
Contributor

gpucce commented Mar 22, 2023

@alexcode4u so if the issue is just doing it in a few lines of code, you can call .repeat or .repeat_interleave on the batch, although this will of course take longer.

Otherwise, generating more captions for a single image is not too different from generating captions for different images; this is why I mentioned making beam_search faster.

I hope I understood the issue; unfortunately I can't think of simple things I can do to make it better right away.
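
(Not part of the original reply: a sketch of the .repeat_interleave approach described above, combined with a stochastic generation type so the repeated copies do not all decode to the same caption. It reuses the model and images_batch from the earlier snippet; num_captions is just an illustrative variable, and the generation_type/top_p arguments assume the generate signature in coca_model.py at the time.)

```python
import torch

num_captions = 3  # captions desired per image

with torch.no_grad(), torch.cuda.amp.autocast():
    # Duplicate every image num_captions times along the batch dimension:
    # [img0, img0, img0, img1, img1, img1, ...]
    repeated = images_batch.repeat_interleave(num_captions, dim=0)

    # Use sampling so the duplicated rows can produce different captions.
    generated = model.generate(repeated, generation_type="top_p")

# Regroup afterwards: generated[i * num_captions:(i + 1) * num_captions]
# are the captions for the i-th image of the original batch.
```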
