
resuming from checkpoint takes much memory #462

Open
allexeyj opened this issue Mar 10, 2023 · 20 comments

@allexeyj

I trained CoCa caption for one epoch and a 7 GB checkpoint was saved, although the original model that was being fine-tuned was only 2.5 GB.

Then I decided to continue training by writing the following command:
!python -m training.main \
    --dataset-type "csv" \
    --train-data "/kaggle/input/coca-train/train_data.csv" \
    --warmup 1000 \
    --batch-size 6 \
    --lr 1e-5 \
    --wd 0.1 \
    --epochs 1 \
    --workers 8 \
    --model "coca_ViT-L-14" \
    --resume "/kaggle/input/coca-finetune/epoch_1_1.pt" \
    --report-to "wandb" \
    --coca-contrastive-loss-weight 0 \
    --coca-caption-loss-weight 1 \
    --log-every-n-steps 500

But I ran out of RAM. However, if I pass the very same checkpoint via --pretrained instead of --resume, no error occurs and there is enough memory.

How can I continue training while staying within the allocated memory?

@gpucce
Contributor

gpucce commented Mar 10, 2023

Hi @alexcode4u, the size difference between the checkpoint and the model is because the checkpoint stores more than the state_dict alone (I think the optimizer state is the largest part), while the pretrained or fine-tuned models only contain the model's state_dict.
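
(Not part of the original reply: a minimal sketch of how to check what the checkpoint contains and strip it down to the weights only. The key names assume the dict that open_clip's training/main.py saved at the time, and the file names are just placeholders.)

```python
import torch

# Load the full training checkpoint on the CPU so it does not touch GPU memory.
ckpt = torch.load("epoch_1_1.pt", map_location="cpu")

# The training checkpoint holds more than the weights (epoch counter,
# optimizer state, ...), which is where the extra gigabytes come from.
print(list(ckpt.keys()))

# Keep only the model weights; the resulting file is roughly the size of the
# original pretrained model.
torch.save(ckpt["state_dict"], "epoch_1_1_weights_only.pt")
```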

I'm not sure about the out-of-memory error when resuming; can you share some more info, or maybe try --resume "latest"?

A random guess: maybe you were using gradient checkpointing in the first run and did not add it back in the resume run?

Some questions to understand this better: how was the checkpoint epoch_1_1.pt obtained, is it the output of a previous run that also used --epochs 1? And does the OOM occur immediately when you resume, or after some steps?

@allexeyj
Author

@gpucce Hi. https://pastebin.com/9psDYWhY These are all the logs I see when I use resume. The OOM happens immediately after the last line.

https://www.kaggle.com/code/leonidkulyk/openclip-coca-fine-tuning-w-b-optimized

This is how the checkpoint was obtained. I downloaded it from output/logs/2023_03_09-13_49_47-model_coca_ViT-L-14-lr_1e-05-b_6-j_2-p_amp/checkpoints.

@gpucce
Contributor

gpucce commented Mar 10, 2023

@alexcode4u if I understand correctly you are running this on Kaggle. I am not too familiar with it, so please bear with me a little while I try to help. The next question is: did you run fine-tuning and resume in the same notebook one after the other, or did you restart the notebook?

I think this might be at least in part related to Kaggle. Do you mind if we discuss it there and then post a solution here if we find one?

@allexeyj
Author

@gpucce

  1. First I ran the notebook (https://www.kaggle.com/code/leonidkulyk/openclip-coca-fine-tuning-w-b-optimized) without resume to train the first epoch. The only thing I changed was to make the dataset twice as large (added 100k pics). It trained for 10 hours; after that I downloaded the checkpoint and restarted the original (first) notebook with resume to train the second epoch. I restarted because the maximum session limit is 12 hours. The settings of the restarted notebook are completely the same (GPU, environment, ...). The only thing I changed was "pretrained" to "resume", passing the new checkpoint.

  2. Yes, we can solve this problem on Kaggle and then attach the solution here.

@allexeyj
Author

I have uploaded the 7 GB checkpoint as a dataset to Kaggle. Can I give you access to it?

@allexeyj
Author

allexeyj commented Mar 10, 2023

@gpucce Is there a way to generate several outputs using top_k/top_p sampling for CoCa? Maybe there is some parameter like "num_return_sequences" in Hugging Face? I have not found anything like that in the generate function. https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/coca_model.py#L167

Thanks for helping me and sorry for disturbing you.

@gpucce
Contributor

gpucce commented Mar 10, 2023

Hey @alexcode4u no worries at all, I'm happy to help. There isn't an option to do it automatically; a simple way could be to replicate the input several times.

I will answer better once I am at a computer!

@allexeyj
Author

@gpucce The main problem is generating several results. Could you help me do it?

@gpucce
Contributor

gpucce commented Mar 20, 2023

@alexcode4u I will try to help tomorrow. What do you mean by "the main problem is generating results"?

@allexeyj
Author

@gpucce

I solved the memory overflow problem when resuming training. Most of the difficulties arose with generating several results with CoCa. As far as I know, it is necessary to rewrite batch generation, but I have never done that myself. That's why it's the main problem.

@allexeyj
Author

@gpucce If you are busy, can you please at least briefly say where to start? I have never done this myself.

@gpucce
Contributor

gpucce commented Mar 21, 2023

@alexcode4u sorry, I am really busy. However, which generation_type are you using?

@allexeyj
Author

@gpucce beam search

@gpucce
Contributor

gpucce commented Mar 21, 2023

@alexcode4u unfortunately speeding up beam search is a very long piece of work, and I don't think it will be done in a short while. You could try top_p as the generation type; it should be faster.
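
(Not part of the original reply: a minimal sketch of switching to top-p sampling, assuming a loaded CoCa model and a batch of preprocessed image tensors, and assuming the generate signature in coca_model.py exposes generation_type and temperature arguments as it did around the time of this issue.)

```python
import torch

with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(
        images_batch,
        generation_type="top_p",  # sample instead of beam search
        temperature=1.0,          # optionally also pass top_p here; check the
                                  # default value in coca_model.py
    )
```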

@allexeyj
Author

@gpucce ok

@allexeyj
Author

@gpucce Should we expect that in the near future it will be possible to generate multiple results with CoCa in open_clip? I think this would be useful to many people.

@gpucce
Contributor

gpucce commented Mar 22, 2023

@alexcode4u if I understand your problem correctly, making beam_search faster needs a large rewrite and will take a lot of time.

However, can I ask you to explain how you are currently using generation, just in case there are simpler things that can be done?

@allexeyj
Author

@gpucce My problem is not about making beam search faster. Currently I generate one caption for each image in a batch like this:

# images_batch is a list of preprocessed image tensors
images_batch = torch.cat(images_batch, 0).to(device)
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(images_batch)

It generates one caption for each image in the batch, but I want to generate several captions for each image. That's my problem. As I understand it, model.generate uses beam search by default, which is why I said I was using beam search. But if generating several captions per image with beam search is a problem, I can easily switch from beam search to top-p sampling. That's not a problem.

@allexeyj
Author

@gpucce Now that I have explained the essence of the problem better, I will ask again: do you have any ready-made code for solving it? Or can you help in some other way without spending a lot of time? If it would take a long time, then don't; I understand that you are busy. Thank you for at least answering.

@gpucce
Contributor

gpucce commented Mar 22, 2023

@alexcode4u so if the issue is just doing it in a few lines of code, you can call .repeat or .repeat_interleave on the batch, although this will of course take longer.

Otherwise, generating more captions for a single image is not too different from generating captions for different images; this is why I mentioned making beam_search faster.

I hope I understood the issue; unfortunately I can't think of simple things I can do to make it better right away.
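
(Not part of the original reply: a sketch of the .repeat_interleave approach described above, combined with a stochastic generation type so the repeated copies do not all decode to the same caption. It reuses the model and images_batch from the earlier snippet; num_captions is just an illustrative variable, and the generation_type/top_p arguments assume the generate signature in coca_model.py at the time.)

```python
import torch

num_captions = 3  # captions desired per image

with torch.no_grad(), torch.cuda.amp.autocast():
    # Duplicate every image num_captions times along the batch dimension:
    # [img0, img0, img0, img1, img1, img1, ...]
    repeated = images_batch.repeat_interleave(num_captions, dim=0)

    # Use sampling so the duplicated rows can produce different captions.
    generated = model.generate(repeated, generation_type="top_p")

# Regroup afterwards: generated[i * num_captions:(i + 1) * num_captions]
# are the captions for the i-th image of the original batch.
```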
