Resuming from a checkpoint takes too much memory #462
Comments
Hi @alexcode4u, the size difference between the checkpoint and the model is because the checkpoint stores more than the state_dict alone (I think the gradients are the largest part), while the pre-trained or fine-tuned models contain only the model's state_dict. For the out-of-memory error when resuming I am not sure; can you share some more info, or maybe try with --resume "latest"? A random guess: were you perhaps using gradient checkpointing in the first run and did not add it back in the resume run? Some questions to try and understand better: how is the checkpoint …
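(As an aside, here is a minimal sketch of how one could check which parts of such a checkpoint dominate its size. It assumes the checkpoint is an ordinary dict saved with torch.save; key names such as "state_dict" and "optimizer" are typical but depend on the open_clip version, and the path is a placeholder.)

```python
# Sketch: inspect which entries of a training checkpoint take up the most space.
# Key names ("state_dict", "optimizer", ...) depend on the open_clip version;
# the checkpoint path below is a placeholder.
import torch

ckpt = torch.load("epoch_1_1.pt", map_location="cpu")  # placeholder path
print(list(ckpt.keys()))

def tensor_bytes(obj):
    """Recursively sum the byte size of every tensor inside a nested structure."""
    if torch.is_tensor(obj):
        return obj.numel() * obj.element_size()
    if isinstance(obj, dict):
        return sum(tensor_bytes(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        return sum(tensor_bytes(v) for v in obj)
    return 0

for key, value in ckpt.items():
    print(f"{key}: {tensor_bytes(value) / 1e9:.2f} GB")
```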
@gpucce Hi. https://pastebin.com/9psDYWhY These are all the logs I see when I resume; immediately after the last line I get the OOM. https://www.kaggle.com/code/leonidkulyk/openclip-coca-fine-tuning-w-b-optimized This is how the checkpoint was obtained. I downloaded it from output/logs/2023_03_09-13_49_47-model_coca_ViT-L-14-lr_1e-05-b_6-j_2-p_amp/checkpoints.
@alexcode4u If I understand correctly, you are running this on Kaggle. I am not too familiar with it, so please bear with me a little while I try to help. The next question is: did you run the fine-tuning and the resume in the same notebook, one after the other, or did you restart the notebook? I think this might be at least in part related to Kaggle. Do you mind if we discuss it there and then post a solution here if we find one?
I have uploaded the 7GB checkpoint as a dataset to Kaggle. Can I give you access to it?
@gpucce Is there a way to generate several outputs using top_k/top_p sampling for CoCa? Maybe there is a parameter like "num_return_sequences" in Hugging Face? I have not found anything like that in the generate function. https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/coca_model.py#L167 Thanks for helping me, and sorry for the trouble.
Hey @alexcode4u, no worries at all, I'm happy to help. There isn't an option to do it automatically; a simple way could be to replicate the input several times. I will answer better once I am at a computer!
@gpucce The main problem is generating several results. Could you help me do it?
@alexcode4u I will try to help tomorrow. What do you mean by "the main problem is generating results"?
I solved the problem with the memory overflow when resuming training. Most of the difficulties arose with generating several results using CoCa. As far as I know, it is necessary to rewrite batch generation, but I have never done that myself. That's why it's the main problem.
@gpucce If you are busy, can you please at least briefly say where to start? I have never done it myself.
@alexcode4u Sorry, I am really busy. However, which generation_type are you using?
@gpucce beam search
@alexcode4u Unfortunately, speeding up beam search is a lot of work, and I don't think it will be done in a short while. You could try top_p as the generation type; it should be faster.
@gpucce ok
@gpucce Should we expect that in the near future it will be possible to generate multiple results with CoCa in open_clip? I think this would be useful to many.
@alexcode4u If I understand your problem correctly, making beam_search faster needs a large rewrite and will take a lot of time. However, can I ask you to explain how you are currently using generation, just in case there are simpler things that can be done?
@gpucce My problem is not about making beam search faster. Currently I generate one caption for a batch of images in this way:
It generates one caption for each image in the batch, but I want to generate several captions for each image in the batch. That's my problem. As I understand it, model.generate uses beam search by default, which is why I said I used beam search. But if generating several captions per image with beam search is a problem, I can easily switch from beam search to top-p sampling. That's not a problem.
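(For context, a minimal sketch of that kind of batched single-caption generation, following the usage pattern from the open_clip README; the pretrained tag and image paths are placeholders, not the exact code from the comment above.)

```python
# Sketch of batched single-caption generation with open_clip's CoCa, following
# the README usage; the pretrained tag and image files below are placeholders.
import torch
import open_clip
from PIL import Image

model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="mscoco_finetuned_laion2B-s13B-b90k",  # placeholder weights
)
model.eval()

# Build a batch of preprocessed images (placeholder file names).
paths = ["img0.jpg", "img1.jpg"]
batch = torch.stack([transform(Image.open(p).convert("RGB")) for p in paths])

with torch.no_grad():
    generated = model.generate(batch)  # default generation type is beam search

for tokens in generated:
    caption = open_clip.decode(tokens)
    print(caption.split("<end_of_text>")[0].replace("<start_of_text>", ""))
```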
@gpucce Now that I have explained the essence of the problem better, I will ask again: do you have any ready code for solving it, or can you help in some other way without spending a lot of time? If it would take a long time, then don't; I understand that you are busy. Thank you for at least answering.
@alexcode4u So if the issue is just doing it in a few lines of code, you can call .repeat or .repeat_interleave on the batch; however, this will of course take longer. Otherwise, generating more captions for a single image is not too different from generating captions for different images, which is why I mentioned making beam_search faster. I hope I understood the issue; unfortunately, I can't think of anything simple I can do to make it better right away.
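(A minimal sketch of the .repeat_interleave suggestion above, reusing `model`, `batch`, and the decoding step from the earlier sketch. The parameter names generation_type and top_p follow the discussion in this thread; the values are placeholders and should be checked against the installed open_clip version.)

```python
# Sketch: duplicate each image in the batch and use a stochastic generation
# type so the duplicates yield different captions. Reuses `model` and `batch`
# from the previous sketch; top_p value and num_captions are placeholders.
import torch
import open_clip

num_captions = 3  # placeholder: captions wanted per image

# [B, 3, H, W] -> [B * num_captions, 3, H, W], each image repeated consecutively
expanded = batch.repeat_interleave(num_captions, dim=0)

with torch.no_grad():
    generated = model.generate(
        expanded,
        generation_type="top_p",  # sampling, so repeated inputs can differ
        top_p=0.9,                # placeholder value
    )

def clean(tokens):
    """Decode a token sequence and strip the start/end markers."""
    text = open_clip.decode(tokens)
    return text.split("<end_of_text>")[0].replace("<start_of_text>", "")

# Regroup so captions[i] holds the num_captions captions for the i-th image.
captions = [
    [clean(t) for t in generated[i * num_captions:(i + 1) * num_captions]]
    for i in range(batch.shape[0])
]
print(captions)
```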
I trained the CoCa captioning model for one epoch and a 7GB checkpoint was saved, although the original model that was fine-tuned is 2.5GB.
Then I decided to continue training by writing the following command:
!python -m training.main \
    --dataset-type "csv" \
    --train-data "/kaggle/input/coca-train/train_data.csv" \
    --warmup 1000 \
    --batch-size 6 \
    --lr 1e-5 \
    --wd 0.1 \
    --epochs 1 \
    --workers 8 \
    --model "coca_ViT-L-14" \
    --resume "/kaggle/input/coca-finetune/epoch_1_1.pt" \
    --report-to "wandb" \
    --coca-contrastive-loss-weight 0 \
    --coca-caption-loss-weight 1 \
    --log-every-n-steps 500
But I didn't have enough RAM. However, if I write "pretrained" instead of "resume" and point it at the very same checkpoint, no errors occur and there is enough memory.
How can I continue training and fit within the allocated memory?
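(A hedged sketch of one way to act on the observation above that --pretrained loads the same checkpoint without problems: save a weights-only copy of the checkpoint and pass that much smaller file to --pretrained. This assumes the checkpoint is a dict with a "state_dict" entry; key names may differ across open_clip versions, and the paths are placeholders.)

```python
# Sketch: save a weights-only copy of the 7GB training checkpoint, assuming it
# is a dict with a "state_dict" entry (key names may vary across open_clip
# versions). The output path is a placeholder.
import torch

ckpt = torch.load("/kaggle/input/coca-finetune/epoch_1_1.pt", map_location="cpu")
slim = {"state_dict": ckpt["state_dict"]}
torch.save(slim, "/kaggle/working/epoch_1_weights_only.pt")
```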