Hi. First, some context so you can judge my experience level :D — I started experimenting with LoRA training about 2 months ago and have done roughly 50 training runs in total. Not much, but not a total newbie.

Here is my general problem: I cannot get an extracted LoRA from a finetune that captures the subject (I am training a character with multiple outfits). I have also trained a LoRA (without a finetune) on the exact same dataset and captioning, and it turned out well (it's up on CivitAI right now). The training set and captions are the same for all runs: 22 images, manually captioned and masked.

Finetune, General Type 1, with about 15 training runs with mixed settings in between: I tried a higher learning rate, but already at 0.00002 it breaks completely very fast; the highest LR I could use was about 0.000017 before it broke. I also tried training with D-Adapt Lion at LR 1, with the same result: it wasn't able to reproduce the outfits at all.

If you have an idea what the problem might be, please share. I guess I didn't expect finetune training to be that different from LoRA training.
Generally speaking, updating the weights during a finetune is a destructive event. Training at batch size 1 is possible for a LoRA, but it is still incredibly ill advised (the gradients are incredibly noisy; please use a batch size of at least 2) — a LoRA gets away with it only because it does not touch the original model weights. Finetuning should really only be considered if you have a 24GB VRAM card and tens of thousands of images.

If you want to try this at batch size 1, you can add gradient accumulation steps to even out the updates, though this is also not ideal: gradient accumulation gives somewhat worse results than a plain larger batch size and lowers performance. Accumulation does, however, let you use higher learning rates without breaking the model as quickly. For example, with 10 accumulation steps the weights are only updated every 10th step.

There are also new optimizers coming out. The Facebook schedule-free optimizer may help when it's ready; it seemed really good at learning details in my initial testing. The adaptive optimizers do not work well with finetunes under any settings I have found. Since a LoRA starts out empty, it is much easier for the adaptive optimizers to work with in that regard.

You may also need to balance your input set to put more focus on the outfit it is failing to learn.

I tried finetuning once on a 16GB card for SDXL and just went back to LoRAs. I did not see the benefit, it took too long on my card, and you really should use the highest-quality settings when doing a finetune, whereas a LoRA can have decent quality even with FP8 weights.

One suggestion I have seen on the Discord is to train a LoRA at a high network rank (as you have done) and then use the kohya tools to trim it down to a smaller network rank. This can remove a lot of the extra data you may not want and lets the LoRA focus on the key aspects you trained.
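To make the accumulation point concrete, here is a minimal toy sketch (not from any trainer, just a single-weight regression I made up to illustrate the mechanism): gradients from several micro-batches are summed and averaged, and the weight only moves once per group of accumulation steps, which is why each individual update is smoother.

```python
def train_step(w, batches, accum_steps, lr):
    """One effective update: average the gradient over `accum_steps`
    micro-batches, then apply a single weight update.
    Toy model: minimize squared error of w*x against y."""
    grad = 0.0
    for x, y in batches[:accum_steps]:
        # gradient of (w*x - y)^2 with respect to w
        grad += 2 * (w * x - y) * x
    grad /= accum_steps          # average, as most trainers do
    return w - lr * grad         # weights change once per accum_steps batches

# made-up data where the true relationship is y = 3*x
data = [(1.0, 3.0), (2.0, 6.0), (4.0, 12.0), (0.5, 1.5)]
w = 0.0
for _ in range(200):
    w = train_step(w, data, accum_steps=4, lr=0.01)
print(round(w, 3))  # converges toward 3.0
```

The same idea scales up in real trainers: the loss is backpropagated every micro-batch, but `optimizer.step()` only runs every `accum_steps` batches, so the effective batch size (and usable learning rate) goes up without extra VRAM.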
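For the rank-trimming suggestion, the underlying trick is a truncated SVD of each module's low-rank update; this is a standalone sketch of that idea in NumPy (the function and variable names here are illustrative, not the actual kohya `resize_lora` implementation):

```python
import numpy as np

def trim_lora_rank(down, up, new_rank):
    """Reduce one LoRA module (update = up @ down) to a smaller rank
    by keeping only the strongest singular directions."""
    delta = up @ down                         # full low-rank update, (out, in)
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    new_up = U[:, :new_rank] * S[:new_rank]   # fold singular values into `up`
    new_down = Vt[:new_rank, :]
    return new_down, new_up

rng = np.random.default_rng(0)
down = rng.standard_normal((32, 64))   # rank-32 LoRA on a hypothetical 64->16 layer
up = rng.standard_normal((16, 32))
d8, u8 = trim_lora_rank(down, up, new_rank=8)
err = np.linalg.norm(up @ down - u8 @ d8) / np.linalg.norm(up @ down)
print(err)  # relative error of the rank-8 approximation
```

The small singular directions that get dropped are exactly the "extra data" mentioned above — weak, noisy components of the update — which is why the trimmed LoRA often keeps the key learned aspects while shedding clutter.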