[Feature Request] Separate merge ratio, downscaling and stride per module. #44
Comments
Thank you for the suggestion! I never thought about applying different ratios to the different modules. I wonder, though: do you get any speed benefit from doing this? In my testing, the MLP and cross-attention contributed almost nothing to the overall time taken, so merging them wasn't useful. Then again, I only tested this at 512x512, so I don't know about higher resolutions.
In my testing it did have a small performance benefit, a 0.xx it/s to maybe 1 it/s difference, but I think every small boost matters, no matter how insignificant. From what I noticed, attention has the biggest impact, the MLP the second most, and cross-attention very little. The main benefit is that sometimes merging the MLP/cross-attention along with the attention can fix hands and faces, or just produce interesting variations. So, a small boost to performance, I'd say, but the additional control is good to have for experimenting with generations and possibly fixing them up a bit without needing inpainting/adetailer/tile resample. The reason I thought of this is that cross-attention has no downsides when merged at 0.9 (might as well merge it), while attention starts getting weird at about 0.6-0.7 and the MLP even sooner. That, and because I've gotten very nice generations just by toying with the ratios, downscaling and stride. It'd be nice to be able to compress as much as I can and have individual control over each module, so that I can explore further.
To update on this: I've done a bunch more testing, and it does provide very nice generations, fixing random details, improving compositions, or simply improving faces/hands. As for the limits? Not only is it possible to merge cross-attention at 0.9, you can also downscale it by 8x if desired, and the image-quality degradation is often unnoticeable; in fact it is good for variations and possibly for fixing details. I often find success with these values, but naturally it is seed-, model-, VAE-, optimization- and GPU-dependent. If I don't like the result, I tweak the ratio/downscale a bit or generate a couple of variations, since the merging process makes generations on the same seed variably nondeterministic. It's too bad that these modules currently cannot be merged separately.

From what I've noticed, the degradation from higher attention compression can be fixed or mitigated by merging the MLP and/or cross-attention along with it. Even with the limited control from the merging ratio and downscaling being global, I often see what would be a highly degraded generation (when merging only one module) fixed, and with some luck even improved, by merging the other modules; sometimes downscaling helps or produces an even better result.

I hope this can eventually be added. Not necessarily for performance, though it could be: MLP/cross-attention merging (very small performance boosts) can allow for higher attention merging (the main performance booster), since higher ratios degrade generations less when the modules are merged together at the right ratios and downscales. And this is all without even mentioning stride. The variation possibilities are very interesting.
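To make the per-module ratio idea concrete, here is a rough, numpy-only sketch of the kind of bipartite token merging ToMe performs before a module runs. The alternating src/dst split and unweighted averaging are simplifications for illustration, not the actual tomesd code; the point is only how a per-module `ratio` would control the token count each module sees.

```python
# Rough, numpy-only sketch of ToMe-style token merging (not the real tomesd code).
# A per-module "ratio" would decide how many tokens are merged away before
# that module (attn / cross-attn / mlp) runs.
import numpy as np

def merge_tokens(x: np.ndarray, ratio: float) -> np.ndarray:
    """Merge roughly `ratio` of the tokens in x (shape [n, d]) by averaging
    each merged source token into its most similar destination token."""
    n, d = x.shape
    r = min(int(n * ratio), n // 2)  # can't merge more than the src half
    if r <= 0:
        return x

    def unit(t):
        return t / (np.linalg.norm(t, axis=-1, keepdims=True) + 1e-8)

    src, dst = x[::2], x[1::2]           # alternating bipartite split
    sim = unit(src) @ unit(dst).T        # cosine similarity src -> dst
    best_dst = sim.argmax(-1)            # most similar dst for each src token
    best_sim = sim.max(-1)
    merged_idx = np.argsort(-best_sim)[:r]   # r most mergeable src tokens
    kept_idx = np.setdiff1d(np.arange(len(src)), merged_idx)
    dst = dst.copy()
    for i in merged_idx:
        j = best_dst[i]
        dst[j] = (dst[j] + src[i]) / 2   # unweighted average, for simplicity
    return np.concatenate([src[kept_idx], dst], axis=0)
```

With 64 tokens, a ratio of 0.5 leaves 32 tokens while a ratio of 0.2 leaves 52, which is why a high cross-attention ratio plus a moderate attention ratio behaves so differently from one global slider.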
I see, thanks for all the testing! I'm totally on board with adding finer control, but the issue is the interface. Currently, there are a lot of variables even without being able to tune parameters specific to the modules. Actually, if I were to add something like this, I'd let you change the ratios at a per-block level. In the stable diffusion network there are 4 layers at downsample 1, 4 layers at downsample 2, etc., and currently there's no granularity to say, "ok, the first 3 downsample 1 layers should have x% ratio, but the last should have y%". If I'm going to add a "do anything" option, then might as well go all the way, right? What do you think about an optional "config" string, which would let you set the parameters as granularly as you wanted? For instance, one entry could set a specific layer's attn parameter, and to reduce verbosity you could specify multiple layers / modules with the same "operation", setting the merging ratio for all of them at once. Would that be useful? Idk how many people would use that lol.
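A parser for such a config string might look like the sketch below. The `"<layers>.<module>[.<param>]:<value>"` syntax (e.g. `"1.attn:0.5, 2-4.mlp:0.2"`) is invented here for illustration; the actual format was never settled in this thread.

```python
# Hypothetical parser for a per-layer / per-module config string.
# Invented syntax: comma-separated "<layers>.<module>[.<param>]:<value>",
# where <layers> is a single index or an inclusive range "a-b", and
# <param> defaults to "ratio" when omitted.

def parse_tome_config(cfg: str) -> dict:
    """Return a {(layer, module, param): value} mapping for every entry."""
    out = {}
    for entry in cfg.split(","):
        entry = entry.strip()
        if not entry:
            continue
        key, value = entry.split(":")
        parts = key.strip().split(".")
        layers, module = parts[0], parts[1]
        param = parts[2] if len(parts) > 2 else "ratio"
        if "-" in layers:                      # range like "2-4"
            lo, hi = layers.split("-")
            layer_ids = range(int(lo), int(hi) + 1)
        else:
            layer_ids = [int(layers)]
        for lid in layer_ids:
            out[(lid, module, param)] = float(value)
    return out
```

For example, `parse_tome_config("1.attn:0.5, 2-4.mlp:0.2")` expands the range so that layers 2, 3 and 4 each get an MLP ratio of 0.2, which is the "same operation on multiple layers" shorthand described above.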
Yeah, that would be great if implemented, and I think a lot of people who experiment with settings in general would use it. I personally tend to do that because I like finding optimal configurations.
Currently, token merging is done globally using a single slider. From what I've noticed, merging the modules at different ratios would give more optimal results: you get another small speed boost and more control over your generations, and perhaps even improve them, since merging these modules can sometimes have a positive effect on details such as hands and faces, or just produce more interesting results.
Example would be:
Cross-Attention merged at 0.9 (yes, it works perfectly at that ratio in my testing), downscale 2.
Attention merged at 0.3-0.6, downscale 1.
MLP merged at 0.1-0.3, downscale 1.
Downscaling can likely be pushed further when set individually per module; I haven't tested it much.
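The example settings above could be captured in a small per-module structure like the sketch below. This is a hypothetical data layout, not an actual tomesd API: as of this thread, tomesd's `apply_patch()` exposes boolean merge flags per module but only a single global ratio, which is exactly what this request wants to change.

```python
# Hypothetical per-module settings mirroring the example values above.
# Not a real tomesd API; the concrete ratios (0.5 and 0.2) are picked
# from the middle of the 0.3-0.6 and 0.1-0.3 ranges suggested in the issue.
from dataclasses import dataclass

@dataclass
class ModuleMerge:
    ratio: float    # fraction of tokens to merge before this module runs
    downscale: int  # downscale level at which merging applies

per_module = {
    "cross_attn": ModuleMerge(ratio=0.9, downscale=2),  # works even at 0.9
    "attn":       ModuleMerge(ratio=0.5, downscale=1),  # from the 0.3-0.6 range
    "mlp":        ModuleMerge(ratio=0.2, downscale=1),  # from the 0.1-0.3 range
}
```

A patch function taking such a mapping (instead of one global ratio) would give the individual control per module that this issue asks for.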