v3.3.3 - more memory optimisations
Features
- SDNQ quantisation engine for weights and optimisers
- Musubi block swap expanded to cover auraflow, chroma, longcat-image, lumina2, omnigen, pixart, hidream, sana, sd3, and z-image
- Kandinsky5 memory-efficient VAE now used instead of Diffusers' HunyuanVideo implementation (runs on consumer hardware)
- resolution_frames bucket strategy for video training, so that a multi-length dataset is possible with just a single config entry (see the config sketch after this list)
- WebUI: the training configuration wizard now allows setting the number of checkpoints to keep
- Metadata is now written to the model / LoRA checkpoint for the ComfyUI LoRA Auto Trigger Words node to make use of
- OmniGen & Lumina2: TREAD, TwinFlow, and LayerSync
- Qwen Image: experimental tiled attention support that avoids OOM during attention computation (disabled by default; enabling it currently requires a code change)
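For illustration, a single dataloader entry covering clips of mixed lengths might look like the sketch below. Only the strategy name resolution_frames comes from this release; the bucket_strategy key and the other field values are placeholder assumptions in the style of a typical multidatabackend.json entry, so check the dataloader docs for the exact option name.

```json
[
  {
    "id": "mixed-length-clips",
    "type": "local",
    "dataset_type": "video",
    "instance_data_dir": "/datasets/clips",
    "caption_strategy": "textfile",
    "resolution": 480,
    "resolution_type": "pixel_area",
    "bucket_strategy": "resolution_frames"
  }
]
```

Because buckets are keyed on resolution and frame count together, clips of different durations can share this one entry instead of each needing their own dataset definition.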
Bugfixes
- RamTorch
- Now applies to text encoders properly (incl CLIP)
- Extended to support Conv2D and Embedding layers (e.g. SDXL offload)
- Compatibility with Quanto (tested with int2, int4, int8-quanto)
- Reduced system memory use by not calculating gradients when requires_grad=False
- Fixed text encoder memory not being unloaded for Qwen Image
- No more quantize_via=pipeline error when no quantisation is enabled (see the sketch after this list)
- Qwen Image: training with batch size > 1 fixed (text embeds are now padded)
- ROCm: bypass PyTorch bug for building kernels, enabling full Quanto compatibility (int2, int4, int8, fp8)
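For reference, the combination that previously raised the quantize_via error looks roughly like the sketch below: no quantisation level selected while quantisation is routed via the pipeline. The base_model_precision key is an assumption about which option takes the no_change level, and the layout is illustrative rather than the canonical config format.

```json
{
  "base_model_precision": "no_change",
  "quantize_via": "pipeline"
}
```

This combination now works without raising the error, since no quantisation work is actually required when the level is no_change.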
What's Changed
- add metadata for ComfyUI-Lora-Auto-Trigger-Words node by @bghira in #2222
- auraflow: implement musubi block swap by @bghira in #2227
- chroma: implement musubi block swap by @bghira in #2228
- longcat image: implement musubi block swap by @bghira in #2230
- modernise lumina2 implementation with TREAD, block swapping, twinflow and layersync by @bghira in #2231
- modernise omnigen implementation with TREAD, block swapping, twinflow and layersync by @bghira in #2232
- pixart: implement musubi block swap by @bghira in #2233
- add qwen-edit-2511 support, and an edit-v2+ flavour which enables 2511 features on 2509 by @bghira in #2223
- hidream: implement musubi block swap by @bghira in #2234
- sana & sanavideo: implement musubi block swap by @bghira in #2235
- sd3: implement musubi block swap by @bghira in #2236
- z-image turbo & omni: implement musubi block swap by @bghira in #2237
- use kandinsky5 optimised VAE with added temporal roll and chunked conv3d by @bghira in #2229
- when preparing model with offload enabled, do not move to accelerator by @bghira in #2238
- docs: document SIMPLETUNER_JOB_ID env var for webhook job_id by @rafstahelin in #2239
- sdnq quant engine by @bghira in #2225
- fix error str vs int comparison by @bghira in #2241
- fix error when quantize_via=pipeline but no_change level was provided by @bghira in #2242
- ramtorch: when using it for text encoders, do not move to gpu by @bghira in #2244
- add resolution_frames bucket strategy for video datasets so that different lengths can exist in one dataset by @bghira in #2240
- add checkpoints total limit to wizard by @bghira in #2243
- qwen image: fix padding for text embeds by @bghira in #2246
- quanto: fix ROCm compiler error for int2-quanto; fix for RamTorch compatibility by @bghira in #2248
- qwen image: tiled attention fallback when we hit OOM by @bghira in #2249
- ramtorch: fix for gradient memory ballooning; fix text encoder application; extend for Conv2D and Embedding offload by @bghira in #2250
- merge by @bghira in #2251
New Contributors
- @rafstahelin made their first contribution in #2239
Full Changelog: v3.3.2...v3.3.3