
v3.3.3 - more memory optimisations

@bghira released this 24 Dec 15:16 · 843 commits to release since this release · 7fb0126

Features

  • SDNQ quantisation engine for weights and optimisers
  • Musubi block swap expanded to cover auraflow, chroma, longcat-image, lumina2, omnigen, pixart, hidream, sana, sd3, and z-image (see the block-swap sketch after this list)
  • Kandinsky5 memory-efficient VAE now used instead of Diffusers' HunyuanVideo implementation (runs on consumer hardware)
  • resolution_frames bucket strategy for video training, so that multi-length datasets are possible with a single config entry (see the bucketing sketch after this list)
  • WebUI: Training configuration wizard now allows setting the number of checkpoints to keep
  • Metadata is now written to the model / LoRA checkpoint for the ComfyUI LoRA Auto Trigger Words node to make use of
  • OmniGen & Lumina2: TREAD, TwinFlow, and LayerSync
  • Qwen Image: experimental tiled attention support that avoids OOM during the attention calculation (disabled by default; enabling it currently requires editing the code; see the tiled attention sketch below)
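
For context, block swapping keeps only a handful of transformer blocks resident on the accelerator at any one time, moving the rest to system RAM between forward passes. The snippet below is a minimal sketch of that general idea only, with hypothetical class and argument names; the real implementations also prefetch asynchronously and handle the backward pass.

```python
# Minimal sketch of the block-swap idea (forward pass only), not the actual
# musubi-tuner / SimpleTuner implementation. Names here are hypothetical.
import torch.nn as nn

class BlockSwapRunner(nn.Module):
    def __init__(self, blocks: nn.ModuleList, device="cuda", blocks_on_gpu=2):
        super().__init__()
        self.blocks = blocks              # all blocks start on the CPU
        self.device = device
        self.blocks_on_gpu = blocks_on_gpu

    def forward(self, hidden_states):
        for i, block in enumerate(self.blocks):
            block.to(self.device)         # swap the next block onto the GPU
            hidden_states = block(hidden_states)
            if i >= self.blocks_on_gpu - 1:
                # Evict the oldest resident block back to system RAM.
                self.blocks[i - self.blocks_on_gpu + 1].to("cpu")
        return hidden_states
```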
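
The resolution_frames strategy groups video samples by both spatial resolution and frame count, so each batch stays uniform in shape even when the dataset mixes clip lengths. A rough illustration of that bucketing idea follows; the keys and frame lengths are invented for the example and are not SimpleTuner's actual config schema.

```python
# Illustration of bucketing clips by (height, width, frames) so that clips of
# different lengths can coexist in one dataset. Values here are invented.
from collections import defaultdict

SUPPORTED_FRAME_COUNTS = (1, 13, 25, 45, 77)

def bucket_key(height: int, width: int, num_frames: int) -> tuple:
    # Snap the clip's frame count to the nearest supported length so that
    # near-misses land in the same bucket.
    frames = min(SUPPORTED_FRAME_COUNTS, key=lambda f: abs(f - num_frames))
    return (height, width, frames)

def build_buckets(samples):
    buckets = defaultdict(list)
    for sample in samples:
        key = bucket_key(sample["height"], sample["width"], sample["frames"])
        buckets[key].append(sample)
    return buckets

# A sampler can then draw each batch from a single bucket, so the tensors in
# a batch always share one shape and can be stacked directly.
buckets = build_buckets([
    {"height": 480, "width": 832, "frames": 45, "path": "a.mp4"},
    {"height": 480, "width": 832, "frames": 77, "path": "b.mp4"},
])
```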
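
Tiled attention trades a little extra launch overhead for a much smaller peak memory footprint by processing the query sequence in slices. The sketch below shows the general technique only, not the experimental Qwen Image code path; the tile size is an arbitrary illustrative value.

```python
# General sketch of tiled (chunked) attention: attend from one slice of
# queries at a time so a full (seq_len x seq_len) score matrix is never
# materialised at once. Not the project's actual implementation.
import torch
import torch.nn.functional as F

def tiled_attention(q, k, v, tile_size=1024):
    # q, k, v: (batch, heads, seq_len, head_dim)
    outputs = []
    for start in range(0, q.shape[2], tile_size):
        q_tile = q[:, :, start:start + tile_size]
        # Each call only ever scores tile_size queries against the full keys.
        outputs.append(F.scaled_dot_product_attention(q_tile, k, v))
    return torch.cat(outputs, dim=2)
```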

Bugfixes

  • RamTorch
    • Now applies to text encoders properly (including CLIP)
    • Extended to support Conv2D and Embedding layers (e.g. SDXL offload)
    • Compatibility with Quanto (tested with int2, int4, int8-quanto)
    • Reduced system memory use by skipping gradient calculation when requires_grad=False
  • Fixed text encoder memory not being unloaded for Qwen Image
  • Fixed the quantize_via=pipeline error that occurred when no quantisation was enabled
  • Fixed Qwen Image training with batch size > 1 (text embeds are now padded)
  • ROCm: bypass a PyTorch bug when building kernels, enabling full Quanto compatibility (int2, int4, int8, fp8)

What's Changed

  • add metadata for ComfyUI-Lora-Auto-Trigger-Words node by @bghira in #2222
  • auraflow: implement musubi block swap by @bghira in #2227
  • chroma: implement musubi block swap by @bghira in #2228
  • longcat image: implement musubi block swap by @bghira in #2230
  • modernise lumina2 implementation with TREAD, block swapping, twinflow and layersync by @bghira in #2231
  • modernise omnigen implementation with TREAD, block swapping, twinflow and layersync by @bghira in #2232
  • pixart: implement musubi block swap by @bghira in #2233
  • add qwen-edit-2511 support, and an edit-v2+ flavour which enables 2511 features on 2509 by @bghira in #2223
  • hidream: implement musubi block swap by @bghira in #2234
  • sana & sanavideo: implement musubi block swap by @bghira in #2235
  • sd3: implement musubi block swap by @bghira in #2236
  • z-image turbo & omni: implement musubi block swap by @bghira in #2237
  • use kandinsky5 optimised VAE with added temporal roll and chunked conv3d by @bghira in #2229
  • when preparing model with offload enabled, do not move to accelerator by @bghira in #2238
  • docs: document SIMPLETUNER_JOB_ID env var for webhook job_id by @rafstahelin in #2239
  • sdnq quant engine by @bghira in #2225
  • fix error str vs int comparison by @bghira in #2241
  • fix error when quantize_via=pipeline but no_change level was provided by @bghira in #2242
  • ramtorch: when using it for text encoders, do not move to gpu by @bghira in #2244
  • add resolution_frames bucket strategy for video datasets so that different lengths can exist in one dataset by @bghira in #2240
  • add checkpoints total limit to wizard by @bghira in #2243
  • qwen image: fix padding for text embeds by @bghira in #2246
  • quanto: fix ROCm compiler error for int2-quanto; fix for RamTorch compatibility by @bghira in #2248
  • qwen image: tiled attention fallback when we hit OOM by @bghira in #2249
  • ramtorch: fix for gradient memory ballooning; fix text encoder application; extend for Conv2D and Embedding offload by @bghira in #2250
  • merge by @bghira in #2251

New Contributors

  • @rafstahelin made their first contribution in #2239

Full Changelog: v3.3.2...v3.3.3