Add Qwen-Image-Layered support for image decomposition into RGBA layers #302
ZimengXiong wants to merge 3 commits into filipstrand:main from
Conversation
This PR adds support for the Qwen-Image-Layered model, which decomposes an input image into semantically disentangled RGBA layers for layer-based editing workflows.

## Features

- New CLI command: `mflux-generate-qwen-layered`
- Decomposes images into N RGBA layers (default 4)
- Supports 6-bit quantization for ~29GB memory usage (vs 55GB BF16)
- Resolution buckets: 640 and 1024

## Architecture

- RGBA-VAE (4-channel) with 3D temporal convolutions for layer handling
- Layer3D RoPE: 3D positional encoding [layer, height, width]
- Uses base QwenTransformer with extended RoPE for multi-layer sequences
- Condition image encoded as layer_index=-1 for proper decomposition

## New Files

- `src/mflux/models/qwen_layered/` - Full model implementation
- `model/qwen_layered_vae/` - RGBA-VAE encoder/decoder
- `model/qwen_layered_transformer/` - Layer3D RoPE
- `weights/` - Weight mapping and definitions
- `variants/i2l/` - Image-to-Layers pipeline
- `cli/` - Command-line interface

## Usage

```sh
mflux-generate-qwen-layered \
  --image input.png \
  --layers 4 \
  --steps 50 \
  -q 6 \
  --output-dir ./layers
```

## Documentation

Added comprehensive documentation to README.md including:

- TOC entry
- CLI argument reference
- Usage examples and tips
- Memory requirements

Tested on M4 Max 48GB with 6-bit quantization.
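The Layer3D RoPE scheme described above can be illustrated with a small sketch. This is a hypothetical helper, not the actual mflux code: each latent token gets a (layer, height, width) position triple, with the condition image assigned layer_index=-1, mirroring the convention stated in the PR description.

```python
import numpy as np

def layer3d_position_ids(num_layers: int, height: int, width: int) -> np.ndarray:
    """Build (layer, h, w) position triples for every latent token.

    Layers 0..num_layers-1 are the decomposed RGBA layers; the condition
    image is encoded as layer_index=-1. Illustrative only, not the actual
    mflux implementation.
    """
    positions = []
    for layer in [-1] + list(range(num_layers)):  # -1 = condition image
        for h in range(height):
            for w in range(width):
                positions.append((layer, h, w))
    return np.array(positions)

# Tiny 2x2 latent grid with 4 output layers plus the condition image
ids = layer3d_position_ids(num_layers=4, height=2, width=2)
print(ids.shape)  # (4 layers + 1 condition) * 2 * 2 = 20 tokens, 3 coords each
```

The three coordinates would then feed three RoPE frequency bands, so attention can distinguish tokens that share the same spatial position but belong to different layers.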
@ZimengXiong Really cool work! Have you compared this implementation directly to Diffusers? For example, for the same initial latent array, does it generate similar looking images?
No, I have not. I haven't gotten Diffusers to work with quantized models yet (I'm limited on RAM, only 48GB), and Diffusers doesn't support true 8/4-bit quantization on Mac. I could compare with [ComfyUI-GGUF](https://github.com/city96/ComfyUI-GGUF) using some of the quantized GGUFs out there, like the ComfyUI ones. I'm out right now; I'll take a look later this week. It's also REALLY slow because the layers exponentially increase compute time: around 45 min per 50 iterations on an M4 Max 40c.
I feel your pain :) Currently I only have 32GB, but I'm waiting for a 256GB M3 Ultra machine which will hopefully arrive in 2 weeks or so; then I'll be able to try this out more properly.
You should have gotten a few more by my reckoning.. :D I can't wait. I suspect there could be many possible improvements for Qwen and Z-Image. You did well to get that cut; it seems like an excellent compromise. I can't figure out if it's MLX being 'unripe' for handling Qwen-Image and Z-Image, but with ComfyUI the times are more or less similar, with the difference that I can use higher resolutions in the same time and configuration on MPS.
@azrahello Interesting. I'd like to hear from you whether the new VAE tiling strategy helps with this once it is released in v0.14.
Wow @ZimengXiong - amazing contribution!
Please merge this. The quality is very good. But you must also add a layers re-merge process. That way, small errors like the hands will be easily fixed in an img2img pass using Qwen Edit.
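The re-merge step suggested here can be sketched with plain alpha compositing (a minimal illustration using Pillow; the assumption that `layer_0.png` is the back-most layer is mine and may not match the model's actual ordering):

```python
from PIL import Image

def merge_layers(layers):
    """Alpha-composite RGBA layers back into one image, back to front."""
    base = layers[0].convert("RGBA")
    for layer in layers[1:]:
        base = Image.alpha_composite(base, layer.convert("RGBA"))
    return base

# Demo with two synthetic layers; with real CLI output you would instead do:
# merged = merge_layers([Image.open(f"layers/layer_{i}.png") for i in range(4)])
background = Image.new("RGBA", (64, 64), (20, 20, 20, 255))
subject = Image.new("RGBA", (64, 64), (200, 50, 50, 128))  # semi-transparent
merged = merge_layers([background, subject])
print(merged.getpixel((0, 0)))
```

A light img2img pass over `merged` would then blend the seams, as suggested above.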
@Emasoft I appreciate your take. This makes me reconsider it. I've been putting this off a bit since I've not seen enough "buzz" about it and have been too busy to evaluate it myself (plus the fact that I've been trying to resist the urge to accumulate too many weights, another 57GB in this case I think 😅). But, of course, this is not a super strong argument to reject it. It sounds really useful for your use case, for example.

One thing that would help in assessing the quality of the implementation: do you know if it compares well with the diffusers implementation? E.g., if both produce similar artifacts (like those discussed above), then I'm much more likely to consider merging it without too much review effort; at least then we know it is due to the model itself and not the MLX implementation. Otherwise, I'm afraid this will require some A/B testing on my end that I unfortunately won't have time to prioritise in the near future.

The img2img use case you mention sounds nice; I didn't think of that, but it makes sense now that you say it.
@filipstrand Yes, the img2img step is necessary to "blend" the layers back together; just a minimal amount will do. And with a good img2img model, it will also fix the hands, feet, eyes, etc. automatically. Repeat this loop for 100 random images (or actually 400 random images, grouped by 4), and the number of passes gives you a percentage score you can use to evaluate the whole thing. Note that layer C must be a small object in the middle of the image (for example, a 200x100 human figure in a 1024x1024 image with a transparent background).
@Emasoft As you say, I probably have to update my priors a little bit about what can be automated. Having coding agents actually inspect the image itself (or write a bunch of similarity metrics) starts to feel very doable at this point. At the same time, I also believe that as implementations themselves get easier, the verification (manual or automatic) becomes a bigger part of the overall contribution, which I consider included in the PR. It is still much easier to review and accept a PR that has been clearly demonstrated to work in several cases. And this is also a great opportunity for community effort: for example, if one person contributes an implementation, another person might contribute "only" by testing it; but this work is indeed very important, and will become even more so over time, I believe.

That being said, I think I can do a better job communicating this via contributor guidelines or similar (e.g., highlighting what I typically look for when assessing correctness). Sometimes it might not actually be that extensive (for example, showing a few side-by-sides with diffusers greatly increases my confidence in the correctness of an implementation).
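An automated similarity check along the lines discussed here could look like the following sketch. It assumes both implementations can be made to start from the same initial latent array, and uses mean absolute per-pixel difference as a (deliberately simple, hypothetical) metric:

```python
import numpy as np

def mean_abs_diff(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Mean absolute per-pixel difference between two uint8 images, in [0, 255]."""
    if img_a.shape != img_b.shape:
        raise ValueError("images must have the same shape")
    return float(np.mean(np.abs(img_a.astype(np.int16) - img_b.astype(np.int16))))

# Toy example: in practice img_a / img_b would be the mflux and diffusers
# outputs decoded from the same initial latent array.
rng = np.random.default_rng(0)
img_a = rng.integers(0, 256, size=(64, 64, 4), dtype=np.uint8)
img_b = img_a.copy()
img_b[0, 0, 0] ^= 1  # flip one bit so the images differ slightly

print(mean_abs_diff(img_a, img_a))  # identical images -> 0.0
print(mean_abs_diff(img_a, img_b))
```

A near-zero score would suggest any remaining artifacts come from the model itself rather than the MLX port; a perceptual metric would be a natural next step.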

Output: 4 RGBA PNG files (`layer_0.png`, `layer_1.png`, etc.) with transparency.

Requires local weights from `Qwen/Qwen-Image-Layered`.

Closes #299

Quantized weights at https://huggingface.co/zimengxiong/Qwen-Image-Layered-6bit