
Add Qwen-Image-Layered support for image decomposition into RGBA layers#302

Open
ZimengXiong wants to merge 3 commits into filipstrand:main from ZimengXiong:main

Conversation


ZimengXiong commented Dec 22, 2025

Add Qwen-Image-Layered support for image decomposition into RGBA layers

This PR adds support for the Qwen-Image-Layered model, which decomposes an input image into semantically disentangled RGBA layers for layer-based editing workflows.

  • New CLI mflux-generate-qwen-layered
  • Decomposes images into N RGBA layers (default 4)
  • Supports 4, 6, and 8-bit quantization (~29GB with 6-bit vs ~55GB BF16)
  • Resolution buckets: 640 and 1024

Implements

  • RGBA-VAE (4-channel) with 3D temporal convolutions for layer handling
  • Layer3D RoPE: 3D positional encoding [layer, height, width]
  • Uses base QwenTransformer with extended RoPE for multi-layer sequences
  • Condition image encoded with layer_index=-1 for proper decomposition
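The Layer3D RoPE indexing described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: each latent token gets a (layer, y, x) position triple, with the condition image encoded at layer index -1 ahead of the N generated layers.

```python
import numpy as np

def layer3d_positions(num_layers: int, h: int, w: int) -> np.ndarray:
    """Hypothetical sketch of Layer3D RoPE position indexing.

    Each token gets a (layer, y, x) triple; the condition image is
    encoded at layer_index = -1, followed by generated layers 0..N-1.
    """
    triples = []
    for layer in [-1, *range(num_layers)]:  # -1 = condition image
        for y in range(h):
            for x in range(w):
                triples.append((layer, y, x))
    return np.array(triples)

pos = layer3d_positions(num_layers=4, h=2, w=2)
# (1 condition + 4 layers) * 2 * 2 tokens = 20 position triples
print(pos.shape)  # prints (20, 3)
```

The rotary embedding itself would then be built per axis from these indices, analogous to how 2D RoPE handles (height, width) in the base QwenTransformer.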

Usage

mflux-generate-qwen-layered \
  --image input.png \
  --layers 4 \
  --steps 50 \
  -q 6 \
  --output-dir ./layers

Output: 4 RGBA PNG files (layer_0.png, layer_1.png, etc.) with transparency.
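A quick way to sanity-check that the output files really are RGBA layers (a minimal sketch using Pillow; the file names follow the pattern described above):

```python
from PIL import Image

def check_rgba_layers(layers):
    """Verify each decomposed layer carries an alpha channel.

    Returns the (min, max) alpha value per layer; layers with real
    transparency should not report a uniform (255, 255).
    """
    ranges = []
    for img in layers:
        assert img.mode == "RGBA", f"expected RGBA, got {img.mode}"
        ranges.append(img.getchannel("A").getextrema())
    return ranges

# e.g. with the CLI output from above:
# layers = [Image.open(f"./layers/layer_{i}.png") for i in range(4)]
# print(check_rgba_layers(layers))
```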

Requires local weights from Qwen/Qwen-Image-Layered:

  • ~55GB for full BF16 model
  • ~29GB with 6-bit quantization

Closes #299


Quantized weights at https://huggingface.co/zimengxiong/Qwen-Image-Layered-6bit

## New Files
- `src/mflux/models/qwen_layered/` - Full model implementation
  - `model/qwen_layered_vae/` - RGBA-VAE encoder/decoder
  - `model/qwen_layered_transformer/` - Layer3D RoPE
  - `weights/` - Weight mapping and definitions
  - `variants/i2l/` - Image-to-Layers pipeline
  - `cli/` - Command-line interface
## Documentation
Added comprehensive documentation to README.md including:
- TOC entry
- CLI argument reference
- Usage examples and tips
- Memory requirements

Tested on M4 Max 48GB with 6-bit quantization.
@filipstrand
Owner

@ZimengXiong Really cool work! Have you compared this implementation directly to Diffusers? For example, for the same initial latent array, does it generate similar looking images?

@ZimengXiong
Author

ZimengXiong commented Dec 22, 2025 via email

@filipstrand
Owner

I feel your pain :) Currently I only have 32GB, but I'm waiting for a 256GB M3 Ultra machine which will hopefully arrive in 2 weeks or so; then I'll be able to try this out more properly.

@azrahello
Contributor

I feel your pain :) Currently I only have 32GB, but I'm waiting for a 256GB M3 Ultra machine which will hopefully arrive in 2 weeks or so; then I'll be able to try this out more properly.

You should have gotten a few more by my reckoning.. :D I can't wait; I suspect there could be many possible improvements for Qwen and Z-Image. You did well to get that cut, it seems like an excellent compromise. I can't figure out whether it's MLX being 'unripe' for handling Qwen-Image and Z-Image, but with ComfyUI the times are more or less similar, with the difference that I can use higher resolutions in the same time and configuration on MPS

@filipstrand
Owner

with the difference that I can use higher resolutions in the same time and configuration on MPS

@azrahello Interesting, would like to hear from you if the new VAE tiling strategy helps with this once it is released in v0.14.

@anthonywu
Collaborator

Wow @ZimengXiong - amazing contribution!

@filipstrand
Owner

My initial aim for v0.15 was to review and include this as well as the other qwen-image models (2511 etc), but since Klein was recently released, I chose to prioritise that. Also, I haven't read a lot of buzz regarding the layered model and how it is holding up (I still haven't had the time to play around with it yet), so I'm not 100% sure if it is worth investing in atm.

E.g. with this image, I think this shows promise, but tbh the extraction is probably a bit too low quality for anyone to actually use? (I'm thinking about the hands, for example. But maybe this varies a lot across images?)
[Screenshot: example layer extraction]

@Emasoft

Emasoft commented Feb 25, 2026

Please merge this. The quality is very good. But you must also add the layers re-merge process.
The full pipeline should look like this:

  1. original image --> N layers images
  2. edit the layer image in any graphic program (e.g. Gimp, Photoshop, etc.)
  3. N layers images --> recomposed image (simple alpha compositing + img2img pass using qwen edit)

This way you can:

  • move things around
  • swap things
  • regenerate only specific things
  • insert objects into scenes behind other things
  • add or change logos or text on objects
  • change the layout of a page or illustration
  • remove obstructions without having to recreate the whole image
  • add people in a crowd or inside buildings (behind windows, bars, etc.)

The small errors like the hands will be easily fixed in the img2img pass using qwen edit.
Mflux should implement the command for step 1 and the command for step 3.
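Step 3's "simple alpha compositing" is the standard back-to-front "over" operator; a minimal NumPy sketch, assuming straight (non-premultiplied) alpha and layer 0 as the back-most layer (an assumption, since the model's actual layer ordering may differ):

```python
import numpy as np

def composite_layers(layers):
    """Alpha-composite RGBA layers back-to-front ("over" operator).

    layers: list of float arrays of shape (H, W, 4) with values in
    [0, 1], ordered back-most first (assumption; the model's actual
    layer order may differ).
    """
    out = np.zeros_like(layers[0])
    for layer in layers:
        rgb, a = layer[..., :3], layer[..., 3:4]
        out_rgb, out_a = out[..., :3], out[..., 3:4]
        # "over": a_new = a_fg + a_bg * (1 - a_fg)
        new_a = a + out_a * (1 - a)
        new_rgb = (rgb * a + out_rgb * out_a * (1 - a)) / np.maximum(new_a, 1e-8)
        out = np.concatenate([new_rgb, new_a], axis=-1)
    return out
```

An img2img pass (the second half of step 3) would then run on the composited result to blend the seams.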

@filipstrand
Owner

@Emasoft I appreciate your take. This makes me reconsider it. I've been putting this off a bit since I've not seen enough "buzz" about it and have been too busy to evaluate it myself (plus the fact that I've been trying to resist the urge to accumulate too many weights, another 57GB in this case I think 😅).

But, of course, this is not a super strong argument to reject it. It sounds really useful for your use case, for example. One thing that would help in assessing the quality of the implementation: do you know if it compares well with the diffusers implementation? E.g. if both produce similar artifacts (like those discussed above), then I'm much more likely to consider merging it without too much review effort; at least then we know it is due to the model itself and not the MLX implementation. Otherwise, I'm afraid this will require some A/B testing from my end that I unfortunately won't have time to prioritise in the near future.

The img2img use case you mention sounds nice, didn't think of that, but it makes sense now that you say it.
I have not tried this myself, but I guess you sacrifice a bit of the idea of "pixel perfect" editing of the original photo, while still maintaining the overall composition of the new edit, plus gaining quality since the last time-steps are regenerated at the end?

@Emasoft

Emasoft commented Mar 3, 2026

@filipstrand Yes, the img2img step is necessary to "blend" the layers back together, just a minimal amount will do. And with a good img2img model, it will also fix the hands, feet, eyes, etc. automatically.
For the review effort, you are underestimating Claude Code. It can easily set up an image comparison pipeline and do all the comparisons for you. Just tell it to start with the reverse process: create the separate images with an alpha-transparent background, merge them into a single image, ask Qwen-Layered to split them, and do a first comparison with a FLIP similarity-score algorithm (tell it to do it like this and it will set up everything). The same for the re-merge phase, making two versions of the merged image every time, with one layer element shifted randomly by 100-200px for the second-phase comparison. Example:

# Phase 1
Claude creates the layers:  
A
B
C
D
Claude Merges them in:
P
Qwen-Layered splits them back:
A'
B'
C'
D'
Claude compares them (FLIP > 90% to pass):
A = A'? 
B = B'?
C = C'?
D = D'?
Claude's verdict for the SPLIT PHASE is 'pass' if all 4 > 90%

# Phase 2
Claude offsets the original C layer by 100px up (or any random direction):
C --> Co
Claude merges the 4 original layers back into one image (Po):
A+B+Co+D = Po
Claude offsets the extracted C layer by 100px up (same direction used above by Co):
C' --> Co'
Claude merges the 4 extracted layers into Po':
A'+B'+Co'+D' = Po'
Claude runs an img2img for Po' and gets Po'':
Po' ---> img2img ---> Po''
Claude compares Po with Po'' (FLIP > 90% to pass):
Po = Po''?
Claude's verdict on the MERGE PHASE is 'pass' if FLIP  > 90%

Repeat this loop for 100 random images (or actually 400 random images, grouped by 4) and the number of passes is the percentage score you can use to evaluate the whole thing. Note that layer C must be a small object in the middle of the image (for example, a 200x100 human figure in a 1024x1024 image with a transparent background).
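The phase-1 verdict above is easy to automate. A minimal sketch with a stand-in similarity metric (plain inverted mean absolute error; the real protocol would plug NVIDIA's FLIP metric into the same slot, and the 90% threshold is the one proposed in the comment):

```python
import numpy as np

def similarity_pct(a: np.ndarray, b: np.ndarray) -> float:
    """Stand-in similarity score in [0, 100] for two same-shape float
    images with values in [0, 1]; swap in FLIP for the actual protocol."""
    return float(1.0 - np.abs(a - b).mean()) * 100.0

def split_phase_verdict(originals, extracted, threshold=90.0):
    """Phase 1: compare each original layer A..D with its extracted
    counterpart A'..D'; pass only if every score clears the threshold."""
    scores = [similarity_pct(o, e) for o, e in zip(originals, extracted)]
    return all(s > threshold for s in scores), scores
```

Phase 2 would reuse the same scoring on the shifted-and-recomposited images Po and Po''.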

@filipstrand
Owner

@Emasoft As you say, I probably have to update my priors a little bit about what can be automated. Having coding agents actually inspect the image itself (or write a bunch of similarity metrics) starts to feel very doable at this point.

At the same time, I also believe that as implementations themselves get easier, the verification (manual or automatic) becomes a bigger part of the overall contribution, which I consider included in the PR. It is still much easier to review and accept a PR that has been clearly demonstrated to work in several cases. And this is also a great opportunity for community effort. For example, if one person contributes an implementation, another person might contribute "only" by testing it; but this work is indeed very important, and will become even more so over time, I believe.

This being said, I think I can do a better job communicating this via contributor guidelines or similar (e.g. highlighting what I typically look for when assessing correctness, etc.). Sometimes it might not actually be that extensive (for example, showing a few side-by-sides with diffusers greatly increases my confidence in the correctness of an implementation).



Development

Successfully merging this pull request may close these issues.

[New Model] Qwen-Image-Layered
