
Add Qwen-Image-Layered support for image decomposition into RGBA layers#302

Open
ZimengXiong wants to merge 3 commits into filipstrand:main from ZimengXiong:main

Conversation


ZimengXiong commented Dec 22, 2025

Add Qwen-Image-Layered support for image decomposition into RGBA layers

This PR adds support for the Qwen-Image-Layered model, which decomposes an input image into semantically disentangled RGBA layers for layer-based editing workflows.

  • New CLI mflux-generate-qwen-layered
  • Decomposes images into N RGBA layers (default 4)
  • Supports 4, 6, and 8-bit quantization (~29GB with 6-bit vs ~55GB BF16)
  • Resolution buckets: 640 and 1024

Implements

  • RGBA-VAE (4-channel) with 3D temporal convolutions for layer handling
  • Layer3D RoPE: 3D positional encoding [layer, height, width]
  • Uses base QwenTransformer with extended RoPE for multi-layer sequences
  • Condition image encoded with layer_index=-1 for proper decomposition
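The Layer3D RoPE indexing described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: each latent token gets a (layer, y, x) position triple, with the condition image encoded at layer index -1 ahead of the N generated layers.

```python
import numpy as np

def layer3d_positions(num_layers: int, h: int, w: int) -> np.ndarray:
    """Hypothetical sketch of Layer3D RoPE position indexing.

    Each token gets a (layer, y, x) triple; the condition image is
    encoded at layer_index = -1, followed by generated layers 0..N-1.
    """
    triples = []
    for layer in [-1, *range(num_layers)]:  # -1 = condition image
        for y in range(h):
            for x in range(w):
                triples.append((layer, y, x))
    return np.array(triples)

pos = layer3d_positions(num_layers=4, h=2, w=2)
# (1 condition + 4 layers) * 2 * 2 tokens = 20 position triples
print(pos.shape)  # prints (20, 3)
```

The rotary embedding itself would then be built per axis from these indices, analogous to how 2D RoPE handles (height, width) in the base QwenTransformer.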

Usage

mflux-generate-qwen-layered \
  --image input.png \
  --layers 4 \
  --steps 50 \
  -q 6 \
  --output-dir ./layers

Output: 4 RGBA PNG files (layer_0.png, layer_1.png, etc.) with transparency.
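A quick way to sanity-check that the output files really are RGBA layers (a minimal sketch using Pillow; the file names follow the pattern described above):

```python
from PIL import Image

def check_rgba_layers(layers):
    """Verify each decomposed layer carries an alpha channel.

    Returns the (min, max) alpha value per layer; layers with real
    transparency should not report a uniform (255, 255).
    """
    ranges = []
    for img in layers:
        assert img.mode == "RGBA", f"expected RGBA, got {img.mode}"
        ranges.append(img.getchannel("A").getextrema())
    return ranges

# e.g. with the CLI output from above:
# layers = [Image.open(f"./layers/layer_{i}.png") for i in range(4)]
# print(check_rgba_layers(layers))
```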

Requires local weights from Qwen/Qwen-Image-Layered:

  • ~55GB for full BF16 model
  • ~29GB with 6-bit quantization

Closes #299


Quantized weights at https://huggingface.co/zimengxiong/Qwen-Image-Layered-6bit

## New Files
- `src/mflux/models/qwen_layered/` - Full model implementation
  - `model/qwen_layered_vae/` - RGBA-VAE encoder/decoder
  - `model/qwen_layered_transformer/` - Layer3D RoPE
  - `weights/` - Weight mapping and definitions
  - `variants/i2l/` - Image-to-Layers pipeline
  - `cli/` - Command-line interface
## Documentation
Added comprehensive documentation to README.md including:
- TOC entry
- CLI argument reference
- Usage examples and tips
- Memory requirements

Tested on M4 Max 48GB with 6-bit quantization.
@filipstrand
Owner

@ZimengXiong Really cool work! Have you compared this implementation directly to Diffusers? For example, for the same initial latent array, does it generate similar looking images?

@ZimengXiong
Author

ZimengXiong commented Dec 22, 2025 via email

@filipstrand
Owner

I feel your pain :) Currently I only have 32GB, but I'm waiting for a 256GB M3 Ultra machine which will hopefully arrive in 2 weeks or so; then I'll be able to try this out more properly.

@azrahello
Contributor

I feel your pain :) Currently I only have 32GB, but I'm waiting for a 256GB M3 Ultra machine which will hopefully arrive in 2 weeks or so; then I'll be able to try this out more properly.

You should have gotten a few more by my reckoning.. :D I can't wait; I suspect there could be many possible improvements for Qwen and Z-Image. You did well to get that cut, it seems like an excellent compromise. I can't figure out whether it's MLX being 'unripe' for handling Qwen-Image and Z-Image, but with ComfyUI the times are more or less similar, with the difference that I can use higher resolutions in the same time and configuration on MPS

@filipstrand
Owner

with the difference that I can use higher resolutions in the same time and configuration on MPS

@azrahello Interesting, would like to hear from you if the new VAE tiling strategy helps with this once it is released in v0.14.

@anthonywu
Collaborator

Wow @ZimengXiong - amazing contribution!

@filipstrand
Owner

My initial aim for v0.15 was to review and include this as well as the other qwen-image models (2511 etc), but since Klein was recently released, I chose to prioritise that. Also, I haven't read a lot of buzz regarding the layered model and how it is holding up (I still haven't had the time to play around with it yet), so I'm not 100% sure if it is worth investing in atm.

E.g. with this image, I think this shows promise, but tbh the extraction is probably a bit too low quality for anyone to actually use? (I'm thinking about the hands, for example. But maybe this varies a lot across images?)
[Screenshot: example layer extraction]

@Emasoft

Emasoft commented Feb 25, 2026

Please merge this. The quality is very good. But you must also add the layers re-merge process.
The full pipeline should look like this:

  1. original image --> N layers images
  2. edit the layer image in any graphic program (e.g. Gimp, Photoshop, etc.)
  3. N layers images --> recomposed image (simple alpha compositing + img2img pass using qwen edit)

This way you can:

  • move things around
  • swap things
  • regenerate only specific things
  • insert objects into scenes behind other things
  • add or change logos or text on objects
  • change the layout of a page or illustration
  • remove obstructions without having to recreate the whole image
  • add people in a crowd or inside buildings (behind windows, bars, etc.)

The small errors like the hands will be easily fixed in the img2img pass using qwen edit.
Mflux should implement the command for step 1 and the command for step 3.
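Step 3's "simple alpha compositing" is the standard back-to-front "over" operator; a minimal NumPy sketch, assuming straight (non-premultiplied) alpha and layer 0 as the back-most layer (an assumption, since the model's actual layer ordering may differ):

```python
import numpy as np

def composite_layers(layers):
    """Alpha-composite RGBA layers back-to-front ("over" operator).

    layers: list of float arrays of shape (H, W, 4) with values in
    [0, 1], ordered back-most first (assumption; the model's actual
    layer order may differ).
    """
    out = np.zeros_like(layers[0])
    for layer in layers:
        rgb, a = layer[..., :3], layer[..., 3:4]
        out_rgb, out_a = out[..., :3], out[..., 3:4]
        # "over": a_new = a_fg + a_bg * (1 - a_fg)
        new_a = a + out_a * (1 - a)
        new_rgb = (rgb * a + out_rgb * out_a * (1 - a)) / np.maximum(new_a, 1e-8)
        out = np.concatenate([new_rgb, new_a], axis=-1)
    return out
```

An img2img pass (the second half of step 3) would then run on the composited result to blend the seams.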

@filipstrand
Owner

@Emasoft I appreciate your take. This makes me reconsider it. I've been putting this off a bit since I've not seen enough "buzz" about it and have been too busy to evaluate it myself (plus the fact that I've been trying to resist the urge to accumulate too many weights, another 57GB in this case I think 😅).

But, of course, this is not a super strong argument to reject it. It sounds really useful for your use case, for example. One thing that would help in assessing the quality of the implementation: do you know if it compares well with the diffusers implementation? E.g. if both produce similar artifacts (like those discussed above), then I'm much more likely to consider merging it without too much review effort; at least then we know it is due to the model itself and not the MLX implementation. Otherwise, I'm afraid this will require some A/B testing from my end that I unfortunately won't have time to prioritise in the near future.

The img2img use case you mention sounds nice, didn't think of that, but it makes sense now that you say it.
I have not tried this myself, but I guess you sacrifice a bit of the idea of "pixel perfect" editing of the original photo, while still maintaining the overall composition of the new edit, plus gaining quality since the last time-steps are regenerated at the end?

@Emasoft

Emasoft commented Mar 3, 2026

@filipstrand Yes, the img2img step is necessary to "blend" the layers back together, just a minimal amount will do. And with a good img2img model, it will also fix the hands, feet, eyes, etc. automatically.
For the review effort, you are underestimating Claude Code. It can easily set up an image comparison pipeline and do all the comparisons for you. Just tell it to start with the reverse process: create the separate images with an alpha-transparent background, merge them into a single image, ask Qwen-Layered to split them, and do a first comparison with a FLIP similarity-score algorithm (tell it to do it like this and it will set up everything). The same for the re-merge phase, making two versions of the merged image every time, with one layer element shifted randomly by 100-200px for the second-phase comparison. Example:

# Phase 1
Claude creates the layers:  
A
B
C
D
Claude Merges them in:
P
Qwen-Layered splits them back:
A'
B'
C'
D'
Claude compares them (FLIP > 90% to pass):
A = A'? 
B = B'?
C = C'?
D = D'?
Claude's verdict for the SPLIT PHASE is 'pass' if all 4 > 90%

# Phase 2
Claude offsets the original C layer by 100px up (or any random direction):
C --> Co
Claude merges the 4 original layers back into one image (Po):
A+B+Co+D = Po
Claude offsets the extracted C layer by 100px up (same direction used above by Co):
C' --> Co'
Claude merges the 4 extracted layers into Po':
A'+B'+Co'+D' = Po'
Claude runs an img2img for Po' and gets Po'':
Po' ---> img2img ---> Po''
Claude compares Po with Po'' (FLIP > 90% to pass):
Po = Po''?
Claude's verdict on the MERGE PHASE is 'pass' if FLIP  > 90%

Repeat this loop for 100 random images (or actually 400 random images, grouped by 4) and the number of passes is the percentage score you can use to evaluate the whole thing. Note that layer C must be a small object in the middle of the image (for example, a 200x100 human figure in a 1024x1024 image with a transparent background).
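The phase-1 verdict above is easy to automate. A minimal sketch with a stand-in similarity metric (plain inverted mean absolute error; the real protocol would plug NVIDIA's FLIP metric into the same slot, and the 90% threshold is the one proposed in the comment):

```python
import numpy as np

def similarity_pct(a: np.ndarray, b: np.ndarray) -> float:
    """Stand-in similarity score in [0, 100] for two same-shape float
    images with values in [0, 1]; swap in FLIP for the actual protocol."""
    return float(1.0 - np.abs(a - b).mean()) * 100.0

def split_phase_verdict(originals, extracted, threshold=90.0):
    """Phase 1: compare each original layer A..D with its extracted
    counterpart A'..D'; pass only if every score clears the threshold."""
    scores = [similarity_pct(o, e) for o, e in zip(originals, extracted)]
    return all(s > threshold for s in scores), scores
```

Phase 2 would reuse the same scoring on the shifted-and-recomposited images Po and Po''.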

@filipstrand
Owner

@Emasoft As you say, I probably have to update my priors a little bit about what can be automated. Having coding agents actually inspect the image itself (or write a bunch of similarity metrics) starts to feel very doable at this point.

At the same time, I also believe that as implementations themselves get easier, the verification (manual or automatic) becomes a bigger part of the overall contribution, which I consider included in the PR. It is still much easier to review and accept a PR that has been clearly demonstrated to work in several cases. And this is also a great opportunity for community effort. For example, if one person contributes an implementation, another person might contribute "only" by testing it; but this work is indeed very important, and will become even more so over time, I believe.

This being said, I think I can do a better job communicating this via contributor guidelines or similar (e.g. highlighting what I typically look for when assessing correctness, etc.). Sometimes it might not actually be that extensive (for example, showing a few side-by-sides with diffusers greatly increases my confidence in the correctness of an implementation).



Development

Successfully merging this pull request may close these issues.

[New Model] Qwen-Image-Layered
