
Conversation

leisuzz (Contributor) commented Dec 18, 2025

What does this PR do?

  1. I got the error:

     ```python
     raise ValueError(f"Expected image_latents to be a list, got {type(image_latents)}.")
     ```

     (1) `cond_model_input_list` goes into `_prepare_image_ids` as a list of `[[1, cond_model_input[0], cond_model_input[1], cond_model_input[2]], ...]`.

     (2) Because `_prepare_image_ids` in the pipeline runs `torch.cat(image_latent_ids, dim=0)`, the shapes no longer match in the training step at `model_input_ids = torch.cat([model_input_ids, cond_model_input_ids], dim=1)`: `cond_model_input_ids.shape[0]` is 1, while `model_input_ids.shape[0]` is the batch size. The `cond_model_input_ids.view` call reshapes the ids to meet that requirement (see the shape sketch after this list).

     With this change, the script also works when the batch size is larger than 1.

  2. When I only changed `cond_model_input` to a list, the training loss was abnormal (starting at ~1.7, which is far too high). So I also fixed the model prediction to match the pipeline, and the loss became reasonable (starting at ~0.4).
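A minimal shape sketch of the mismatch described in (1)–(2) (PyTorch; the token counts and the id width of 4 are illustrative assumptions, not the script's real values):

```python
import torch

batch, img_tokens, cond_tokens = 2, 16, 16

# per-sample image ids kept batch-first, as the training step expects
model_input_ids = torch.zeros(batch, img_tokens, 4)

# _prepare_image_ids concatenates the per-image ids along dim=0, so the batch
# dimension collapses and the result arrives with a leading dim of 1
cond_model_input_ids = torch.zeros(1, batch * cond_tokens, 4)

# torch.cat(..., dim=1) would fail here (dim 0 is 2 vs 1); the .view call in
# the fix restores the batch dimension first
cond_model_input_ids = cond_model_input_ids.view(batch, -1, model_input_ids.shape[-1])
model_input_ids = torch.cat([model_input_ids, cond_model_input_ids], dim=1)
print(model_input_ids.shape)  # torch.Size([2, 32, 4])
```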

With the code:

```python
orig_inp_shape = packed_noisy_model_input.shape
orig_inp_ids_shape = model_input_ids.shape
model_pred = model_pred[:, : orig_inp_shape[1], :]
model_input_ids = model_input_ids[:, : orig_inp_ids_shape[1], :]
```
The training loss is:
Steps:   0%|          | 0/3500 [00:00<?, ?it/s]
Steps:   0%|          | 1/3500 [00:08<8:28:06,  8.71s/it]
Steps:   0%|          | 1/3500 [00:13<8:28:06,  8.71s/it, loss=0.328, lr=1e-5]
Steps:   0%|          | 2/3500 [00:21<10:38:52, 10.96s/it, loss=0.328, lr=1e-5]
Steps:   0%|          | 2/3500 [00:26<10:38:52, 10.96s/it, loss=0.835, lr=1e-5]
Steps:   0%|          | 3/3500 [00:34<11:30:01, 11.84s/it, loss=0.835, lr=1e-5]
Steps:   0%|          | 3/3500 [00:39<11:30:01, 11.84s/it, loss=0.254, lr=1e-5]
Steps:   0%|          | 4/3500 [00:46<11:52:09, 12.22s/it, loss=0.254, lr=1e-5]
Steps:   0%|          | 4/3500 [00:52<11:52:09, 12.22s/it, loss=0.405, lr=1e-5]
Steps:   0%|          | 5/3500 [00:59<12:05:53, 12.46s/it, loss=0.405, lr=1e-5]
Steps:   0%|          | 5/3500 [01:05<12:05:53, 12.46s/it, loss=1.03, lr=1e-5] 
Steps:   0%|          | 6/3500 [01:12<12:15:03, 12.62s/it, loss=1.03, lr=1e-5]
Steps:   0%|          | 6/3500 [01:18<12:15:03, 12.62s/it, loss=0.574, lr=1e-5]
Steps:   0%|          | 7/3500 [01:25<12:20:52, 12.73s/it, loss=0.574, lr=1e-5]
Steps:   0%|          | 7/3500 [01:31<12:20:52, 12.73s/it, loss=0.29, lr=1e-5] 
Steps:   0%|          | 8/3500 [01:38<12:24:32, 12.79s/it, loss=0.29, lr=1e-5]
Steps:   0%|          | 8/3500 [01:44<12:24:32, 12.79s/it, loss=0.393, lr=1e-5]
Steps:   0%|          | 9/3500 [01:51<12:27:38, 12.85s/it, loss=0.393, lr=1e-5]
Steps:   0%|          | 9/3500 [01:57<12:27:38, 12.85s/it, loss=0.336, lr=1e-5]

With the original code:

```python
model_pred = model_pred[:, : packed_noisy_model_input.size(1) :]
model_pred = Flux2Pipeline._unpack_latents_with_ids(model_pred, model_input_ids)
```

The training loss is:

Steps:   0%|          | 1/5000 [00:46<64:57:32, 46.78s/it]
Steps:   0%|          | 1/5000 [00:46<64:57:32, 46.78s/it, loss=2.01, lr=1e-5]
Steps:   0%|          | 2/5000 [01:15<50:29:04, 36.36s/it, loss=2.01, lr=1e-5]
Steps:   0%|          | 2/5000 [01:15<50:29:04, 36.36s/it, loss=2.08, lr=1e-5]
Steps:   0%|          | 3/5000 [01:47<47:31:01, 34.23s/it, loss=2.08, lr=1e-5]
Steps:   0%|          | 3/5000 [01:47<47:31:01, 34.23s/it, loss=1.83, lr=1e-5]
Steps:   0%|          | 4/5000 [02:18<45:54:39, 33.08s/it, loss=1.83, lr=1e-5]
Steps:   0%|          | 4/5000 [02:18<45:54:39, 33.08s/it, loss=1.99, lr=1e-5]
Steps:   0%|          | 5/5000 [02:47<43:39:23, 31.46s/it, loss=1.99, lr=1e-5]
Steps:   0%|          | 5/5000 [02:47<43:39:23, 31.46s/it, loss=2.02, lr=1e-5]
Steps:   0%|          | 6/5000 [03:16<42:28:13, 30.62s/it, loss=2.02, lr=1e-5]
Steps:   0%|          | 6/5000 [03:16<42:28:13, 30.62s/it, loss=2.01, lr=1e-5]
Steps:   0%|          | 7/5000 [03:42<40:32:24, 29.23s/it, loss=2.01, lr=1e-5]
Steps:   0%|          | 7/5000 [03:42<40:32:24, 29.23s/it, loss=1.83, lr=1e-5]
Steps:   0%|          | 8/5000 [04:12<40:37:29, 29.30s/it, loss=1.83, lr=1e-5]
Steps:   0%|          | 8/5000 [04:12<40:37:29, 29.30s/it, loss=1.92, lr=1e-5]


Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Co-authored-by: @tcaimm

leisuzz changed the title from "Bugfix for dreambooth flux2 img2img2" to "Bugfix for flux2 img2img2 prediction" on Dec 18, 2025
leisuzz (Contributor, Author) commented Dec 18, 2025

@sayakpaul Please take a look at this PR. Thank you for your help!

sayakpaul (Member) commented:

Do you have a reproducer?

leisuzz (Contributor, Author) commented Dec 18, 2025

@sayakpaul I've updated the result in the description, thanks :)

leisuzz (Contributor, Author) commented Jan 5, 2026

@linoytsaban Please take a look at this PR. Thank you for your help!

tcaimm (Contributor) commented Jan 11, 2026

I noticed the anomaly in the loss a while ago. The main issue is that the img2img training logic needs to concatenate the condition along the token dimension and then prune the condition tokens away before computing the loss. Flux2 does this on both `model_input` and `model_input_ids`, so the original sizes need to be recorded before the concatenation, similar to the operation in train_dreambooth_lora_flux_kontext.py. The img2img training script appears to be a direct copy and modification of the txt2img one, with the output-pruning step forgotten. The modifications after line 1727 are as follows:

```python
# concatenate the model inputs with the cond inputs
orig_inp_shape = packed_noisy_model_input.shape
orig_inp_ids_shape = model_input_ids.shape
packed_noisy_model_input = torch.cat([packed_noisy_model_input, packed_cond_model_input], dim=1)
model_input_ids = torch.cat([model_input_ids, cond_model_input_ids], dim=1)

# handle guidance
guidance = torch.full([1], args.guidance_scale, device=accelerator.device)
guidance = guidance.expand(model_input.shape[0])

# Predict the noise residual
model_pred = transformer(
...

# prune the condition tokens so the prediction matches the original input length
model_pred = model_pred[:, : orig_inp_shape[1], :]
model_input_ids = model_input_ids[:, : orig_inp_ids_shape[1], :]
model_pred = Flux2Pipeline._unpack_latents_with_ids(model_pred, model_input_ids)
```
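A self-contained sketch of this concat-then-prune pattern (dummy tensors; the transformer call is replaced by an identity stand-in, and all shapes are assumptions for illustration):

```python
import torch

batch, noisy_tokens, cond_tokens, channels = 2, 16, 16, 8

packed_noisy_model_input = torch.randn(batch, noisy_tokens, channels)
packed_cond_model_input = torch.randn(batch, cond_tokens, channels)

# record the original token length before concatenating along dim=1
orig_inp_len = packed_noisy_model_input.shape[1]

packed = torch.cat([packed_noisy_model_input, packed_cond_model_input], dim=1)
model_pred = packed  # stand-in for the transformer output (same token length)

# prune the condition tokens so the loss covers only the noisy tokens
model_pred = model_pred[:, :orig_inp_len, :]
assert model_pred.shape == packed_noisy_model_input.shape
```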

This is the core problem with this training script; please fix it as soon as possible.
@sayakpaul, @linoytsaban Thank you for your help.

sayakpaul (Member) commented:

@tcaimm thanks for pointing that out. Since you have already characterized the bug and proposed a solution, would you like to open a PR? That way, your contribution will stay within the library :-)

leisuzz (Contributor, Author) commented Jan 12, 2026

@sayakpaul Perhaps I can add @tcaimm as a co-author on this PR. After I change the line:

```python
model_input_ids = model_input_ids[:, :noisy_len:]
```

the lines:

```python
cond_model_input_list = [cond_model_input[i].unsqueeze(0) for i in range(cond_model_input.shape[0])]
cond_model_input_ids = Flux2Pipeline._prepare_image_ids(cond_model_input_list).to(
    device=cond_model_input.device
)
cond_model_input_ids = cond_model_input_ids.view(
    cond_model_input.shape[0], -1, model_input_ids.shape[-1]
)
```

are still needed.
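For illustration, a tiny sketch of what the list conversion above produces (the latent shape here is hypothetical, chosen only for the demo):

```python
import torch

cond_model_input = torch.randn(2, 8, 4, 4)  # illustrative (B, C, H, W) latent
cond_model_input_list = [cond_model_input[i].unsqueeze(0) for i in range(cond_model_input.shape[0])]
print(len(cond_model_input_list), cond_model_input_list[0].shape)  # 2 torch.Size([1, 8, 4, 4])
```

This hands the pipeline helper the per-sample list it expects instead of a single batched tensor.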

sayakpaul (Member) commented:

Sure, that works.

leisuzz (Contributor, Author) commented Jan 12, 2026

@sayakpaul @tcaimm Please take a look

tcaimm (Contributor) commented Jan 12, 2026

> @sayakpaul @tcaimm Please take a look

Thanks for the update! I’ve taken a look at the changes, and they look great to me.

Since we collaborated on this, would you mind adding me as a co-author in the final squash/merge commit? This helps GitHub track the contribution correctly. You can add this line to the bottom of the commit message:

Co-authored-by: tcaimm [email protected]

Looking forward to seeing this merged!

leisuzz (Contributor, Author) commented Jan 12, 2026

> Since we collaborated on this, would you mind adding me as a co-author in the final squash/merge commit?

Done!

HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

linoytsaban (Collaborator) left a comment:

Thanks a lot @leisuzz!

sayakpaul merged commit 29a930a into huggingface:main on Jan 12, 2026 (24 of 26 checks passed)
sayakpaul (Member) commented:

Thanks for the awesome contributions!
