
Conversation

@lantudou

The code previously used torch.linalg.svd to compute the full singular value decomposition of the weight matrix. This is computationally expensive and memory-intensive, especially since only the top-rank components are used.

This PR replaces it with torch.svd_lowrank, which uses a randomized algorithm to approximate the dominant singular values efficiently.

Changes:
Switched to torch.svd_lowrank for faster decomposition.
Set niter=4 and q=10 (oversampling) to balance speed and accuracy.
Adjusted the tensor transposition logic to match svd_lowrank's output format (it returns V rather than Vh).
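
For reference, here is a minimal sketch of the kind of replacement described above. The function name, parameter names, and surrounding logic are my own assumptions for illustration, not the actual deepcompressor code:

import torch

# Illustrative only: the helper and its signature are assumptions,
# not the deepcompressor implementation.
def split_lowrank_branch(weight: torch.Tensor, rank: int = 32, oversample: int = 10, niter: int = 4):
    # Old approach: full SVD, expensive and memory-hungry for large weights.
    # U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)

    # Randomized low-rank SVD: only ~(rank + oversample) components are
    # approximated, with `niter` power iterations improving accuracy.
    U, S, V = torch.svd_lowrank(weight.float(), q=rank + oversample, niter=niter)
    # svd_lowrank returns V rather than Vh, so transpose before slicing.
    Vh = V.mT
    return U[:, :rank] * S[:rank], Vh[:rank]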

Performance impact: In my local environment, this change removes a major bottleneck during quantization:

Low-rank creation latency: dropped from ~5 s to ~100 ms.
Total runtime: the overall calibration and quantization process is approximately 6x faster.

@lantudou
Author

lantudou commented Nov 26, 2025

Upon analyzing the logs from the modified Smooth_Diffusion, I observed a minor deviation in the calculated error compared to the original full SVD method.

While this may slightly shift the optimal hyperparameters in the grid search, the computational overhead of the original full SVD was prohibitive. It previously made quantizing large models like Flux extremely time-consuming, forcing us to compromise on grid search granularity. #35

Therefore, I consider this minor difference in results an acceptable trade-off for the efficiency gains.
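
For anyone who wants to gauge this deviation independently, a quick standalone check could look like the following. This is illustrative only; the matrix size, rank, and q are placeholder values, not the settings used in deepcompressor:

import torch

torch.manual_seed(0)
W = torch.randn(4096, 4096)          # stand-in for a large weight matrix
rank = 32

# Exact rank-32 approximation from the full SVD.
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
full_approx = (U[:, :rank] * S[:rank]) @ Vh[:rank]

# Randomized approximation with oversampling and power iterations.
Ul, Sl, Vl = torch.svd_lowrank(W, q=rank + 10, niter=4)
low_approx = (Ul[:, :rank] * Sl[:rank]) @ Vl[:, :rank].mT

err = lambda A: (torch.linalg.norm(W - A) / torch.linalg.norm(W)).item()
print(f"full SVD relative error:    {err(full_approx):.6f}")
print(f"svd_lowrank relative error: {err(low_approx):.6f}")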

@otherV

otherV commented Nov 27, 2025

Please provide some generated images comparing the base model, the quantized model, and your variation of quantization. If there is no or minimal visible quality loss for the same seed, this could save bandwidth in certain scenarios. Thank you for your contribution.

@lantudou
Author

> Please provide some generated images comparing the base model, the quantized model, and your variation of quantization. If there is no or minimal visible quality loss for the same seed, this could save bandwidth in certain scenarios. Thank you for your contribution.

Thank you for your response. I frequently use deepcompressor to quantize my custom models, and I have already validated the effectiveness of this change on them. However, I agree that demonstrating this using standard Flux would be more convincing. This will likely take me about a day to complete.

@lantudou
Author

lantudou commented Nov 28, 2025

import torch
from diffusers import FluxPipeline

from nunchaku import NunchakuFluxTransformer2dModel
from nunchaku.utils import get_precision

precision = get_precision()  # auto-detect 'int4' or 'fp4' precision based on your GPU
transformer = NunchakuFluxTransformer2dModel.from_pretrained(
    f"nunchaku-tech/nunchaku-flux.1-dev/svdq-{precision}_r32-flux.1-dev.safetensors"
)
# transformer = NunchakuFluxTransformer2dModel.from_pretrained(
#     "flux-test"
# )
generator = torch.Generator(device="cpu").manual_seed(42)
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")
image = pipeline("A cat holding a sign that says hello world", num_inference_steps=50, guidance_scale=3.5, generator=generator).images[0]
image.save(f"flux.1-dev-{precision}.png")

Here is the result from the official Nunchaku svdq-int4_r32-flux.1-dev.safetensors:
[image: flux.1-dev-int4]
and here is the result from my variation of the quantized model:
[image: flux.1-dev-int4-test]
@otherV is that ok for you?

@lantudou
Author

Here are the quantization config file and logs:
run-251127.193211.log
config-251127.193211.yaml

@otherV

otherV commented Nov 28, 2025

@synxlin
