
Conversation

@lantudou

The code previously used torch.linalg.svd to compute the full singular value decomposition of the weight matrix. This is computationally expensive and memory-intensive, especially since only the top-rank components are used.

This PR replaces it with torch.svd_lowrank, which uses a randomized algorithm to approximate the dominant singular values efficiently.

Changes:
Switched to torch.svd_lowrank for faster decomposition.
Set niter=4 and q=10 (oversampling) to balance speed and accuracy.
Adjusted the tensor transposition logic to match svd_lowrank's output format (it returns V rather than Vh).
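
For reference, here is a minimal sketch of the kind of replacement described above. The function name, parameter names, and surrounding logic are my own assumptions for illustration, not the actual deepcompressor code:

import torch

# Illustrative only: the helper and its signature are assumptions,
# not the deepcompressor implementation.
def split_lowrank_branch(weight: torch.Tensor, rank: int = 32, oversample: int = 10, niter: int = 4):
    # Old approach: full SVD, expensive and memory-hungry for large weights.
    # U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)

    # Randomized low-rank SVD: only ~(rank + oversample) components are
    # approximated, with `niter` power iterations improving accuracy.
    U, S, V = torch.svd_lowrank(weight.float(), q=rank + oversample, niter=niter)
    # svd_lowrank returns V rather than Vh, so transpose before slicing.
    Vh = V.mT
    return U[:, :rank] * S[:rank], Vh[:rank]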

Performance impact: In my local environment, this change removes a major bottleneck during quantization:

Low-rank creation latency: dropped from ~5 s to ~100 ms.
Total runtime: the overall calibration and quantization process is approximately 6x faster.

@lantudou
Author

lantudou commented Nov 26, 2025

Upon analyzing the logs from the modified Smooth_Diffusion, I observed a minor deviation in the calculated error compared to the original full SVD method.

While this may slightly shift the optimal hyperparameters in the grid search, the computational overhead of the original full SVD was prohibitive. It previously made quantizing large models like Flux extremely time-consuming, forcing us to compromise on grid search granularity. #35

Therefore, I consider this minor difference in results an acceptable trade-off for the efficiency gains.
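
For anyone who wants to gauge this deviation independently, a quick standalone check could look like the following. This is illustrative only; the matrix size, rank, and q are placeholder values, not the settings used in deepcompressor:

import torch

torch.manual_seed(0)
W = torch.randn(4096, 4096)          # stand-in for a large weight matrix
rank = 32

# Exact rank-32 approximation from the full SVD.
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
full_approx = (U[:, :rank] * S[:rank]) @ Vh[:rank]

# Randomized approximation with oversampling and power iterations.
Ul, Sl, Vl = torch.svd_lowrank(W, q=rank + 10, niter=4)
low_approx = (Ul[:, :rank] * Sl[:rank]) @ Vl[:, :rank].mT

err = lambda A: (torch.linalg.norm(W - A) / torch.linalg.norm(W)).item()
print(f"full SVD relative error:    {err(full_approx):.6f}")
print(f"svd_lowrank relative error: {err(low_approx):.6f}")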

@otherV

otherV commented Nov 27, 2025

Please provide some generated images comparing the base model, the quantized model, and your variation of quantization. If there is no or minimal visible quality loss for the same seed, this could save bandwidth in certain scenarios. Thank you for your contribution.

@lantudou
Author

> Please provide some generated images comparing the base model, the quantized model, and your variation of quantization. If there is no or minimal visible quality loss for the same seed, this could save bandwidth in certain scenarios. Thank you for your contribution.

Thank you for your response. I frequently use deepcompressor to quantize my custom models, and I have already validated the effectiveness of this change on them. However, I agree that demonstrating this using standard Flux would be more convincing. This will likely take me about a day to complete.

@lantudou
Author

lantudou commented Nov 28, 2025

import torch
from diffusers import FluxPipeline

from nunchaku import NunchakuFluxTransformer2dModel
from nunchaku.utils import get_precision

precision = get_precision()  # auto-detect 'int4' or 'fp4' precision based on your GPU
transformer = NunchakuFluxTransformer2dModel.from_pretrained(
    f"nunchaku-tech/nunchaku-flux.1-dev/svdq-{precision}_r32-flux.1-dev.safetensors"
)
# transformer = NunchakuFluxTransformer2dModel.from_pretrained(
#     "flux-test"
# )
generator = torch.Generator(device="cpu").manual_seed(42)
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")
image = pipeline("A cat holding a sign that says hello world", num_inference_steps=50, guidance_scale=3.5, generator=generator).images[0]
image.save(f"flux.1-dev-{precision}.png")

Here is the result from the official Nunchaku svdq-int4_r32-flux.1-dev.safetensors:
[image: flux.1-dev-int4]
and here is the result from my variation of the quantized model:
[image: flux.1-dev-int4-test]
@otherV is that ok for you?

@lantudou
Author

Here are the quantization config file and logs:
run-251127.193211.log
config-251127.193211.yaml

@otherV

otherV commented Nov 28, 2025

@synxlin
