Perf: Replace full SVD with torch.svd_lowrank for acceleration #111


Previously, the code used torch.linalg.svd to compute the full singular value decomposition of the weight matrix. This is computationally expensive and memory-intensive, especially since only the top-rank components are actually used.
This PR replaces it with torch.svd_lowrank, which uses a randomized algorithm to approximate the dominant singular values and vectors efficiently.
Changes:
- Switched to torch.svd_lowrank for faster decomposition.
- Set niter=4 and q=10 (oversampling) to balance speed and accuracy.
- Adjusted the tensor transposition logic to match svd_lowrank's output format (it returns V rather than Vh).
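The switch described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper name, the `rank + oversampling` choice for `q`, and the way singular values are folded into the factors are all assumptions.

```python
import torch

def lowrank_factors(weight: torch.Tensor, rank: int, oversample: int = 10, niter: int = 4):
    """Approximate the top-`rank` SVD factors of `weight` via randomized SVD.

    Hypothetical helper illustrating the change in this PR; parameter
    names and the factorization convention are assumptions.
    """
    # torch.svd_lowrank returns (U, S, V) with weight ~= U @ diag(S) @ V.T.
    # Note it returns V, not Vh as torch.linalg.svd does, hence the
    # transposition adjustment mentioned in the change list.
    U, S, V = torch.svd_lowrank(weight, q=rank + oversample, niter=niter)
    U, S, V = U[:, :rank], S[:rank], V[:, :rank]
    # Fold singular values symmetrically into both factors (one common convention).
    A = U * S.sqrt()        # shape (out_features, rank)
    B = (V * S.sqrt()).T    # shape (rank, in_features)
    return A, B
```

For a matrix whose true rank is at most `rank`, the product `A @ B` reconstructs it almost exactly; for full-rank weights it approximates the best rank-`rank` factorization.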
Performance impact: in my local environment, this optimization removes a major bottleneck during quantization:
- Low-rank creation latency: dropped from ~5 s to ~100 ms.
- Total runtime: the overall calibration and quantization process is roughly 6x faster.
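A quick way to reproduce this kind of comparison locally is sketched below. The matrix size and target rank are illustrative assumptions; absolute timings will differ by hardware, and the gap grows with matrix size.

```python
import time
import torch

W = torch.randn(1024, 1024)  # illustrative weight matrix
rank = 64

# Full SVD: computes all singular values/vectors.
t0 = time.perf_counter()
U_full, S_full, Vh_full = torch.linalg.svd(W, full_matrices=False)
full_t = time.perf_counter() - t0

# Randomized low-rank SVD: only approximates the top components.
t0 = time.perf_counter()
U, S, V = torch.svd_lowrank(W, q=rank + 10, niter=4)
low_t = time.perf_counter() - t0

print(f"full SVD: {full_t:.3f}s, svd_lowrank: {low_t:.3f}s")
```

Note the shape difference: `torch.linalg.svd` returns `Vh` of shape `(1024, 1024)`, while `torch.svd_lowrank` returns `V` of shape `(1024, rank + 10)`.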