optimization of perKVhead quantization #4161

Aya-ZIbra · 2025-05-20T19:12:56Z

Summary:
y-sq noticed that for a 64k prefill chunk, the improvement in attention kernel runtime for local layers is cancelled out by roughly 0.4 ms of overhead from the quantization kernel.

https://docs.google.com/document/d/193GL7o5GMlpVlwEDVxqoDB6O85zDuS8A5-PUOSsZc1s/edit?tab=t.0#bookmark=id.zh92spta1uxw
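For readers unfamiliar with the term, the following is a minimal sketch of what per-KV-head quantization computes, not the FBGEMM kernel itself. It assumes symmetric scaling into the FP8 E4M3 range (maximum finite value 448) and a `[head][token][dim]` layout. The names `per_kv_head_scales` and `quantize_per_kv_head` are hypothetical and do not come from this PR. The amax reduction over every token of a head is the part that grows with sequence length.

```python
# Sketch only: per-KV-head symmetric scaling in plain Python.
# The FP8 E4M3 target and the [head][token][dim] layout are assumptions,
# not details taken from this PR.
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def per_kv_head_scales(kv, eps=1e-12):
    """Return one scale per KV head: (amax over all tokens and dims) / FP8 max."""
    scales = []
    for head in kv:  # kv[head][token][dim]
        amax = max((abs(x) for token in head for x in token), default=0.0)
        # eps guards against a zero scale when a head is all zeros.
        scales.append(max(amax, eps) / FP8_E4M3_MAX)
    return scales


def quantize_per_kv_head(kv, scales):
    """Divide each head by its scale and clamp to the FP8 range.

    The final cast to an 8-bit float type is omitted in this sketch.
    """
    return [
        [[min(max(x / s, -FP8_E4M3_MAX), FP8_E4M3_MAX) for x in token]
         for token in head]
        for head, s in zip(kv, scales)
    ]


if __name__ == "__main__":
    # Two KV heads, two tokens each, head_dim = 2.
    kv = [[[1.0, -2.0], [0.5, 4.0]], [[0.1, 0.2], [-0.4, 0.3]]]
    scales = per_kv_head_scales(kv)
    print(scales)
    print(quantize_per_kv_head(kv, scales))
```

Each head gets a single scale, so the largest-magnitude element of every head maps to the edge of the FP8 range.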

Before:
BS = 1, Seqlen = 64k
----------------------- ----------- ------------
Elapsed Cycles          cycle       530,268
Memory Throughput       %           17.24
Duration                us          392.93
----------------------- ----------- ------------

After:
----------------------- ----------- ------------
DRAM Frequency          Ghz         1.59
SM Frequency            Ghz         1.34
Elapsed Cycles          cycle       192,884
Memory Throughput       %           46.01
DRAM Throughput         %           46.01
Duration                us          143.23
L1/TEX Cache Throughput %           15.15
L2 Cache Throughput     %           39.31
SM Active Cycles        cycle       181,953.16
Compute (SM) Throughput %           71.92
----------------------- ----------- ------------
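As a quick sanity check on the profiles above, the short script below works out the speedup they imply. It uses only the Before/After numbers reported in this PR. The variable names are illustrative.

```python
# Numbers copied from the Before/After profiles in this PR description.
before_us, after_us = 392.93, 143.23
before_cycles, after_cycles = 530_268, 192_884

speedup_time = before_us / after_us            # ratio of kernel durations
speedup_cycles = before_cycles / after_cycles  # ratio of elapsed cycles
saved_us = before_us - after_us                # time saved per kernel launch

print(f"duration speedup: {speedup_time:.2f}x")
print(f"cycle speedup:    {speedup_cycles:.2f}x")
print(f"time saved:       {saved_us:.2f} us")
```

Both ratios come out at roughly 2.7x, and the saving of about 0.25 ms recovers most of the roughly 0.4 ms overhead mentioned in the summary.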

Reviewed By: y-sq

Differential Revision: D74924275

netlify · 2025-05-20T19:13:00Z

✅ Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
🔨 Latest commit	`6005e07`
🔍 Latest deploy log	https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/682d4a0bbfafd60008706593
😎 Deploy Preview	https://deploy-preview-4161--pytorch-fbgemm-docs.netlify.app

facebook-github-bot · 2025-05-20T19:13:09Z

This pull request was exported from Phabricator. Differential Revision: D74924275

Summary: as title. This is needed to handle this case: https://www.internalfb.com/diff/D73833204?dst_version_fbid=9500286030082255&transaction_fbid=676020828512263
This will help avoid the amax calculation in RoPE for decode and partial-prefill batch lanes. We can also rely on it in Kernel2 to return early and avoid unnecessary quantization.
Differential Revision: D73478483
Reviewed By: y-sq

Summary: X-link: facebookresearch/FBGEMM#1241

facebook-github-bot · 2025-05-21T03:29:12Z

This pull request was exported from Phabricator. Differential Revision: D74924275

Summary: Pull Request resolved: pytorch#4161 X-link: facebookresearch/FBGEMM#1241

facebook-github-bot · 2025-05-21T03:35:28Z

This pull request was exported from Phabricator. Differential Revision: D74924275

facebook-github-bot · 2025-05-23T18:08:51Z

This pull request has been merged in aa2fe3d.

facebook-github-bot added the cla signed label May 20, 2025

facebook-github-bot added the fb-exported label May 20, 2025

Aya-ZIbra force-pushed the export-D74924275 branch from 5182026 to 8e14af4 Compare May 21, 2025 03:26

Aya-ZIbra force-pushed the export-D74924275 branch from 8e14af4 to 2779dd7 Compare May 21, 2025 03:26

Aya-ZIbra force-pushed the export-D74924275 branch from 2779dd7 to f2a130f Compare May 21, 2025 03:29

Aya-ZIbra force-pushed the export-D74924275 branch from f2a130f to 6005e07 Compare May 21, 2025 03:35

facebook-github-bot closed this in aa2fe3d May 23, 2025

facebook-github-bot added the Merged label May 23, 2025
