Release and allocate VGPR resoures in tail loop. #1586

hcman2 · 2025-01-23T06:42:39Z

Add a VGPR base index definition.
Rearange VGPR index order for further optimization.
Re-allocate VGPR for tail loop.

Fix 6 potential bugs:

DTV will use the Valu VGPRs which is released in the beginning of tail loop.
BiasSum Valu VGPR should be used when endSum but is released in the beginning of tail loop.
_replaceActBranchLabel() always replaces label without postfix.
However we should check the label we'd really like to replace with.
For DTL, numVgprG2LAllocated is not set so that it will be
default=-1.
Fix G2LA vgpr allocation bug for navi3x
Fix LocalWrite vgpr checkout and index bug.

hcman2 · 2025-02-12T05:53:24Z

[----------] Global test environment tear-down
[==========] 53119 tests from 13 test suites ran. (2655484 ms total)
[ PASSED ] 53119 tests.

hcman2 · 2025-02-14T06:24:35Z

================== 9 passed, 112 skipped in 576.73s (0:09:36) ==================
gfx1100 tox passed in this version.
�[33m=========== �[32m86 passed�[0m, �[33m�[1m36 skipped�[0m, �[33m�[1m3 warnings�[0m�[33m in 5284.64s (1:28:04)�[0m�[33m ============�[0m
gfx942 tox passed in this version.

hcman2 · 2025-02-18T06:53:20Z

The idea to rearrange the VGPR is in this figure. Put the MISC (addr and offset, etc) in the beginning of VGPR allocation. The second is the VALU VGPRs. The Third is the VG2L VGPRs. If the begining of VALU is not 2-alligned, move the LocalReadAddressA/B here to make VALU 2-aligned. And the final vgpr is still the vgprSerial.

hcman2 · 2025-02-19T01:12:25Z

tensilelite/Tensile/KernelWriter.py

-          self.states.lastValuAB - self.states.a.startVgprValu, "ValuAB") # Add as available
-        module.addComment1("Tail: add ValuA/B vgpr buffer [%u...%u) to pool" % \
-                          (self.states.a.startVgprValu, self.states.a.startVgprValu+self.states.lastValuAB))
-


This is for DTV will use the Valu VGPRs which is released in the beginning of tail loop. @Serge45 @solaslin Please check this part of tail loop because you are engaging to Swizzle+DTV.

hcman2 · 2025-02-19T01:13:02Z

tensilelite/Tensile/KernelWriter.py

+      # Check out VGPR for G2l
+      moduleMacroG2lVgpr, vgprG2L = self.tailLoopAllocG2LVgpr(kernel)
+      module.add(moduleMacroG2lVgpr)
+


This is for DTV will use the Valu VGPRs which is released in the beginning of tail loop. @Serge45 @solaslin Please check this part of tail loop because you are engaging to Swizzle+DTV. I move the tail loop VGPR allocation for DTV here.

hcman2 · 2025-02-19T01:15:32Z

tensilelite/Tensile/KernelWriterAssembly.py

+    if isOptNLL:
+      vbegin = self.states.startVgprMisc
+      vsize = self.states.lastVgprForReads - vbegin
+      if self.states.a.startVgprLocalReadAddr > self.states.lastVgprForReads:


In OptNLL, we can release all of resources.

hcman2 · 2025-02-19T01:19:14Z

tensilelite/Tensile/KernelWriterAssembly.py

-    self.vgprPool.add(vbegin, vsize, "endSummation")
-    module.addComment0("endSummation: add vgpr [%u...%u) to pool" % \
-                      (vbegin, vbegin+vsize))
+      vbegin = self.states.bias.startVgprValu


This is for BiasSum Valu VGPR should be used when endSum but is released in the beginning of tail loop.

hcman2 · 2025-02-19T01:22:46Z

tensilelite/Tensile/KernelWriter.py


-    # GlobalRead, LocalWrite, LocalRead, G2L can be reclaimed, extend the "lastVgprForReads" value
-    if kernel["PrefetchGlobalRead"]:
-      self.states.lastVgprForReads = vgprIdx


This is for BiasSum Valu VGPR should be used when endSum but is released in the beginning of tail loop. It is a part of lastVgprForReads so that it may be temporally add back to the pool and used before StoreToLds. @KKyang

hcman2 · 2025-02-19T01:29:03Z

The 8e1801b passed CI. The 5b85731 is only for remove an incorrect comment.

hcman2 · 2025-02-19T01:29:51Z

tensilelite/Tensile/TensilePass.py

@@ -67,7 +67,11 @@ def _replaceActBranchLabel(module, labels):
    for item in module.items():
        if isinstance(item, Module):
            if "InsertActFunctionCallAddrCalc" in item.name:
-                labelLeft = labels[1:]
+                labelFirst = labels[0]


This is for _replaceActBranchLabel() always replaces label without postfix.
However we should check the label we'd really like to replace with. @KKyang

hcman2 · 2025-02-19T01:32:12Z

tensilelite/Tensile/KernelWriter.py

-        tpA      = self.states.bpr if bpeMax * vwa < self.states.bpr else bpeMax * vwa
-        tpALocal = self.states.bpr if tensorParametersA["bpe"] * vwa < self.states.bpr else tensorParametersA["bpe"] * vwa
+        # This check is to reserve porential usage of VGPRs for gfx12 8-bit code gen
+        # We should optimize the usage for better performance.


Note that the gfx12 may have potential VGPR usage issue. @cmingch
In general this condition should not include HasWMMA_V1.

hcman2 · 2025-02-19T01:33:05Z

tensilelite/Tensile/KernelWriterAssembly.py

-                  localWriteCVTCode.add(VCvtF16toF32(dst=vgpr(vgprTmp+f16Tobf16Idx), src=vgpr(destVgprPrefix + "+%u"%(f16Tobf16Idx))))
-                  localWriteCVTCode.add(VCvtF16toF32(dst=vgpr(vgprTmp+1+f16Tobf16Idx), src=vgpr(destVgprPrefix + "+%u"%(f16Tobf16Idx)),sdwa=sdwa))
-                  localWriteCVTCode.add(VPackF16toB32(dst=vgpr(destVgprPrefix + "+%u"%(f16Tobf16Idx)), src0=vgpr(vgprTmp+f16Tobf16Idx), src1=vgpr(vgprTmp+1+f16Tobf16Idx),
+                  localWriteCVTCode.add(VCvtF16toF32(dst=vgpr(vgprTmp), src=vgpr(destVgprPrefix + "+%u"%(f16Tobf16Idx))))


This is for Fix LocalWrite vgpr checkout and index bug.

1. Add a VGPR base index definition. 2. Rearange VGPR index order for further optimization. 3. Re-allocate VGPR for tail loop. Fix 6 potential bugs: 1. DTV will use the Valu VGPRs which is released in the beginning of tail loop. 2. BiasSum Valu VGPR should be used when endSum but is released in the beginning of tail loop. 3. _replaceActBranchLabel() always replaces label without postfix. However we should check the label we'd really like to replace with. 4. For DTL, numVgprG2LAllocated is not set so that it will be default=-1. 5. Fix G2LA vgpr allocation bug for navi3x. 6. Fix LocalWrite vgpr index bug. This change is the first step to optimize the VGPR usage in unroll loop. In general, the VGPRs usage in the unrolled loop is dependent from the tail. In tail, the VGPR can be used more effectively.

hcman2 requested review from jichangjichang, KKyang, vin-huang, imcarsonliao, Serge45, Jinp800125, TonyYHsieh and solaslin as code owners January 23, 2025 06:42

hcman2 changed the title ~~Release and allocate VGPR resoures in tail loop.~~ [Draft] Release and allocate VGPR resoures in tail loop. Jan 23, 2025

hcman2 force-pushed the vgpr branch 2 times, most recently from 4e58993 to a053a82 Compare January 24, 2025 06:05

hcman2 force-pushed the vgpr branch 2 times, most recently from 05127bf to 33928c3 Compare February 3, 2025 08:40

solaslin added the noCI Disable testing on supported CI systems: math libraries CI has this feature enabled.. label Feb 5, 2025

hcman2 force-pushed the vgpr branch 6 times, most recently from 7f6c281 to 4f7c8cd Compare February 10, 2025 06:01

hcman2 added noCI Disable testing on supported CI systems: math libraries CI has this feature enabled.. and removed noCI Disable testing on supported CI systems: math libraries CI has this feature enabled.. labels Feb 10, 2025

hcman2 changed the title ~~[Draft] Release and allocate VGPR resoures in tail loop.~~ Release and allocate VGPR resoures in tail loop. Feb 10, 2025

hcman2 force-pushed the vgpr branch 3 times, most recently from 7fb8dc8 to ffa0a78 Compare February 12, 2025 02:40

hcman2 force-pushed the vgpr branch 2 times, most recently from b9338c4 to 4136665 Compare February 12, 2025 09:41

hcman2 force-pushed the vgpr branch 6 times, most recently from aa61bb4 to 5bd0337 Compare February 14, 2025 03:18

hcman2 force-pushed the vgpr branch 4 times, most recently from 0504efd to 5b8573e Compare February 18, 2025 02:21

hcman2 commented Feb 19, 2025

View reviewed changes

hcman2 force-pushed the vgpr branch from 5b8573e to 8e1801b Compare February 19, 2025 01:27

hcman2 commented Feb 19, 2025

View reviewed changes

hcman2 force-pushed the vgpr branch from 8e1801b to 013758c Compare February 19, 2025 01:50

KKyang approved these changes Feb 20, 2025

View reviewed changes

ssuyuanchang approved these changes Feb 20, 2025

View reviewed changes

Serge45 approved these changes Feb 20, 2025

View reviewed changes

hcman2 merged commit 9f3ea97 into ROCm:develop Feb 20, 2025
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release and allocate VGPR resoures in tail loop. #1586

Release and allocate VGPR resoures in tail loop. #1586

hcman2 commented Jan 23, 2025 •

edited

Loading

hcman2 commented Feb 12, 2025

hcman2 commented Feb 14, 2025

hcman2 commented Feb 18, 2025

hcman2 Feb 19, 2025

hcman2 Feb 19, 2025

hcman2 Feb 19, 2025

hcman2 Feb 19, 2025

hcman2 Feb 19, 2025

hcman2 commented Feb 19, 2025

hcman2 Feb 19, 2025

hcman2 Feb 19, 2025

hcman2 Feb 19, 2025

Release and allocate VGPR resoures in tail loop. #1586

Release and allocate VGPR resoures in tail loop. #1586

Conversation

hcman2 commented Jan 23, 2025 • edited Loading

hcman2 commented Feb 12, 2025

hcman2 commented Feb 14, 2025

hcman2 commented Feb 18, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hcman2 commented Feb 19, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hcman2 commented Jan 23, 2025 •

edited

Loading