Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release and allocate VGPR resoures in tail loop. #1586

Merged
merged 1 commit into from
Feb 20, 2025
Merged

Conversation

hcman2
Copy link
Contributor

@hcman2 hcman2 commented Jan 23, 2025

  1. Add a VGPR base index definition.
  2. Rearange VGPR index order for further optimization.
  3. Re-allocate VGPR for tail loop.

Fix 6 potential bugs:

  1. DTV will use the Valu VGPRs which is released in the beginning of tail loop.
  2. BiasSum Valu VGPR should be used when endSum but is released in the beginning of tail loop.
  3. _replaceActBranchLabel() always replaces label without postfix.
    However we should check the label we'd really like to replace with.
  4. For DTL, numVgprG2LAllocated is not set so that it will be
    default=-1.
  5. Fix G2LA vgpr allocation bug for navi3x
  6. Fix LocalWrite vgpr checkout and index bug.

@hcman2 hcman2 changed the title Release and allocate VGPR resoures in tail loop. [Draft] Release and allocate VGPR resoures in tail loop. Jan 23, 2025
@hcman2 hcman2 force-pushed the vgpr branch 2 times, most recently from 4e58993 to a053a82 Compare January 24, 2025 06:05
@hcman2 hcman2 force-pushed the vgpr branch 2 times, most recently from 05127bf to 33928c3 Compare February 3, 2025 08:40
@solaslin solaslin added the noCI Disable testing on supported CI systems: math libraries CI has this feature enabled.. label Feb 5, 2025
@hcman2 hcman2 force-pushed the vgpr branch 6 times, most recently from 7f6c281 to 4f7c8cd Compare February 10, 2025 06:01
@hcman2 hcman2 added noCI Disable testing on supported CI systems: math libraries CI has this feature enabled.. and removed noCI Disable testing on supported CI systems: math libraries CI has this feature enabled.. labels Feb 10, 2025
@hcman2 hcman2 changed the title [Draft] Release and allocate VGPR resoures in tail loop. Release and allocate VGPR resoures in tail loop. Feb 10, 2025
@hcman2 hcman2 force-pushed the vgpr branch 3 times, most recently from 7fb8dc8 to ffa0a78 Compare February 12, 2025 02:40
@hcman2
Copy link
Contributor Author

hcman2 commented Feb 12, 2025

[----------] Global test environment tear-down
[==========] 53119 tests from 13 test suites ran. (2655484 ms total)
[ PASSED ] 53119 tests.

@hcman2 hcman2 force-pushed the vgpr branch 2 times, most recently from b9338c4 to 4136665 Compare February 12, 2025 09:41
@hcman2 hcman2 force-pushed the vgpr branch 6 times, most recently from aa61bb4 to 5bd0337 Compare February 14, 2025 03:18
@hcman2
Copy link
Contributor Author

hcman2 commented Feb 14, 2025

================== 9 passed, 112 skipped in 576.73s (0:09:36) ==================
gfx1100 tox passed in this version.
�[33m=========== �[32m86 passed�[0m, �[33m�[1m36 skipped�[0m, �[33m�[1m3 warnings�[0m�[33m in 5284.64s (1:28:04)�[0m�[33m ============�[0m
gfx942 tox passed in this version.

@hcman2 hcman2 force-pushed the vgpr branch 4 times, most recently from 0504efd to 5b8573e Compare February 18, 2025 02:21
@hcman2
Copy link
Contributor Author

hcman2 commented Feb 18, 2025

image The idea to rearrange the VGPR is in this figure. Put the MISC (addr and offset, etc) in the beginning of VGPR allocation. The second is the VALU VGPRs. The Third is the VG2L VGPRs. If the begining of VALU is not 2-alligned, move the LocalReadAddressA/B here to make VALU 2-aligned. And the final vgpr is still the vgprSerial.

self.states.lastValuAB - self.states.a.startVgprValu, "ValuAB") # Add as available
module.addComment1("Tail: add ValuA/B vgpr buffer [%u...%u) to pool" % \
(self.states.a.startVgprValu, self.states.a.startVgprValu+self.states.lastValuAB))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for DTV will use the Valu VGPRs which is released in the beginning of tail loop. @Serge45 @solaslin Please check this part of tail loop because you are engaging to Swizzle+DTV.

# Check out VGPR for G2l
moduleMacroG2lVgpr, vgprG2L = self.tailLoopAllocG2LVgpr(kernel)
module.add(moduleMacroG2lVgpr)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for DTV will use the Valu VGPRs which is released in the beginning of tail loop. @Serge45 @solaslin Please check this part of tail loop because you are engaging to Swizzle+DTV. I move the tail loop VGPR allocation for DTV here.

if isOptNLL:
vbegin = self.states.startVgprMisc
vsize = self.states.lastVgprForReads - vbegin
if self.states.a.startVgprLocalReadAddr > self.states.lastVgprForReads:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In OptNLL, we can release all of resources.

self.vgprPool.add(vbegin, vsize, "endSummation")
module.addComment0("endSummation: add vgpr [%u...%u) to pool" % \
(vbegin, vbegin+vsize))
vbegin = self.states.bias.startVgprValu
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for BiasSum Valu VGPR should be used when endSum but is released in the beginning of tail loop.


# GlobalRead, LocalWrite, LocalRead, G2L can be reclaimed, extend the "lastVgprForReads" value
if kernel["PrefetchGlobalRead"]:
self.states.lastVgprForReads = vgprIdx
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for BiasSum Valu VGPR should be used when endSum but is released in the beginning of tail loop. It is a part of lastVgprForReads so that it may be temporally add back to the pool and used before StoreToLds. @KKyang

@hcman2
Copy link
Contributor Author

hcman2 commented Feb 19, 2025

The 8e1801b passed CI. The 5b85731 is only for remove an incorrect comment.

@@ -67,7 +67,11 @@ def _replaceActBranchLabel(module, labels):
for item in module.items():
if isinstance(item, Module):
if "InsertActFunctionCallAddrCalc" in item.name:
labelLeft = labels[1:]
labelFirst = labels[0]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for _replaceActBranchLabel() always replaces label without postfix.
However we should check the label we'd really like to replace with
. @KKyang

tpA = self.states.bpr if bpeMax * vwa < self.states.bpr else bpeMax * vwa
tpALocal = self.states.bpr if tensorParametersA["bpe"] * vwa < self.states.bpr else tensorParametersA["bpe"] * vwa
# This check is to reserve porential usage of VGPRs for gfx12 8-bit code gen
# We should optimize the usage for better performance.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the gfx12 may have potential VGPR usage issue. @cmingch
In general this condition should not include HasWMMA_V1.

localWriteCVTCode.add(VCvtF16toF32(dst=vgpr(vgprTmp+f16Tobf16Idx), src=vgpr(destVgprPrefix + "+%u"%(f16Tobf16Idx))))
localWriteCVTCode.add(VCvtF16toF32(dst=vgpr(vgprTmp+1+f16Tobf16Idx), src=vgpr(destVgprPrefix + "+%u"%(f16Tobf16Idx)),sdwa=sdwa))
localWriteCVTCode.add(VPackF16toB32(dst=vgpr(destVgprPrefix + "+%u"%(f16Tobf16Idx)), src0=vgpr(vgprTmp+f16Tobf16Idx), src1=vgpr(vgprTmp+1+f16Tobf16Idx),
localWriteCVTCode.add(VCvtF16toF32(dst=vgpr(vgprTmp), src=vgpr(destVgprPrefix + "+%u"%(f16Tobf16Idx))))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for Fix LocalWrite vgpr checkout and index bug.

1. Add a VGPR base index definition.
2. Rearange VGPR index order for further optimization.
3. Re-allocate VGPR for tail loop.

Fix 6 potential bugs:
1. DTV will use the Valu VGPRs which is released in the beginning of tail loop.
2. BiasSum Valu VGPR should be used when endSum but is released in the beginning of tail loop.
3. _replaceActBranchLabel() always replaces label without postfix.
   However we should check the label we'd really like to replace with.
4. For DTL, numVgprG2LAllocated is not set so that it will be
   default=-1.
5. Fix G2LA vgpr allocation bug for navi3x.
6. Fix LocalWrite vgpr index bug.

This change is the first step to optimize the VGPR usage in unroll loop.
In general, the VGPRs usage in the unrolled loop is dependent from the
tail. In tail, the VGPR can be used more effectively.
@hcman2 hcman2 merged commit 9f3ea97 into ROCm:develop Feb 20, 2025
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants