[AMDGPU] Generate waterfall for calls with SGPR(inreg) argument #146997


Open · wants to merge 4 commits into base: main

Conversation

Shoreshen (Contributor)

Fixes issue #140780.

Generates a waterfall loop for calls that take an SGPR (inreg) argument whose value comes from a divergent source (e.g. a VGPR).
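
A minimal sketch of the pattern in question, with hypothetical names: the caller's plain i32 argument is divergent (non-inreg arguments of a non-kernel function live in VGPRs), yet the call site passes it inreg. Before this patch, llc rejected such code with "illegal VGPR to SGPR copy"; with it, the call is wrapped in a waterfall loop.

declare hidden void @callee(i32 inreg)

define void @caller(i32 %v) {
  ; %v lives in a VGPR; @callee expects it in an SGPR, so the call is
  ; executed once per unique value of %v across the active lanes.
  call void @callee(i32 inreg %v)
  ret void
}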

@Shoreshen requested a review from Copilot on July 4, 2025 at 04:23
@Shoreshen requested reviews from arsenm, alex-t and shiltian on July 4, 2025 at 04:24
llvmbot (Member) commented on Jul 4, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: None (Shoreshen)

Changes

Fixes issue #140780.

Generates a waterfall loop for calls that take an SGPR (inreg) argument whose value comes from a divergent source (e.g. a VGPR).


Patch is 48.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/146997.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp (+82)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+42-25)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+11)
  • (modified) llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp (+44)
  • (modified) llvm/lib/Target/AMDGPU/SIRegisterInfo.h (+7)
  • (modified) llvm/test/CodeGen/AMDGPU/call-args-inreg-no-sgpr-for-csrspill-xfail.ll (+433-3)
  • (modified) llvm/test/CodeGen/AMDGPU/tail-call-inreg-arguments.error.ll (+203-23)
diff --git a/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp b/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
index 1bf5b4a241780..b929c4e7f70e2 100644
--- a/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
@@ -124,6 +124,13 @@ class SIFixSGPRCopies {
   SmallVector<MachineInstr*, 4> RegSequences;
   SmallVector<MachineInstr*, 4> PHINodes;
   SmallVector<MachineInstr*, 4> S2VCopies;
+  struct V2PysSCopyInfo {
+    bool CanConvert;
+    SmallVector<MachineOperand *> MOs;
+    SmallVector<Register> SGPRs;
+  };
+  DenseMap<MachineInstr *, struct V2PysSCopyInfo> WaterFalls;
+  DenseMap<MachineInstr *, bool> V2SCanErase;
   unsigned NextVGPRToSGPRCopyID = 0;
   MapVector<unsigned, V2SCopyInfo> V2SCopies;
   DenseMap<MachineInstr *, SetVector<unsigned>> SiblingPenalty;
@@ -143,6 +150,7 @@ class SIFixSGPRCopies {
   bool needToBeConvertedToVALU(V2SCopyInfo *I);
   void analyzeVGPRToSGPRCopy(MachineInstr *MI);
   void lowerVGPR2SGPRCopies(MachineFunction &MF);
+  void lowerPysicalSGPRInsts(MachineFunction &MF);
   // Handles copies which source register is:
   // 1. Physical register
   // 2. AGPR
@@ -770,6 +778,7 @@ bool SIFixSGPRCopies::run(MachineFunction &MF) {
     }
   }
 
+  lowerPysicalSGPRInsts(MF);
   lowerVGPR2SGPRCopies(MF);
   // Postprocessing
   fixSCCCopies(MF);
@@ -800,6 +809,8 @@ bool SIFixSGPRCopies::run(MachineFunction &MF) {
   PHINodes.clear();
   S2VCopies.clear();
   PHISources.clear();
+  WaterFalls.clear();
+  V2SCanErase.clear();
 
   return true;
 }
@@ -901,6 +912,37 @@ bool SIFixSGPRCopies::lowerSpecialCase(MachineInstr &MI,
                                       MI, MI.getDebugLoc())) {
       I = std::next(I);
       MI.eraseFromParent();
+    } else if (SrcReg.isVirtual() && TRI->getRegSizeInBits(SrcReg, *MRI) ==
+                                         TRI->getRegSizeInBits(DstReg, *MRI)) {
+      // COPY can be erased if all its uses can be converted to waterfall.
+      if (V2SCanErase.count(&MI) == 0)
+        V2SCanErase[&MI] = true;
+      for (auto UseMI : TRI->findRegUsesFrom(&MI, DstReg, {DstReg}, {})) {
+        // Currently, we only support waterfall on SI_CALL_ISEL.
+        if (UseMI->getOpcode() != AMDGPU::SI_CALL_ISEL) {
+          V2SCanErase[&MI] = false;
+          continue;
+        }
+        // If CALL has one pysical reg used which is not dominated by its COPY
+        // def, we cannot create waterfall on UseMI.
+        // If we cannot create waterfall on UseMI, we cannot erase COPY.
+        if (!MDT->dominates(&MI, UseMI)) {
+          WaterFalls[UseMI].CanConvert = false;
+          V2SCanErase[&MI] = false;
+          continue;
+        }
+        for (unsigned i = 0; i < UseMI->getNumOperands(); ++i) {
+          if (UseMI->getOperand(i).isReg() &&
+              UseMI->getOperand(i).getReg() == DstReg) {
+            MachineOperand *MO = &UseMI->getOperand(i);
+            MO->setReg(SrcReg);
+            if (WaterFalls.count(UseMI) == 0)
+              WaterFalls[UseMI].CanConvert = true;
+            WaterFalls[UseMI].MOs.push_back(MO);
+            WaterFalls[UseMI].SGPRs.push_back(DstReg);
+          }
+        }
+      }
     }
     return true;
   }
@@ -1128,6 +1170,46 @@ void SIFixSGPRCopies::lowerVGPR2SGPRCopies(MachineFunction &MF) {
   }
 }
 
+void SIFixSGPRCopies::lowerPysicalSGPRInsts(MachineFunction &MF) {
+  for (auto &Entry : WaterFalls) {
+    MachineInstr *MI = Entry.first;
+    struct V2PysSCopyInfo Info = Entry.second;
+    if (!Info.CanConvert || Info.MOs.size() == 0 ||
+        Info.SGPRs.size() != Info.MOs.size())
+      continue;
+
+    if (MI->getOpcode() == AMDGPU::SI_CALL_ISEL) {
+      // Move everything between ADJCALLSTACKUP and ADJCALLSTACKDOWN and
+      // following copies, we also need to move copies from and to physical
+      // registers into the loop block.
+      unsigned FrameSetupOpcode = TII->getCallFrameSetupOpcode();
+      unsigned FrameDestroyOpcode = TII->getCallFrameDestroyOpcode();
+
+      // Also move the copies to physical registers into the loop block
+      MachineBasicBlock &MBB = *MI->getParent();
+      MachineBasicBlock::iterator Start(MI);
+      while (Start->getOpcode() != FrameSetupOpcode)
+        --Start;
+      MachineBasicBlock::iterator End(MI);
+      while (End->getOpcode() != FrameDestroyOpcode)
+        ++End;
+
+      // Also include following copies of the return value
+      ++End;
+      while (End != MBB.end() && End->isCopy() && End->getOperand(1).isReg() &&
+             MI->definesRegister(End->getOperand(1).getReg(), TRI))
+        ++End;
+
+      llvm::loadMBUFScalarOperandsFromVGPR(*TII, *MI, Info.MOs, MDT, Start, End,
+                                           Info.SGPRs);
+    }
+
+    for (auto &Entry : V2SCanErase)
+      if (Entry.second)
+        Entry.first->eraseFromParent();
+  }
+}
+
 void SIFixSGPRCopies::fixSCCCopies(MachineFunction &MF) {
   bool IsWave32 = MF.getSubtarget<GCNSubtarget>().isWave32();
   for (MachineBasicBlock &MBB : MF) {
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 2d37451eb32b9..2571da1bb8a68 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -6540,13 +6540,10 @@ void SIInstrInfo::legalizeGenericOperand(MachineBasicBlock &InsertMBB,
 // Emit the actual waterfall loop, executing the wrapped instruction for each
 // unique value of \p ScalarOps across all lanes. In the best case we execute 1
 // iteration, in the worst case we execute 64 (once per lane).
-static void
-emitLoadScalarOpsFromVGPRLoop(const SIInstrInfo &TII,
-                              MachineRegisterInfo &MRI,
-                              MachineBasicBlock &LoopBB,
-                              MachineBasicBlock &BodyBB,
-                              const DebugLoc &DL,
-                              ArrayRef<MachineOperand *> ScalarOps) {
+static void emitLoadScalarOpsFromVGPRLoop(
+    const SIInstrInfo &TII, MachineRegisterInfo &MRI, MachineBasicBlock &LoopBB,
+    MachineBasicBlock &BodyBB, const DebugLoc &DL,
+    ArrayRef<MachineOperand *> ScalarOps, SmallVector<Register> PhySGPRs = {}) {
   MachineFunction &MF = *LoopBB.getParent();
   const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
   const SIRegisterInfo *TRI = ST.getRegisterInfo();
@@ -6561,7 +6558,7 @@ emitLoadScalarOpsFromVGPRLoop(const SIInstrInfo &TII,
 
   MachineBasicBlock::iterator I = LoopBB.begin();
   Register CondReg;
-
+  int Idx = 0;
   for (MachineOperand *ScalarOp : ScalarOps) {
     unsigned RegSize = TRI->getRegSizeInBits(ScalarOp->getReg(), MRI);
     unsigned NumSubRegs = RegSize / 32;
@@ -6591,7 +6588,16 @@ emitLoadScalarOpsFromVGPRLoop(const SIInstrInfo &TII,
       }
 
       // Update ScalarOp operand to use the SGPR ScalarOp.
-      ScalarOp->setReg(CurReg);
+      if (PhySGPRs.empty())
+        ScalarOp->setReg(CurReg);
+      else {
+        // Insert into the same block of use
+        BuildMI(*ScalarOp->getParent()->getParent(),
+                ScalarOp->getParent()->getIterator(), DL, TII.get(AMDGPU::COPY),
+                PhySGPRs[Idx])
+            .addReg(CurReg);
+        ScalarOp->setReg(PhySGPRs[Idx]);
+      }
       ScalarOp->setIsKill();
     } else {
       SmallVector<Register, 8> ReadlanePieces;
@@ -6660,9 +6666,18 @@ emitLoadScalarOpsFromVGPRLoop(const SIInstrInfo &TII,
       }
 
       // Update ScalarOp operand to use the SGPR ScalarOp.
-      ScalarOp->setReg(SScalarOp);
+      if (PhySGPRs.empty())
+        ScalarOp->setReg(SScalarOp);
+      else {
+        BuildMI(*ScalarOp->getParent()->getParent(),
+                ScalarOp->getParent()->getIterator(), DL, TII.get(AMDGPU::COPY),
+                PhySGPRs[Idx])
+            .addReg(SScalarOp);
+        ScalarOp->setReg(PhySGPRs[Idx]);
+      }
       ScalarOp->setIsKill();
     }
+    Idx++;
   }
 
   Register SaveExec = MRI.createVirtualRegister(BoolXExecRC);
@@ -6686,12 +6701,13 @@ emitLoadScalarOpsFromVGPRLoop(const SIInstrInfo &TII,
 // Build a waterfall loop around \p MI, replacing the VGPR \p ScalarOp register
 // with SGPRs by iterating over all unique values across all lanes.
 // Returns the loop basic block that now contains \p MI.
-static MachineBasicBlock *
-loadMBUFScalarOperandsFromVGPR(const SIInstrInfo &TII, MachineInstr &MI,
-                               ArrayRef<MachineOperand *> ScalarOps,
-                               MachineDominatorTree *MDT,
-                               MachineBasicBlock::iterator Begin = nullptr,
-                               MachineBasicBlock::iterator End = nullptr) {
+MachineBasicBlock *llvm::loadMBUFScalarOperandsFromVGPR(
+    const SIInstrInfo &TII, MachineInstr &MI,
+    ArrayRef<MachineOperand *> ScalarOps, MachineDominatorTree *MDT,
+    MachineBasicBlock::iterator Begin, MachineBasicBlock::iterator End,
+    SmallVector<Register> PhySGPRs) {
+  assert((PhySGPRs.empty() || PhySGPRs.size() == ScalarOps.size()) &&
+         "Physical SGPRs must be empty or match the number of scalar operands");
   MachineBasicBlock &MBB = *MI.getParent();
   MachineFunction &MF = *MBB.getParent();
   const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
@@ -6777,7 +6793,8 @@ loadMBUFScalarOperandsFromVGPR(const SIInstrInfo &TII, MachineInstr &MI,
     }
   }
 
-  emitLoadScalarOpsFromVGPRLoop(TII, MRI, *LoopBB, *BodyBB, DL, ScalarOps);
+  emitLoadScalarOpsFromVGPRLoop(TII, MRI, *LoopBB, *BodyBB, DL, ScalarOps,
+                                PhySGPRs);
 
   MachineBasicBlock::iterator First = RemainderBB->begin();
   // Restore SCC
@@ -6998,13 +7015,13 @@ SIInstrInfo::legalizeOperands(MachineInstr &MI,
                                     : AMDGPU::OpName::srsrc;
     MachineOperand *SRsrc = getNamedOperand(MI, RSrcOpName);
     if (SRsrc && !RI.isSGPRClass(MRI.getRegClass(SRsrc->getReg())))
-      CreatedBB = loadMBUFScalarOperandsFromVGPR(*this, MI, {SRsrc}, MDT);
+      CreatedBB = llvm::loadMBUFScalarOperandsFromVGPR(*this, MI, {SRsrc}, MDT);
 
     AMDGPU::OpName SampOpName =
         isMIMG(MI) ? AMDGPU::OpName::ssamp : AMDGPU::OpName::samp;
     MachineOperand *SSamp = getNamedOperand(MI, SampOpName);
     if (SSamp && !RI.isSGPRClass(MRI.getRegClass(SSamp->getReg())))
-      CreatedBB = loadMBUFScalarOperandsFromVGPR(*this, MI, {SSamp}, MDT);
+      CreatedBB = llvm::loadMBUFScalarOperandsFromVGPR(*this, MI, {SSamp}, MDT);
 
     return CreatedBB;
   }
@@ -7032,8 +7049,8 @@ SIInstrInfo::legalizeOperands(MachineInstr &MI,
       while (End != MBB.end() && End->isCopy() && End->getOperand(1).isReg() &&
              MI.definesRegister(End->getOperand(1).getReg(), /*TRI=*/nullptr))
         ++End;
-      CreatedBB =
-          loadMBUFScalarOperandsFromVGPR(*this, MI, {Dest}, MDT, Start, End);
+      CreatedBB = llvm::loadMBUFScalarOperandsFromVGPR(*this, MI, {Dest}, MDT,
+                                                       Start, End);
     }
   }
 
@@ -7215,11 +7232,11 @@ SIInstrInfo::legalizeOperands(MachineInstr &MI,
       // Legalize a VGPR Rsrc and soffset together.
       if (!isSoffsetLegal) {
         MachineOperand *Soffset = getNamedOperand(MI, AMDGPU::OpName::soffset);
-        CreatedBB =
-            loadMBUFScalarOperandsFromVGPR(*this, MI, {Rsrc, Soffset}, MDT);
+        CreatedBB = llvm::loadMBUFScalarOperandsFromVGPR(*this, MI,
+                                                         {Rsrc, Soffset}, MDT);
         return CreatedBB;
       }
-      CreatedBB = loadMBUFScalarOperandsFromVGPR(*this, MI, {Rsrc}, MDT);
+      CreatedBB = llvm::loadMBUFScalarOperandsFromVGPR(*this, MI, {Rsrc}, MDT);
       return CreatedBB;
     }
   }
@@ -7227,7 +7244,7 @@ SIInstrInfo::legalizeOperands(MachineInstr &MI,
   // Legalize a VGPR soffset.
   if (!isSoffsetLegal) {
     MachineOperand *Soffset = getNamedOperand(MI, AMDGPU::OpName::soffset);
-    CreatedBB = loadMBUFScalarOperandsFromVGPR(*this, MI, {Soffset}, MDT);
+    CreatedBB = llvm::loadMBUFScalarOperandsFromVGPR(*this, MI, {Soffset}, MDT);
     return CreatedBB;
   }
   return CreatedBB;
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index 01dd3c9f4119e..77b0713275834 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -1554,6 +1554,17 @@ bool execMayBeModifiedBeforeAnyUse(const MachineRegisterInfo &MRI,
                                    Register VReg,
                                    const MachineInstr &DefMI);
 
+/// Build a waterfall loop around \p MI, replacing the VGPR \p ScalarOp register
+/// with SGPRs by iterating over all unique values across all lanes.
+/// Returns the loop basic block that now contains \p MI.
+MachineBasicBlock *
+loadMBUFScalarOperandsFromVGPR(const SIInstrInfo &TII, MachineInstr &MI,
+                               ArrayRef<MachineOperand *> ScalarOps,
+                               MachineDominatorTree *MDT,
+                               MachineBasicBlock::iterator Begin = nullptr,
+                               MachineBasicBlock::iterator End = nullptr,
+                               SmallVector<Register> PhySGPRs = {});
+
 namespace AMDGPU {
 
   LLVM_READONLY
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
index 6754be1a0b619..6fc9e3a313ce4 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
@@ -4063,6 +4063,50 @@ SIRegisterInfo::getNumUsedPhysRegs(const MachineRegisterInfo &MRI,
   return 0;
 }
 
+SmallVector<MachineInstr *>
+SIRegisterInfo::findRegUsesFrom(MachineInstr *StartMI, Register TrgReg,
+                                const DenseSet<Register> &StopAtDefs,
+                                const DenseSet<unsigned> &Opcodes) const {
+  DenseSet<MachineInstr *> Visited;
+  SmallVector<MachineInstr *> Stack;
+
+  Stack.push_back(&*std::next(StartMI->getIterator()));
+
+  SmallVector<MachineInstr *> Uses;
+  while (!Stack.empty()) {
+    MachineInstr *I = Stack.back();
+    Stack.pop_back();
+    if (!Visited.insert(I).second)
+      continue;
+
+    MachineBasicBlock *MBB = I->getParent();
+    MachineBasicBlock::iterator It = I->getIterator();
+    MachineBasicBlock::iterator E = MBB->end();
+
+    bool DefFound = false;
+    while (It != E) {
+      if (It->readsRegister(TrgReg, this) && &*It != StartMI)
+        // Only add to Uses if the opcode is in the allowed set
+        if (Opcodes.empty() || Opcodes.count(It->getOpcode()))
+          Uses.push_back(&*It);
+      for (auto DefReg : StopAtDefs)
+        if (It->findRegisterDefOperand(DefReg, this)) {
+          DefFound = true;
+          break;
+        }
+      if (DefFound)
+        break;
+      It++;
+    }
+    if (DefFound)
+      continue;
+    // Push successors onto the stack to visit next.
+    for (auto *Succ : MBB->successors())
+      Stack.push_back(&*(Succ->begin()));
+  }
+  return Uses;
+}
+
 SmallVector<StringLiteral>
 SIRegisterInfo::getVRegFlagsOfReg(Register Reg,
                                   const MachineFunction &MF) const {
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.h b/llvm/lib/Target/AMDGPU/SIRegisterInfo.h
index 06a7a17b0246b..eca310d46f3e4 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.h
@@ -15,6 +15,7 @@
 #define LLVM_LIB_TARGET_AMDGPU_SIREGISTERINFO_H
 
 #include "llvm/ADT/BitVector.h"
+#include "llvm/ADT/DenseSet.h"
 
 #define GET_REGINFO_HEADER
 #include "AMDGPUGenRegisterInfo.inc"
@@ -490,6 +491,12 @@ class SIRegisterInfo final : public AMDGPUGenRegisterInfo {
   unsigned getNumUsedPhysRegs(const MachineRegisterInfo &MRI,
                               const TargetRegisterClass &RC) const;
 
+  // \returns list of MI uses defined physical reg by a given \p MI.
+  SmallVector<MachineInstr *>
+  findRegUsesFrom(MachineInstr *StartMI, Register TrgReg,
+                  const DenseSet<Register> &StopAtDefs,
+                  const DenseSet<unsigned> &Opcodes) const;
+
   std::optional<uint8_t> getVRegFlagValue(StringRef Name) const override {
     return Name == "WWM_REG" ? AMDGPU::VirtRegFlag::WWM_REG
                              : std::optional<uint8_t>{};
diff --git a/llvm/test/CodeGen/AMDGPU/call-args-inreg-no-sgpr-for-csrspill-xfail.ll b/llvm/test/CodeGen/AMDGPU/call-args-inreg-no-sgpr-for-csrspill-xfail.ll
index 34f4476f7fd6a..27d7f117711f6 100644
--- a/llvm/test/CodeGen/AMDGPU/call-args-inreg-no-sgpr-for-csrspill-xfail.ll
+++ b/llvm/test/CodeGen/AMDGPU/call-args-inreg-no-sgpr-for-csrspill-xfail.ll
@@ -1,22 +1,452 @@
-; RUN: not llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs=0 -filetype=null %s 2>&1 | FileCheck -enable-var-scope %s
-
-; CHECK: illegal VGPR to SGPR copy
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -enable-var-scope %s
 
 declare hidden void @external_void_func_a15i32_inreg([15 x i32] inreg) #0
 declare hidden void @external_void_func_a16i32_inreg([16 x i32] inreg) #0
 declare hidden void @external_void_func_a15i32_inreg_i32_inreg([15 x i32] inreg, i32 inreg) #0
 
 define void @test_call_external_void_func_a15i32_inreg([15 x i32] inreg %arg0) #0 {
+; CHECK-LABEL: test_call_external_void_func_a15i32_inreg:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    s_mov_b32 s40, s33
+; CHECK-NEXT:    s_mov_b32 s33, s32
+; CHECK-NEXT:    s_or_saveexec_b64 s[42:43], -1
+; CHECK-NEXT:    buffer_store_dword v40, off, s[0:3], s33 ; 4-byte Folded Spill
+; CHECK-NEXT:    s_mov_b64 exec, s[42:43]
+; CHECK-NEXT:    v_writelane_b32 v40, s40, 32
+; CHECK-NEXT:    v_writelane_b32 v40, s30, 0
+; CHECK-NEXT:    v_writelane_b32 v40, s31, 1
+; CHECK-NEXT:    v_writelane_b32 v40, s34, 2
+; CHECK-NEXT:    v_writelane_b32 v40, s35, 3
+; CHECK-NEXT:    v_writelane_b32 v40, s36, 4
+; CHECK-NEXT:    v_writelane_b32 v40, s37, 5
+; CHECK-NEXT:    v_writelane_b32 v40, s38, 6
+; CHECK-NEXT:    v_writelane_b32 v40, s39, 7
+; CHECK-NEXT:    v_writelane_b32 v40, s48, 8
+; CHECK-NEXT:    v_writelane_b32 v40, s49, 9
+; CHECK-NEXT:    v_writelane_b32 v40, s50, 10
+; CHECK-NEXT:    v_writelane_b32 v40, s51, 11
+; CHECK-NEXT:    v_writelane_b32 v40, s52, 12
+; CHECK-NEXT:    v_writelane_b32 v40, s53, 13
+; CHECK-NEXT:    v_writelane_b32 v40, s54, 14
+; CHECK-NEXT:    v_writelane_b32 v40, s55, 15
+; CHECK-NEXT:    v_writelane_b32 v40, s64, 16
+; CHECK-NEXT:    v_writelane_b32 v40, s65, 17
+; CHECK-NEXT:    v_writelane_b32 v40, s66, 18
+; CHECK-NEXT:    v_writelane_b32 v40, s67, 19
+; CHECK-NEXT:    v_writelane_b32 v40, s68, 20
+; CHECK-NEXT:    v_writelane_b32 v40, s69, 21
+; CHECK-NEXT:    v_writelane_b32 v40, s70, 22
+; CHECK-NEXT:    v_writelane_b32 v40, s71, 23
+; CHECK-NEXT:    v_writelane_b32 v40, s80, 24
+; CHECK-NEXT:    v_writelane_b32 v40, s81, 25
+; CHECK-NEXT:    v_writelane_b32 v40, s82, 26
+; CHECK-NEXT:    v_writelane_b32 v40, s83, 27
+; CHECK-NEXT:    v_writelane_b32 v40, s84, 28
+; CHECK-NEXT:    v_writelane_b32 v40, s85, 29
+; CHECK-NEXT:    v_writelane_b32 v40, s86, 30
+; CHECK-NEXT:    s_mov_b32 s50, s29
+; CHECK-NEXT:    s_mov_b32 s51, s28
+; CHECK-NEXT:    s_mov_b32 s52, s27
+; CHECK-NEXT:    s_mov_b32 s53, s26
+; CHECK-NEXT:    s_mov_b32 s54, s25
+; CHECK-NEXT:    s_mov_b32 s55, s24
+; CHECK-NEXT:    s_mov_b32 s64, s23
+; CHECK-NEXT:    s_mov_b32 s65, s22
+; CHECK-NEXT:    s_mov_b32 s66, s21
+; CHECK-NEXT:    s_mov_b32 s67, s20
+; CHECK-NEXT:    s_mov_b32 s68, s19
+; CHECK-NEXT:    s_mov_b32 s69, s18
+; CHECK-NEXT:    s_mov_b32 s70, s17
+; CHECK-NEXT:    s_mov_b32 s71, s16
+; CHECK-NEXT:    s_mov_b32 s80, s15
+; CHECK-NEXT:    s_mov_b32 s81, s14
+; CHECK-NEXT:    s_mov_b32 s82, s13
+; CHECK-NEXT:    s_mov_b32 s83, s12
+; CHECK-NEXT:    s_mov_b64 s[34:35], s[10:11]
+; CHECK-NEXT:    s_mov_b64 s[36:37], s[8:9]
+; CHECK-NEXT:    s_mov_b64 s[38:39], s[6:7]
+; CHECK-NEXT:    s_mov_b64 s[48:49], s[4:5]
+; CHECK-NEXT:    s_mov_b64 s[84:85], exec
+; CHECK-NEXT:    s_addk_i32 s32, 0x400
+; CHECK-NEXT:    v_writelane_b32 v40, s87, 31
+; CHECK-NEXT:  .LBB0_1: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    v_readfirstlane_b32 s26, v0
+; CHECK-NEXT:    v_cmp_eq_u32_e32 vcc, s26, v0
+; CHECK-NEXT:    s_and_saveexec_b64 s[86:87], vcc
+; CHECK-NEXT:    s_getpc_b64 s[28:29]
+; CHECK-NEXT:    s_add_u32 s28, s28, external_void_func_a15i32_...
[truncated]

Copilot AI (Contributor) left a comment


Pull Request Overview

This PR fixes issue #140780 by generating a waterfall loop for calls that use an SGPR (inreg) argument derived from a divergent VGPR. A reduced example of the new behavior follows the bullets below.

  • Tests are updated to expect waterfall loops instead of illegal copy errors.
  • New APIs in SIRegisterInfo/SIInstrInfo allow passing physical SGPRs into waterfall loops.
  • The SIFixSGPRCopies pass is extended to detect and lower physical SGPR copies into waterfall loops.
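
For concreteness, here is a reduced FileCheck-style test in the spirit of the updated .ll files (the function names are illustrative, and the CHECK lines name only the key loop instructions, which appear in the updated test output above):

; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s | FileCheck %s

declare hidden void @external_void_func_i32_inreg(i32 inreg)

; The loaded value is divergent, so the call is waterfalled: read one lane's
; value with v_readfirstlane, mask exec to the matching lanes, make the call,
; and loop until every lane has been serviced.
; CHECK-LABEL: waterfall_inreg_call:
; CHECK: v_readfirstlane_b32
; CHECK: v_cmp_eq_u32
; CHECK: s_and_saveexec_b64
; CHECK: s_swappc_b64
define void @waterfall_inreg_call(ptr addrspace(1) %p) {
  %v = load i32, ptr addrspace(1) %p
  call void @external_void_func_i32_inreg(i32 inreg %v)
  ret void
}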

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:

  • llvm/test/CodeGen/AMDGPU/tail-call-inreg-arguments.error.ll: updated RUN/ERR lines and expected waterfall checks
  • llvm/test/CodeGen/AMDGPU/call-args-inreg-no-sgpr-for-csrspill-xfail.ll: switched from an expected failure to RUN+FileCheck output
  • llvm/lib/Target/AMDGPU/SIRegisterInfo.h/.cpp: added the findRegUsesFrom API with DenseSet filters
  • llvm/lib/Target/AMDGPU/SIInstrInfo.h/.cpp: extended the loadMBUFScalarOperandsFromVGPR signature with PhySGPRs and updated its callers
  • llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp: added lowerPysicalSGPRInsts to identify and lower physical SGPR copies into waterfall loops
Comments suppressed due to low confidence (3)

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp:127

  • The struct name V2PysSCopyInfo appears to have a typo (Pys vs Phys). Consider renaming to V2PhysSCopyInfo for clarity.
  struct V2PysSCopyInfo {

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp:1173

  • The method name lowerPysicalSGPRInsts has a typo (Pysical vs Physical). Consider renaming to lowerPhysicalSGPRInsts to match standard spelling.
void SIFixSGPRCopies::lowerPysicalSGPRInsts(MachineFunction &MF) {

llvm/lib/Target/AMDGPU/SIRegisterInfo.h:496

  • [nitpick] The new findRegUsesFrom API would benefit from a more detailed doc comment explaining the meaning of StopAtDefs and Opcodes parameters.
  findRegUsesFrom(MachineInstr *StartMI, Register TrgReg,

DenseSet<MachineInstr *> Visited;
SmallVector<MachineInstr *> Stack;

Stack.push_back(&*std::next(StartMI->getIterator()));
Copilot AI commented on Jul 4, 2025

If StartMI is the last instruction in its block, std::next(StartMI->getIterator()) yields end(), and dereferencing it is undefined. Add a check to avoid pushing past-the-end.

Suggested change:
-Stack.push_back(&*std::next(StartMI->getIterator()));
+MachineBasicBlock *MBB = StartMI->getParent();
+MachineBasicBlock::iterator NextIt = std::next(StartMI->getIterator());
+if (NextIt != MBB->end())
+  Stack.push_back(&*NextIt);


@@ -6777,7 +6793,8 @@ loadMBUFScalarOperandsFromVGPR(const SIInstrInfo &TII, MachineInstr &MI,
     }
   }
 
-  emitLoadScalarOpsFromVGPRLoop(TII, MRI, *LoopBB, *BodyBB, DL, ScalarOps);
+  emitLoadScalarOpsFromVGPRLoop(TII, MRI, *LoopBB, *BodyBB, DL, ScalarOps,
+                                PhySGPRs);
Copilot AI commented on Jul 4, 2025

Inserting a waterfall loop alters control flow and can affect debug info and profiling data. Ensure debug locations and profiling intrinsics are preserved or updated for accurate performance analysis.

