Commit 5110ac4

Authored by Giorgi Gvalia (gvalson)
[Offload] Allow CUDA Kernels to use arbitrarily large shared memory (#145963)
Previously, users could not request more than 48 KB of dynamic shared memory on NVIDIA GPUs. Doing so requires setting the function attribute `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES`, which was not present in the code base. This commit adds the ability to set that attribute, allowing users to utilize the full shared-memory capacity of their GPU.

To avoid resetting the function attribute on every launch of the same kernel, we track the current limit (in the variable `MaxDynCGroupMemLimit`) and only set the attribute when the requested amount exceeds that limit. By default, the limit is 48 KB.

Feedback is greatly appreciated, especially around making the new variable mutable. I did this because the `launchImpl` method is const and I cannot modify the variable otherwise.

Co-authored-by: Giorgi Gvalia <[email protected]>
Co-authored-by: Giorgi Gvalia <[email protected]>
Parent commit: 3ea636e

3 files changed: +16 −0 lines

offload/plugins-nextgen/cuda/dynamic_cuda/cuda.cpp (1 addition, 0 deletions)

@@ -31,6 +31,7 @@ DLWRAP(cuDeviceGet, 2)
 DLWRAP(cuDeviceGetAttribute, 3)
 DLWRAP(cuDeviceGetCount, 1)
 DLWRAP(cuFuncGetAttribute, 3)
+DLWRAP(cuFuncSetAttribute, 3)
 
 // Device info
 DLWRAP(cuDeviceGetName, 3)

offload/plugins-nextgen/cuda/dynamic_cuda/cuda.h (2 additions, 0 deletions)

@@ -258,6 +258,7 @@ typedef enum CUdevice_attribute_enum {
 
 typedef enum CUfunction_attribute_enum {
   CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK = 0,
+  CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES = 8,
 } CUfunction_attribute;
 
 typedef enum CUctx_flags_enum {

@@ -295,6 +296,7 @@ CUresult cuDeviceGet(CUdevice *, int);
 CUresult cuDeviceGetAttribute(int *, CUdevice_attribute, CUdevice);
 CUresult cuDeviceGetCount(int *);
 CUresult cuFuncGetAttribute(int *, CUfunction_attribute, CUfunction);
+CUresult cuFuncSetAttribute(CUfunction, CUfunction_attribute, int);
 
 // Device info
 CUresult cuDeviceGetName(char *, int, CUdevice);

offload/plugins-nextgen/cuda/src/rtl.cpp (13 additions, 0 deletions)

@@ -160,6 +160,9 @@ struct CUDAKernelTy : public GenericKernelTy {
 private:
   /// The CUDA kernel function to execute.
   CUfunction Func;
+  /// The maximum amount of dynamic shared memory per thread group. By default,
+  /// this is set to 48 KB.
+  mutable uint32_t MaxDynCGroupMemLimit = 49152;
 };
 
 /// Class wrapping a CUDA stream reference. These are the objects handled by the

@@ -1302,6 +1305,16 @@ Error CUDAKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
   if (GenericDevice.getRPCServer())
     GenericDevice.Plugin.getRPCServer().Thread->notify();
 
+  // In case we require more memory than the current limit.
+  if (MaxDynCGroupMem >= MaxDynCGroupMemLimit) {
+    CUresult AttrResult = cuFuncSetAttribute(
+        Func, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, MaxDynCGroupMem);
+    Plugin::check(
+        AttrResult,
+        "Error in cuLaunchKernel while setting the memory limits: %s");
+    MaxDynCGroupMemLimit = MaxDynCGroupMem;
+  }
+
   CUresult Res = cuLaunchKernel(Func, NumBlocks[0], NumBlocks[1], NumBlocks[2],
                                 NumThreads[0], NumThreads[1], NumThreads[2],
                                 MaxDynCGroupMem, Stream, nullptr, Config);
