Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for cl_mem_device_address_EXT #742

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

pvelesko
Copy link

still WIP, sample works but test fails though more tests fail even on main

╭─pvelesko@cupcake ~/clvk/test-dev-buffer/build ‹deviceAddr●› 
╰─$ ./device_ptr_test
Platform 0 has 4 device(s):
        0: Intel(R) Graphics (RPL-S)
        1: AMD Radeon VII (RADV VEGA20)
        2: Intel(R) Arc(tm) A770 Graphics (DG2)
        3: llvmpipe (LLVM 15.0.7, 256 bits)
Select platform index: 0
Select device index: 0

Running kernel test with device address extension...
Device supports cl_ext_buffer_device_address extension
Successfully obtained device pointer: 0xfffffffefff80000
Computation successful
All tests completed successfully

Copy link
Owner

@kpet kpet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution! I have had a first look at this change. I have also made a few comments on the proposed extension specification, which I don't think we can really consider it settled/stable yet. Happy to keep iterating on both.



#endif // cl_ext_buffer_device_address

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have discussed releasing this extension with the Khronos OpenCL working group. The authors seem motivated to make it happen. In the meantime, I suggest you move all definitions that will be provided by the headers to src/cl_headers.hpp.

@@ -303,6 +328,7 @@ static const std::unordered_map<std::string, void*> gExtensionEntrypoints = {
EXTENSION_ENTRYPOINT(clCreateCommandQueueWithPropertiesKHR),
EXTENSION_ENTRYPOINT(clGetKernelSuggestedLocalWorkSizeKHR),
{"clGetKernelSubGroupInfoKHR", FUNC_PTR(clGetKernelSubGroupInfo)},
{"clSetKernelArgDevicePointerEXT", FUNC_PTR(clSetKernelArgDevicePointerEXT_fn)},
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{"clSetKernelArgDevicePointerEXT", FUNC_PTR(clSetKernelArgDevicePointerEXT_fn)},
{"clSetKernelArgDevicePointerEXT", FUNC_PTR(clSetKernelArgDevicePointerEXT)},

For consistency with other extensions. You could then just use EXTENSION_ENTRYPOINT I think.

ret = CL_INVALID_OPERATION;
break;
}
val_sizet = buffer->device_address();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec defines a new type that aliases cl_ulong which, contrary to size_t is guaranteed to have a size of 64 bits.

#endif

typedef cl_ulong cl_mem_device_address_EXT;

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move these definitions to cl_headers.hpp for now.

@@ -47,6 +47,9 @@ struct cvk_platform;

struct cvk_device : public _cl_device_id,
object_magic_header<object_magic::device> {
/// Map for storing device pointers to buffer pointers
/// Support cl_ext_buffer_device_address
std::unordered_map<void*, void*> device_to_buffer_map;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those pointers would only be valid for a given context. I think we probably want to keep this state in cvk_context. Please use the specific types instead of void*.

@@ -626,6 +626,7 @@ void cvk_device::build_extension_ils_list() {
// Add always supported extensions
MAKE_NAME_VERSION(1, 0, 0, "cl_khr_extended_versioning"),
MAKE_NAME_VERSION(1, 0, 0, "cl_khr_create_command_queue"),
MAKE_NAME_VERSION(1, 0, 0, "cl_ext_buffer_device_address"),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current spec drafts does not define 1.0.0, only 0.1.0, 0.2.0, and 0.3.0.


std::cout << "All tests completed successfully\n";
return 0;
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing the intention is to move all the test code to tests/api/buffer_device_address.cpp, isn't it?

// device pointer found in map, swapping the buffer pointer
auto buffer_ptr_raw = it->second;
auto buffer_ptr = reinterpret_cast<cvk_buffer*>(buffer_ptr_raw);
m_kernel_resources[arg.binding] = buffer_ptr;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break the existing argument setting for buffers. It might be easier/cleaner to introduce a new cvk_kernel::set_arg_device_address function (that would have to also mark the argument as set as done by line 361).

Comment on lines +29 to +41
// First check if device supports the extension
size_t ext_size;
GetDeviceInfo(CL_DEVICE_EXTENSIONS, 0, nullptr, &ext_size);

std::vector<char> extensions(ext_size);
GetDeviceInfo(CL_DEVICE_EXTENSIONS, ext_size, extensions.data(), nullptr);

bool hasBufferDeviceAddress =
std::string(extensions.data()).find("cl_ext_buffer_device_address") != std::string::npos;

if (!hasBufferDeviceAddress) {
GTEST_SKIP() << "Device does not support cl_ext_buffer_device_address extension";
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// First check if device supports the extension
size_t ext_size;
GetDeviceInfo(CL_DEVICE_EXTENSIONS, 0, nullptr, &ext_size);
std::vector<char> extensions(ext_size);
GetDeviceInfo(CL_DEVICE_EXTENSIONS, ext_size, extensions.data(), nullptr);
bool hasBufferDeviceAddress =
std::string(extensions.data()).find("cl_ext_buffer_device_address") != std::string::npos;
if (!hasBufferDeviceAddress) {
GTEST_SKIP() << "Device does not support cl_ext_buffer_device_address extension";
}
// First check if device supports the extension
REQUIRE_EXTENSION("cl_ext_buffer_device_address");

This is being introduced by #748 for another test.

@pvelesko
Copy link
Author

pvelesko commented Dec 4, 2024

@kpet thank you for the review, I'll work on addressing your comments. Meanwhile, this is what I get when I try to execute an example:

warning: overriding the module target triple with spir-unknown-unknown
warning: Linking two modules of different data layouts: 'Unknown buffer' is 'e-p:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-G1' whereas 'clvk-1ArlBH/source.bc' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-G1'

clspv: /space/pvelesko/clvk/external/clspv/third_party/llvm/llvm/lib/IR/Value.cpp:507: void llvm::Value::doRAUW(llvm::Value*, llvm::Value::ReplaceMetadataUses): Assertion `New->getType() == getType() && "replaceAllUses of value with new value of different type!"' failed.
Aborted (core dumped)

Error processing hip-spirv-58abf1.spv: Failed to build program

I'm not an LLVM expert but just wanted to check with you first before I investigate further.

@rjodinchr
Copy link
Contributor

Can you share the kernel source and the command line? Everything will be in clvk's log, that you would get running with:

CLVK_LOG=4
CLVK_LOG_DEST=file:log.txt

then upload log.txt here

@pvelesko
Copy link
Author

pvelesko commented Dec 4, 2024

The kernel source is a HIP matrix multiplication @rjodinchr
or do you mean the SPIR-V?

@rjodinchr
Copy link
Contributor

The error message you have posted is about clspv not being able to generate the SPIR-V shader from the CL kernel.
Please provide the log from clvk. It will contain everything we need to reproduce and fix the issue.
Thanks

@pvelesko
Copy link
Author

pvelesko commented Dec 4, 2024

@rjodinchr
log.txt

@rjodinchr
Copy link
Contributor

Oh, I understand what you meant by:

or do you mean the SPIR-V?

This application is using OpenCL SPIR-V source. This was not clear to me as SPIR-V has 2 variant. The OpenCL one and the Vulkan one. While the goal of clvk is to compile whatever CL source it gets to Vulkan SPIR-V, it can take OpenCL SPIR-V as the input.

But the log does not contain the OpenCL SPIR-V inputs. Could you share the sources then (I see two calls to clCraeteProgramWithIL, so both will be needed).

@pvelesko
Copy link
Author

pvelesko commented Dec 5, 2024

Hmm only one got dumped for me:

; SPIR-V
; Version: 1.1
; Generator: Khronos LLVM/SPIR-V Translator; 14
; Bound: 145
; Schema: 0
               OpCapability Addresses
               OpCapability Linkage
               OpCapability Kernel
               OpCapability Int64
               OpCapability Int8
          %1 = OpExtInstImport "OpenCL.std"
          %2 = OpExtInstImport "OpenCL.DebugInfo.100"
               OpMemoryModel Physical64 OpenCL
               OpEntryPoint Kernel %80 "_Z12gpuMatrixMulPKfS0_Pfjjj" %__spirv_BuiltInWorkgroupSize %__spirv_BuiltInWorkgroupId %__spirv_BuiltInLocalInvocationId
         %89 = OpString "/space/pvelesko/chipStar/main/samples/0_MatrixMultiply/MatrixMultiply.cpp"
         %93 = OpString "gpuMatrixMul"
         %94 = OpString ""
         %97 = OpString "__hip_get_block_dim_x"
         %98 = OpString "/space/pvelesko/chipStar/main/include/hip/spirv_hip.hh"
        %101 = OpString "__get_x"
        %103 = OpString "__hip_get_block_idx_x"
        %106 = OpString "__hip_get_thread_idx_x"
        %109 = OpString "__hip_get_block_dim_y"
        %111 = OpString "__get_y"
        %113 = OpString "__hip_get_block_idx_y"
        %116 = OpString "__hip_get_thread_idx_y"
               OpSource OpenCL_C 200000
               OpName %__spirv_BuiltInWorkgroupSize "__spirv_BuiltInWorkgroupSize"
               OpName %__spirv_BuiltInWorkgroupId "__spirv_BuiltInWorkgroupId"
               OpName %__spirv_BuiltInLocalInvocationId "__spirv_BuiltInLocalInvocationId"
               OpName %A_coerce "A.coerce"
               OpName %B_coerce "B.coerce"
               OpName %C_coerce "C.coerce"
               OpName %M "M"
               OpName %N "N"
               OpName %K "K"
               OpName %entry "entry"
               OpName %for_body_lr_ph "for.body.lr.ph"
               OpName %for_body "for.body"
               OpName %for_cond_cleanup_loopexit "for.cond.cleanup.loopexit"
               OpName %for_cond_cleanup "for.cond.cleanup"
               OpName %call_i36 "call.i36"
               OpName %conv_i "conv.i"
               OpName %call_i "call.i"
               OpName %conv_i37 "conv.i37"
               OpName %mul "mul"
               OpName %call_i38 "call.i38"
               OpName %conv_i39 "conv.i39"
               OpName %add "add"
               OpName %call_i40 "call.i40"
               OpName %conv_i41 "conv.i41"
               OpName %call_i42 "call.i42"
               OpName %conv_i43 "conv.i43"
               OpName %mul8 "mul8"
               OpName %call_i44 "call.i44"
               OpName %conv_i45 "conv.i45"
               OpName %add10 "add10"
               OpName %cmp46_not "cmp46.not"
               OpName %mul13 "mul13"
               OpName %add18 "add18"
               OpName %acc_0_lcssa "acc.0.lcssa"
               OpName %mul19 "mul19"
               OpName %add20 "add20"
               OpName %idxprom21 "idxprom21"
               OpName %arrayidx22 "arrayidx22"
               OpName %inc "inc"
               OpName %k_048 "k.048"
               OpName %acc_047 "acc.047"
               OpName %mul11 "mul11"
               OpName %add12 "add12"
               OpName %idxprom "idxprom"
               OpName %arrayidx "arrayidx"
               OpName %add14 "add14"
               OpName %idxprom15 "idxprom15"
               OpName %arrayidx16 "arrayidx16"
               OpName %mul17 "mul17"
               OpName %cmp "cmp"
               OpName %A_coerce_0 "A.coerce"
               OpName %B_coerce_0 "B.coerce"
               OpName %C_coerce_0 "C.coerce"
               OpName %M_0 "M"
               OpName %N_0 "N"
               OpName %K_0 "K"
               OpModuleProcessed "Debug info producer: clang version 18.1.5 (https://github.com/CHIP-SPV/llvm-project.git 5c39d7d1aa6e54a9c8df41002d419c398ec8830c)"
               OpDecorate %__spirv_BuiltInWorkgroupSize LinkageAttributes "__spirv_BuiltInWorkgroupSize" Import
               OpDecorate %__spirv_BuiltInWorkgroupSize Constant
               OpDecorate %__spirv_BuiltInWorkgroupSize BuiltIn WorkgroupSize
               OpDecorate %__spirv_BuiltInWorkgroupId LinkageAttributes "__spirv_BuiltInWorkgroupId" Import
               OpDecorate %__spirv_BuiltInWorkgroupId Constant
               OpDecorate %__spirv_BuiltInWorkgroupId BuiltIn WorkgroupId
               OpDecorate %__spirv_BuiltInLocalInvocationId LinkageAttributes "__spirv_BuiltInLocalInvocationId" Import
               OpDecorate %__spirv_BuiltInLocalInvocationId Constant
               OpDecorate %__spirv_BuiltInLocalInvocationId BuiltIn LocalInvocationId
               OpDecorate %A_coerce FuncParamAttr NoAlias
               OpDecorate %A_coerce FuncParamAttr NoCapture
               OpDecorate %A_coerce FuncParamAttr NoWrite
               OpDecorate %B_coerce FuncParamAttr NoAlias
               OpDecorate %B_coerce FuncParamAttr NoCapture
               OpDecorate %B_coerce FuncParamAttr NoWrite
               OpDecorate %C_coerce FuncParamAttr NoAlias
               OpDecorate %C_coerce FuncParamAttr NoCapture
               OpDecorate %A_coerce_0 FuncParamAttr NoAlias
               OpDecorate %A_coerce_0 FuncParamAttr NoCapture
               OpDecorate %A_coerce_0 FuncParamAttr NoWrite
               OpDecorate %B_coerce_0 FuncParamAttr NoAlias
               OpDecorate %B_coerce_0 FuncParamAttr NoCapture
               OpDecorate %B_coerce_0 FuncParamAttr NoWrite
               OpDecorate %C_coerce_0 FuncParamAttr NoAlias
               OpDecorate %C_coerce_0 FuncParamAttr NoCapture
      %uchar = OpTypeInt 8 0
      %ulong = OpTypeInt 64 0
       %uint = OpTypeInt 32 0
    %uchar_1 = OpConstant %uchar 1
     %uint_0 = OpConstant %uint 0
     %uint_1 = OpConstant %uint 1
%_ptr_CrossWorkgroup_uchar = OpTypePointer CrossWorkgroup %uchar
    %v3ulong = OpTypeVector %ulong 3
%_ptr_Input_v3ulong = OpTypePointer Input %v3ulong
       %void = OpTypeVoid
      %float = OpTypeFloat 32
%_ptr_CrossWorkgroup_float = OpTypePointer CrossWorkgroup %float
         %17 = OpTypeFunction %void %_ptr_CrossWorkgroup_float %_ptr_CrossWorkgroup_float %_ptr_CrossWorkgroup_float %uint %uint %uint
       %bool = OpTypeBool
%__spirv_BuiltInWorkgroupSize = OpVariable %_ptr_Input_v3ulong Input
%__spirv_BuiltInWorkgroupId = OpVariable %_ptr_Input_v3ulong Input
%__spirv_BuiltInLocalInvocationId = OpVariable %_ptr_Input_v3ulong Input
    %float_0 = OpConstant %float 0
         %90 = OpExtInst %void %2 DebugSource %89
         %91 = OpExtInst %void %2 DebugCompilationUnit 65536 4 %90 CPP_for_OpenCL
         %92 = OpExtInst %void %2 DebugTypeFunction None %void
         %95 = OpExtInst %void %2 DebugInfoNone
         %96 = OpExtInst %void %2 DebugFunction %93 %92 %90 83 0 %91 %94 FlagIsDefinition|FlagPrototyped|FlagIsOptimized 87 %18 %95
         %99 = OpExtInst %void %2 DebugSource %98
        %100 = OpExtInst %void %2 DebugFunction %97 %92 %99 82 0 %91 %94 FlagIsLocal|FlagIsDefinition|FlagPrototyped|FlagIsOptimized 82 %95 %95
        %102 = OpExtInst %void %2 DebugFunction %101 %92 %99 108 0 %91 %94 FlagIsDefinition|FlagPrototyped|FlagIsOptimized 108 %95 %95
        %104 = OpExtInst %void %2 DebugFunction %103 %92 %99 77 0 %91 %94 FlagIsLocal|FlagIsDefinition|FlagPrototyped|FlagIsOptimized 77 %95 %95
        %105 = OpExtInst %void %2 DebugFunction %101 %92 %99 102 0 %91 %94 FlagIsDefinition|FlagPrototyped|FlagIsOptimized 102 %95 %95
        %107 = OpExtInst %void %2 DebugFunction %106 %92 %99 72 0 %91 %94 FlagIsLocal|FlagIsDefinition|FlagPrototyped|FlagIsOptimized 72 %95 %95
        %108 = OpExtInst %void %2 DebugFunction %101 %92 %99 96 0 %91 %94 FlagIsDefinition|FlagPrototyped|FlagIsOptimized 96 %95 %95
        %110 = OpExtInst %void %2 DebugFunction %109 %92 %99 83 0 %91 %94 FlagIsLocal|FlagIsDefinition|FlagPrototyped|FlagIsOptimized 83 %95 %95
        %112 = OpExtInst %void %2 DebugFunction %111 %92 %99 109 0 %91 %94 FlagIsDefinition|FlagPrototyped|FlagIsOptimized 109 %95 %95
        %114 = OpExtInst %void %2 DebugFunction %113 %92 %99 78 0 %91 %94 FlagIsLocal|FlagIsDefinition|FlagPrototyped|FlagIsOptimized 78 %95 %95
        %115 = OpExtInst %void %2 DebugFunction %111 %92 %99 103 0 %91 %94 FlagIsDefinition|FlagPrototyped|FlagIsOptimized 103 %95 %95
        %117 = OpExtInst %void %2 DebugFunction %116 %92 %99 73 0 %91 %94 FlagIsLocal|FlagIsDefinition|FlagPrototyped|FlagIsOptimized 73 %95 %95
        %118 = OpExtInst %void %2 DebugFunction %111 %92 %99 97 0 %91 %94 FlagIsDefinition|FlagPrototyped|FlagIsOptimized 97 %95 %95
        %119 = OpExtInst %void %2 DebugInlinedAt 90 %96
        %120 = OpExtInst %void %2 DebugInlinedAt 108 %102 %119
        %122 = OpExtInst %void %2 DebugInlinedAt 90 %96
        %123 = OpExtInst %void %2 DebugInlinedAt 102 %105 %122
        %126 = OpExtInst %void %2 DebugInlinedAt 90 %96
        %127 = OpExtInst %void %2 DebugInlinedAt 96 %108 %126
        %130 = OpExtInst %void %2 DebugInlinedAt 92 %96
        %131 = OpExtInst %void %2 DebugInlinedAt 109 %112 %130
        %133 = OpExtInst %void %2 DebugInlinedAt 92 %96
        %134 = OpExtInst %void %2 DebugInlinedAt 103 %115 %133
        %137 = OpExtInst %void %2 DebugInlinedAt 92 %96
        %138 = OpExtInst %void %2 DebugInlinedAt 97 %118 %137
         %18 = OpFunction %void None %17
   %A_coerce = OpFunctionParameter %_ptr_CrossWorkgroup_float
   %B_coerce = OpFunctionParameter %_ptr_CrossWorkgroup_float
   %C_coerce = OpFunctionParameter %_ptr_CrossWorkgroup_float
          %M = OpFunctionParameter %uint
          %N = OpFunctionParameter %uint
          %K = OpFunctionParameter %uint
      %entry = OpLabel
        %121 = OpExtInst %void %2 DebugScope %100 %120
               OpLine %98 82 50
         %30 = OpLoad %v3ulong %__spirv_BuiltInWorkgroupSize Aligned 32
   %call_i36 = OpCompositeExtract %ulong %30 0
     %conv_i = OpUConvert %uint %call_i36
        %124 = OpExtInst %void %2 DebugScope %104 %123
               OpLine %98 77 50
         %33 = OpLoad %v3ulong %__spirv_BuiltInWorkgroupId Aligned 32
     %call_i = OpCompositeExtract %ulong %33 0
   %conv_i37 = OpUConvert %uint %call_i
        %125 = OpExtInst %void %2 DebugScope %96
               OpLine %89 90 21
        %mul = OpIMul %uint %conv_i37 %conv_i
        %128 = OpExtInst %void %2 DebugScope %107 %127
               OpLine %98 72 51
         %37 = OpLoad %v3ulong %__spirv_BuiltInLocalInvocationId Aligned 32
   %call_i38 = OpCompositeExtract %ulong %37 0
   %conv_i39 = OpUConvert %uint %call_i38
        %129 = OpExtInst %void %2 DebugScope %96
               OpLine %89 90 37
        %add = OpIAdd %uint %mul %conv_i39
        %132 = OpExtInst %void %2 DebugScope %110 %131
               OpLine %98 83 50
         %41 = OpLoad %v3ulong %__spirv_BuiltInWorkgroupSize Aligned 32
   %call_i40 = OpCompositeExtract %ulong %41 1
   %conv_i41 = OpUConvert %uint %call_i40
        %135 = OpExtInst %void %2 DebugScope %114 %134
               OpLine %98 78 50
         %44 = OpLoad %v3ulong %__spirv_BuiltInWorkgroupId Aligned 32
   %call_i42 = OpCompositeExtract %ulong %44 1
   %conv_i43 = OpUConvert %uint %call_i42
        %136 = OpExtInst %void %2 DebugScope %96
               OpLine %89 92 21
       %mul8 = OpIMul %uint %conv_i43 %conv_i41
        %139 = OpExtInst %void %2 DebugScope %117 %138
               OpLine %98 73 51
         %48 = OpLoad %v3ulong %__spirv_BuiltInLocalInvocationId Aligned 32
   %call_i44 = OpCompositeExtract %ulong %48 1
   %conv_i45 = OpUConvert %uint %call_i44
        %140 = OpExtInst %void %2 DebugScope %96
               OpLine %89 92 37
      %add10 = OpIAdd %uint %mul8 %conv_i45
               OpLine %89 96 22
  %cmp46_not = OpIEqual %bool %K %uint_0
               OpLine %89 96 3
               OpBranchConditional %cmp46_not %for_cond_cleanup %for_body_lr_ph
%for_body_lr_ph = OpLabel
      %mul13 = OpIMul %uint %add10 %K
        %141 = OpExtInst %void %2 DebugScope %96
               OpLine %89 96 3
               OpBranch %for_body
   %for_body = OpLabel
      %k_048 = OpPhi %uint %uint_0 %for_body_lr_ph %inc %for_body
    %acc_047 = OpPhi %float %float_0 %for_body_lr_ph %add18 %for_body
        %144 = OpExtInst %void %2 DebugScope %96
               OpLine %89 100 16
      %mul11 = OpIMul %uint %k_048 %M
               OpLine %89 100 20
      %add12 = OpIAdd %uint %mul11 %add
               OpLine %89 100 12
    %idxprom = OpUConvert %ulong %add12
   %arrayidx = OpInBoundsPtrAccessChain %_ptr_CrossWorkgroup_float %A_coerce %idxprom
         %70 = OpLoad %float %arrayidx Aligned 4
               OpLine %89 100 51
      %add14 = OpIAdd %uint %k_048 %mul13
               OpLine %89 100 35
  %idxprom15 = OpUConvert %ulong %add14
 %arrayidx16 = OpInBoundsPtrAccessChain %_ptr_CrossWorkgroup_float %B_coerce %idxprom15
         %74 = OpLoad %float %arrayidx16 Aligned 4
               OpLine %89 100 33
      %mul17 = OpFMul %float %70 %74
               OpLine %89 100 9
      %add18 = OpFAdd %float %acc_047 %mul17
               OpLine %89 96 28
        %inc = OpIAdd %uint %k_048 %uint_1
               OpLine %89 96 22
        %cmp = OpULessThan %bool %inc %K
               OpLine %89 96 3
               OpBranchConditional %cmp %for_body %for_cond_cleanup_loopexit
%for_cond_cleanup_loopexit = OpLabel
        %142 = OpExtInst %void %2 DebugScope %96
               OpLine %89 105 15
               OpBranch %for_cond_cleanup
%for_cond_cleanup = OpLabel
        %143 = OpExtInst %void %2 DebugScope %96
               OpLine %89 0 0
%acc_0_lcssa = OpPhi %float %float_0 %entry %add18 %for_cond_cleanup_loopexit
               OpLine %89 105 15
      %mul19 = OpIMul %uint %add10 %M
               OpLine %89 105 19
      %add20 = OpIAdd %uint %mul19 %add
               OpLine %89 105 3
  %idxprom21 = OpUConvert %ulong %add20
 %arrayidx22 = OpInBoundsPtrAccessChain %_ptr_CrossWorkgroup_float %C_coerce %idxprom21
               OpLine %89 105 32
               OpStore %arrayidx22 %acc_0_lcssa Aligned 4
               OpLine %89 106 1
               OpReturn
               OpFunctionEnd
         %80 = OpFunction %void None %17
 %A_coerce_0 = OpFunctionParameter %_ptr_CrossWorkgroup_float
 %B_coerce_0 = OpFunctionParameter %_ptr_CrossWorkgroup_float
 %C_coerce_0 = OpFunctionParameter %_ptr_CrossWorkgroup_float
        %M_0 = OpFunctionParameter %uint
        %N_0 = OpFunctionParameter %uint
        %K_0 = OpFunctionParameter %uint
         %87 = OpLabel
         %88 = OpFunctionCall %void %18 %A_coerce_0 %B_coerce_0 %C_coerce_0 %M_0 %N_0 %K_0
               OpReturn
               OpFunctionEnd

@rjodinchr
Copy link
Contributor

I'll try to reproduce with that one then. Thank you

@rjodinchr
Copy link
Contributor

I'm not able to reproduce.
I will need even more assets I think.
Could you run with:

CLVK_LOG=4
CLVK_LOG_DEST=file:log.txt
CLVK_KEEP_TEMPORARIES=1

That should produce a log.txt file as well as a bunch of clvk-XXXXX folders.
Could you make a archive with all of them and upload that here please?

@pvelesko
Copy link
Author

pvelesko commented Dec 5, 2024

╭─pvelesko@cupcake ~/chipStar/clvk-enable/build ‹clvk-enable› 
╰─$ ./tools/opencl-spirv-compiler/opencl-spirv-compiler ./MatrixMul.spv                                                                                       130 ↵
Build failed. Log:
warning: overriding the module target triple with spir-unknown-unknown
warning: Linking two modules of different data layouts: 'Unknown buffer' is 'e-p:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-G1' whereas 'clvk-9YdtEh/source.bc' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-G1'

clspv: /space/pvelesko/clvk/external/clspv/third_party/llvm/llvm/lib/IR/Value.cpp:507: void llvm::Value::doRAUW(llvm::Value*, llvm::Value::ReplaceMetadataUses): Assertion `New->getType() == getType() && "replaceAllUses of value with new value of different type!"' failed.
Aborted (core dumped)

Error processing ./MatrixMul.spv: Failed to build program

clvkRepro.zip

@rjodinchr
Copy link
Contributor

So the issue is that clvk does not realize that the SPIR-V source is using OpMemoryModel Physical64 OpenCL, thus it does not give the proper arguments to clspv.

We can force clvk anyway by running it with CLVK_SPIR_ARCH=spir64, but then I got the following error:

error: 91: [VUID-StandaloneSpirv-OpVariable-04651] OpVariable, <id> '4[%4]', has a disallowed initializer & storage class combination.
From Vulkan spec:
Variable declarations that include initializers must have one of the following storage classes: Output, Private, Function or Workgroup
  %4 = OpVariable %_ptr_PhysicalStorageBuffer_uchar PhysicalStorageBuffer %uchar_1

This is because we have the following line in the OpenCL SPIR-V kernel:

%__chip_module_has_no_IGBAs = OpVariable %_ptr_CrossWorkgroup_uchar CrossWorkgroup %uchar_1

Which gets translated into the following LLVM IR (by llvm-spirv):

@__chip_module_has_no_IGBAs = addrspace(1) constant i8 1

Note that it is on addrspace(1) which is the global addrspace, not the constant one (2).
I need to check, but I think clspv does not support such global variables.

Removing every trace of %__chip_module_has_no_IGBAs from MatrixMul.spv.txt, assembling back to MatrixMul.spv and then running the program succeeded locally:

$ CLVK_SPIRV_ARCH=spir64 ./build/opencl-spirv-compiler ./MatrixMul.spv
Successfully compiled ./MatrixMul.spv to MatrixMul_device.bin

MatrixMul_device.zip

@pvelesko
Copy link
Author

pvelesko commented Dec 5, 2024

__chip_module_has_no_IGBAs is used for optimization and we can generate SPIR-V without it but generally speaking we do need it in global space since we r/w to it. Is the main issue here that it has an initializer or that that it's in global space?

@pvelesko
Copy link
Author

pvelesko commented Dec 5, 2024

Disabled it, and I was able to run a HIP example on Vulkan!

╭─pvelesko@cupcake ~/chipStar/clvk-enable/build ‹clvk-enable●› 
╰─$ CLVK_SPIRV_ARCH=spir64 CHIP_LOGLEVEL=info ./samples/0_MatrixMultiply/MatrixMultiply
CHIP info [TID 418282] [1733410211.090988490] : CHIP_PLATFORM=0
CHIP info [TID 418282] [1733410211.091204631] : CHIP_DEVICE_TYPE=gpu
CHIP info [TID 418282] [1733410211.091225923] : CHIP_DEVICE=0
CHIP info [TID 418282] [1733410211.091261683] : CHIP_BE=opencl
CHIP info [TID 418282] [1733410211.091282034] : CHIP_DUMP_SPIRV=off
CHIP info [TID 418282] [1733410211.091313796] : CHIP_JIT_FLAGS_OVERRIDE=
CHIP info [TID 418282] [1733410211.091328992] : CHIP_L0_COLLECT_EVENTS_TIMEOUT=0
CHIP info [TID 418282] [1733410211.091367973] : CHIP_L0_EVENT_TIMEOUT=0
CHIP info [TID 418282] [1733410211.091383225] : CHIP_SKIP_UNINIT=off
CHIP info [TID 418282] [1733410211.091399049] : CHIP_LAZY_JIT=on
CHIP info [TID 418282] [1733410211.091413432] : CHIP_OCL_DISABLE_QUEUE_PROFILING=off
CHIP info [TID 418282] [1733410211.091429032] : CHIP_OCL_USE_ALLOC_STRATEGY=off
CHIP info [TID 418282] [1733410211.091454056] : CHIP_MODULE_CACHE_DIR=/space/pvelesko/.cache/chipStar
CHIP info [TID 418282] [1733410211.237711581] : OpenCL Devices of type gpu with SPIR-V_1 support:
Intel(R) Graphics (RPL-S)  is supported.

Device name Intel(R) Graphics (RPL-S)
CHIP info [TID 418282] [1733410211.483975250] : clProgramBuild took 0.00377717 seconds
CHIP info [TID 418282] [1733410211.484163837] : Loaded from cache, kernel compilation took 0.00411241 seconds
CHIP info [TID 418282] [1733410211.484184029] : Module compilation took 4133 microseconds
Running 1 iterations 
hipLaunchKernel 0 time taken: 122.767
hipLaunchKernel BEST TIME: 122.767
GPU real time taken(ms): 126.033
matrixMultiplyCPUReference time taken(ms): 4175.84
Verification PASSED!

@pvelesko
Copy link
Author

pvelesko commented Dec 5, 2024

╭─pvelesko@cupcake ~/chipStar/clvk-enable/build ‹clvk-enable●› 
╰─$ ctest -R cuda --output-on-failure                                                                                                                                                                                                         8 ↵
Test project /space/pvelesko/chipStar/clvk-enable/build
      Start 1285: cucc-include-cuda-runtime-twice
 1/28 Test #1285: cucc-include-cuda-runtime-twice ...   Passed    0.19 sec
      Start 1386: cuda-asyncAPI
 2/28 Test #1386: cuda-asyncAPI .....................   Passed    0.73 sec
      Start 1387: cuda-lambda
 3/28 Test #1387: cuda-lambda .......................Subprocess aborted***Exception:   0.50 sec
CHIP error [TID 603321] [1733411241.287751572] : hipErrorNotInitialized (Device library link step failed.) in /space/pvelesko/chipStar/clvk-enable/src/backend/OpenCL/CHIPBackendOpenCL.cc:1184:compile

CHIP error [TID 603321] [1733411241.287990128] : Caught Error: hipErrorNotInitialized
cuda-lambda: /space/pvelesko/chipStar/clvk-enable/samples/cuda_samples/0_Simple/lambda/lambda.cu:23: int main(): Assertion `OutH == 1' failed.

      Start 1388: cuda-matrixMul
 4/28 Test #1388: cuda-matrixMul ....................   Passed    0.69 sec
      Start 1389: cuda-template
 5/28 Test #1389: cuda-template .....................   Passed    0.57 sec
      Start 1390: cuda-vectorAdd
 6/28 Test #1390: cuda-vectorAdd ....................   Passed    0.57 sec
      Start 1391: cuda-clock
 7/28 Test #1391: cuda-clock ........................   Passed    0.91 sec
      Start 1392: cuda-cppIntegration
 8/28 Test #1392: cuda-cppIntegration ...............   Passed    0.57 sec
      Start 1393: cuda-simplePrintf
 9/28 Test #1393: cuda-simplePrintf .................   Passed    0.92 sec
      Start 1394: cuda-simpleAtomicIntrinsics
10/28 Test #1394: cuda-simpleAtomicIntrinsics .......***Failed    0.97 sec
CHIP error [TID 603918] [1733411246.073723476] : hipErrorNotInitialized (Device library link step failed.) in /space/pvelesko/chipStar/clvk-enable/src/backend/OpenCL/CHIPBackendOpenCL.cc:1184:compile

CHIP error [TID 603918] [1733411246.073966304] : Caught Error: hipErrorNotInitialized
/space/pvelesko/chipStar/clvk-enable/samples/cuda_samples/0_Simple/simpleAtomicIntrinsics/simpleAtomicIntrinsics.cu(126) : getLastCudaError() CUDA error : Kernel execution failed : (3) hipErrorNotInitialized.
simpleAtomicIntrinsics starting...
MapSMtoCores for SM 2.0 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 2.0 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 2.0

> GPU device has 1 Multi-Processors, SM 2.0 compute capabilities


      Start 1395: cuda-simpleTemplates
11/28 Test #1395: cuda-simpleTemplates ..............   Passed    0.57 sec
      Start 1396: cuda-simpleCallback
12/28 Test #1396: cuda-simpleCallback ...............   Passed    0.58 sec
      Start 1397: cuda-bandwidthTest
13/28 Test #1397: cuda-bandwidthTest ................   Passed    1.86 sec
      Start 1398: cuda-deviceQuery
14/28 Test #1398: cuda-deviceQuery ..................   Passed    0.58 sec
      Start 1399: cuda-convolutionSeparable
15/28 Test #1399: cuda-convolutionSeparable .........***Failed    3.77 sec
CHIP error [TID 604365] [1733411251.938588085] : hipErrorNotInitialized (Device library link step failed.) in /space/pvelesko/chipStar/clvk-enable/src/backend/OpenCL/CHIPBackendOpenCL.cc:1184:compile

CHIP error [TID 604365] [1733411251.938794831] : Caught Error: hipErrorNotInitialized
CHIP error [TID 604365] [1733411253.437853597] : hipErrorNotInitialized (Device library link step failed.) in /space/pvelesko/chipStar/clvk-enable/src/backend/OpenCL/CHIPBackendOpenCL.cc:1184:compile

CHIP error [TID 604365] [1733411253.437948768] : Caught Error: hipErrorNotInitialized
/space/pvelesko/chipStar/clvk-enable/samples/cuda_samples/2_Graphics/convolutionSeparable/convolutionSeparable.cu(120) : getLastCudaError() CUDA error : convolutionRowsKernel() execution failed
 : (3) hipErrorNotInitialized.
[/space/pvelesko/chipStar/clvk-enable/build/samples/cuda_samples/cuda-convolutionSeparable] - Starting...
MapSMtoCores for SM 2.0 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 2.0 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 2.0

Image Width x Height = 3072 x 3072

Allocating and initializing host arrays...
Allocating and initializing CUDA arrays...
Running GPU convolution (16 identical iterations)...


      Start 1400: cuda-dwtHaar1D
16/28 Test #1400: cuda-dwtHaar1D ....................   Passed    0.58 sec
      Start 1401: cuda-histogram
17/28 Test #1401: cuda-histogram ....................   Passed    2.28 sec
      Start 1402: cuda-binomialoptions
18/28 Test #1402: cuda-binomialoptions ..............***Failed    1.46 sec
CHIP error [TID 604670] [1733411257.766337544] : hipErrorNotInitialized (Device library link step failed.) in /space/pvelesko/chipStar/clvk-enable/src/backend/OpenCL/CHIPBackendOpenCL.cc:1184:compile

CHIP error [TID 604670] [1733411257.766585260] : Caught Error: hipErrorNotInitialized
CUDA error at /space/pvelesko/chipStar/clvk-enable/samples/cuda_samples/4_Finance/binomialOptions/binomialOptions_kernel.cu:144 code=3(hipErrorNotInitialized) "cudaMemcpyToSymbol(d_OptionData, h_OptionData, optN * sizeof(__TOptionData))" 
[/space/pvelesko/chipStar/clvk-enable/build/samples/cuda_samples/cuda-binomialoptions] - Starting...
MapSMtoCores for SM 2.0 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 2.0 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 2.0

Generating input data...
Running GPU binomial tree...

      Start 1403: cuda-blackscholes
19/28 Test #1403: cuda-blackscholes .................   Passed    1.15 sec
      Start 1404: cuda-qrng
20/28 Test #1404: cuda-qrng .........................***Failed    1.31 sec
CHIP error [TID 604858] [1733411260.220601360] : hipErrorNotInitialized (Device library link step failed.) in /space/pvelesko/chipStar/clvk-enable/src/backend/OpenCL/CHIPBackendOpenCL.cc:1184:compile

CHIP error [TID 604858] [1733411260.220850035] : Caught Error: hipErrorNotInitialized
CUDA error at /space/pvelesko/chipStar/clvk-enable/samples/cuda_samples/4_Finance/quasirandomGenerator/quasirandomGenerator_kernel.cu:67 code=3(hipErrorNotInitialized) "cudaMemcpyToSymbol( c_Table, tableCPU, QRNG_DIMENSIONS * QRNG_RESOLUTION * sizeof(unsigned int) )" 
/space/pvelesko/chipStar/clvk-enable/build/samples/cuda_samples/cuda-qrng Starting...

MapSMtoCores for SM 2.0 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 2.0 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 2.0

Allocating GPU memory...
Allocating CPU memory...
Initializing QRNG tables...


      Start 1405: cuda-mergesort
21/28 Test #1405: cuda-mergesort ....................***Failed    2.32 sec
/space/pvelesko/chipStar/clvk-enable/build/samples/cuda_samples/cuda-mergesort Starting...

MapSMtoCores for SM 2.0 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 2.0 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 2.0

Allocating and initializing host arrays...

Allocating and initializing CUDA arrays...

Initializing GPU merge sort...
Running GPU merge sort...
error: 207: [VUID-StandaloneSpirv-OpVariable-04734] OpVariable, <id> '16[%16]', initializers are limited to OpConstantNull in Workgroup storage class
  %16 = OpVariable %_ptr_Workgroup_uint Workgroup %uint_0

CHIP error [TID 604945] [1733411262.543079372] : hipErrorNotInitialized (Device library link step failed.) in /space/pvelesko/chipStar/clvk-enable/src/backend/OpenCL/CHIPBackendOpenCL.cc:1184:compile

CHIP error [TID 604945] [1733411262.543309720] : Caught Error: hipErrorNotInitialized
/space/pvelesko/chipStar/clvk-enable/samples/cuda_samples/6_Advanced/mergeSort/mergeSort.cu(179) : getLastCudaError() CUDA error : mergeSortShared<1><<<>>> failed
 : (3) hipErrorNotInitialized.

      Start 1406: cuda-scalarprod
22/28 Test #1406: cuda-scalarprod ...................   Passed    0.62 sec
      Start 1407: cuda-scan
23/28 Test #1407: cuda-scan .........................***Failed    1.31 sec
CHIP error [TID 605142] [1733411264.471347575] : hipErrorNotInitialized (Device library link step failed.) in /space/pvelesko/chipStar/clvk-enable/src/backend/OpenCL/CHIPBackendOpenCL.cc:1184:compile

CHIP error [TID 605142] [1733411264.471579465] : Caught Error: hipErrorNotInitialized
/space/pvelesko/chipStar/clvk-enable/samples/cuda_samples/6_Advanced/scan/scan.cu(232) : getLastCudaError() CUDA error : scanExclusiveShared() execution FAILED
 : (3) hipErrorNotInitialized.
/space/pvelesko/chipStar/clvk-enable/build/samples/cuda_samples/cuda-scan Starting...

MapSMtoCores for SM 2.0 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 2.0 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 2.0

Allocating and initializing host arrays...
Allocating and initializing CUDA arrays...
Initializing CUDA-C scan...

*** Running GPU scan for short arrays (100 identical iterations)...

Running scan for 4 elements (1703936 arrays)...

      Start 1408: cuda-sortnet
24/28 Test #1408: cuda-sortnet ......................   Passed    3.68 sec
      Start 1409: cuda-FDTD3d
25/28 Test #1409: cuda-FDTD3d .......................***Failed    9.41 sec
CHIP error [TID 605330] [1733411277.538024085] : hipErrorNotInitialized (Device library link step failed.) in /space/pvelesko/chipStar/clvk-enable/src/backend/OpenCL/CHIPBackendOpenCL.cc:1184:compile

CHIP error [TID 605330] [1733411277.538260836] : Caught Error: hipErrorNotInitialized
CUDA error at /space/pvelesko/chipStar/clvk-enable/samples/cuda_samples/6_Advanced/FDTD3d/FDTD3dGPU.cu:113 code=3(hipErrorNotInitialized) "cudaFuncGetAttributes(&funcAttrib, FiniteDifferencesKernel)" 
/space/pvelesko/chipStar/clvk-enable/build/samples/cuda_samples/cuda-FDTD3d Starting...

Set-up, based upon target device GMEM size...
 getTargetDeviceGlobalMemSize
 cudaGetDeviceCount
MapSMtoCores for SM 2.0 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 2.0 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 2.0

 cudaGetDeviceProperties
 generateRandomData

FDTD on 376 x 376 x 376 volume with symmetric filter radius 4 for 5 timesteps...

fdtdReference...
 calloc intermediate
 Host FDTD loop
        t = 0
        t = 1
        t = 2
        t = 3
        t = 4

fdtdReference complete
fdtdGPU...
MapSMtoCores for SM 2.0 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 2.0 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 2.0


      Start 1410: cuda-sobolqrng
26/28 Test #1410: cuda-sobolqrng ....................   Passed    0.69 sec
      Start 1411: cuda-reduction
27/28 Test #1411: cuda-reduction ....................***Failed   24.47 sec
CHIP error [TID 605528] [1733411291.095725620] : hipErrorNotInitialized (Device library link step failed.) in /space/pvelesko/chipStar/clvk-enable/src/backend/OpenCL/CHIPBackendOpenCL.cc:1184:compile

CHIP error [TID 605528] [1733411291.095991100] : Caught Error: hipErrorNotInitialized
CHIP error [TID 605528] [1733411302.720568772] : hipErrorNotInitialized (Device library link step failed.) in /space/pvelesko/chipStar/clvk-enable/src/backend/OpenCL/CHIPBackendOpenCL.cc:1184:compile

CHIP error [TID 605528] [1733411302.720723338] : Caught Error: hipErrorNotInitialized
/space/pvelesko/chipStar/clvk-enable/samples/cuda_samples/6_Advanced/reduction/reduction.cpp(294) : getLastCudaError() CUDA error : Kernel execution failed : (3) hipErrorNotInitialized.
/space/pvelesko/chipStar/clvk-enable/build/samples/cuda_samples/cuda-reduction Starting...

MapSMtoCores for SM 2.0 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 2.0 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 2.0

Using Device 0: Intel(R) Graphics (RPL-S)

Reducing array of type int

16777216 elements
256 threads (max)
64 blocks


      Start 1412: cuda-fastwalsh
28/28 Test #1412: cuda-fastwalsh ....................   Passed    3.61 sec

68% tests passed, 9 tests failed out of 28

Total Test time (real) =  66.93 sec

The following tests FAILED:
        1387 - cuda-lambda (Subprocess aborted)
        1394 - cuda-simpleAtomicIntrinsics (Failed)
        1399 - cuda-convolutionSeparable (Failed)
        1402 - cuda-binomialoptions (Failed)
        1404 - cuda-qrng (Failed)
        1405 - cuda-mergesort (Failed)
        1407 - cuda-scan (Failed)
        1409 - cuda-FDTD3d (Failed)
        1411 - cuda-reduction (Failed)
Errors while running CTest
╭─pvelesko@cupcake ~/chipStar/clvk-enable/build ‹clvk-enable●› 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants