
[FEA] Add support for architecture/family specific features #469

@ZzEeKkAa

Description


Is your feature request related to a problem? Please describe.

In the nvmath device API we are adding support for a library that uses architecture/family-specific features. We have a device function in LTO-IR that was built with architecture/family (a/f) specific features, and we want to link it into a Numba kernel.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/#feature-availability
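For context, per the feature-availability docs above, instructions such as `tensormap.replace` are only accepted when compiling for the architecture-specific target (e.g. `sm_90a`); the plain `sm_90` target rejects them. A sketch of how the device-function side might be built with nvcc (file names are illustrative, and the exact flags may vary by toolkit version):

```shell
# Compile to relocatable device code with embedded LTO-IR, targeting the
# architecture-specific sm_90a; plain -arch=sm_90 would reject
# tensormap.replace at ptxas time.
nvcc -arch=sm_90a -dc -dlto set_descriptor.cu -o set_descriptor.o
```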

Describe the solution you'd like
An API to enable architecture/family-specific features for kernels, device functions, and `cuda.declare_device`.
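One possible shape for such an API, purely hypothetical (the `arch_specific` keyword below is invented for illustration; `set_desc` is the `CUSource` from the reproducer further down):

```python
# Hypothetical: opt the device function and the kernel into the
# architecture-specific target, so the linker uses sm_90a rather than
# sm_90 on a Hopper device.
set_descriptor = cuda.declare_device(
    "set_descriptor",
    types.int32(types.CPointer(types.int32)),
    link=[set_desc],
    arch_specific=True,   # invented keyword
)

@cuda.jit(arch_specific=True)  # invented keyword
def kernel(a):
    ...
```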

Describe alternatives you've considered
Numba could automatically detect whether an a/f-specific feature was used and set the corresponding compiler/linker flag itself.
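A sketch of what such detection might look like, assuming a hand-maintained list of instruction mnemonics that the plain `sm_XY` targets reject (the list below is illustrative, not exhaustive):

```python
# Illustrative detection pass: mnemonics that require an
# architecture-specific target such as sm_90a. A real implementation
# would need a complete, per-architecture table.
AF_SPECIFIC_INSTRUCTIONS = (
    "tensormap.replace",
    "wgmma.",
    "setmaxnreg.",
)

def needs_af_specific_target(ptx: str) -> bool:
    """Return True if the PTX uses an instruction plain sm_XY rejects."""
    return any(instr in ptx for instr in AF_SPECIFIC_INSTRUCTIONS)

print(needs_af_specific_target(
    "tensormap.replace.tile.rank.shared::cta.b1024.b32 [%0], 2;"
))  # True
print(needs_af_specific_target("add.s32 %r1, %r2, %r3;"))  # False
```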

Additional context
I want the following to work:

```python
from numba import cuda
from numba.core import types

import numpy as np

import cffi

set_desc = cuda.CUSource("""
#include <cuda_fp16.h>

extern "C" __device__
int set_descriptor(int *out, int* smem) {
    unsigned usmem = __cvta_generic_to_shared(smem);
    asm volatile("tensormap.replace.tile.rank.shared::cta.b1024.b32 [%0], 2;" :: "r"(usmem));
    return 0;
}
""")

set_descriptor = cuda.declare_device(
    "set_descriptor",
    types.int32(types.CPointer(types.int32)),
    link=[set_desc],
)

ffi = cffi.FFI()

@cuda.jit
def kernel(a):
    sm = cuda.shared.array(1, dtype=np.int32)
    data_ptr = ffi.from_buffer(sm)
    set_descriptor(data_ptr)

    # just to prevent optimization:
    sm[0] = 2
    cuda.syncthreads()
    a[0] = sm[0]

a = np.ones(1, dtype=np.int32)

kernel[1, 1](a)

assert a[0] == 2
```

Currently, on sm_90 this fails with an error:

```
numba_cuda/numba/cuda/dispatcher.py:700: in __call__
    return self.dispatcher.call(
numba_cuda/numba/cuda/dispatcher.py:1022: in call
    kernel = _dispatcher.Dispatcher._cuda_call(self, *args)
numba_cuda/numba/cuda/dispatcher.py:1030: in _compile_for_args
    return self.compile(tuple(argtypes))
.venv/lib/python3.10/site-packages/numba/core/compiler_lock.py:35: in _acquire_compile_lock
    return func(*args, **kwargs)
numba_cuda/numba/cuda/dispatcher.py:1298: in compile
    kernel.bind()
numba_cuda/numba/cuda/dispatcher.py:331: in bind
    cufunc = self._codelibrary.get_cufunc()
numba_cuda/numba/cuda/codegen.py:339: in get_cufunc
    cubin = self.get_cubin(cc=device.compute_capability)
numba_cuda/numba/cuda/codegen.py:318: in get_cubin
    cubin = linker.complete()
numba_cuda/numba/cuda/cudadrv/driver.py:3084: in complete
    result = self.linker.link("cubin")
.venv/lib/python3.10/site-packages/cuda/core/experimental/_linker.py:448: in link
    _nvjitlink.complete(self._mnff.handle)
cuda/bindings/nvjitlink.pyx:156: in cuda.bindings.nvjitlink.complete
    ???
cuda/bindings/nvjitlink.pyx:166: in cuda.bindings.nvjitlink.complete
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   cuda.bindings.nvjitlink.nvJitLinkError: ERROR_PTX_COMPILE (4)
E   Linker error log: ptxas application ptx input, line 35; error   : Instruction 'tensormap.replace' not supported on .target 'sm_90'
E   ptxas fatal   : Ptx assembly aborted due to errors
E   ERROR NVJITLINK_ERROR_PTX_COMPILE: JIT the PTX (ltoPtx)
E

cuda/bindings/nvjitlink.pyx:66: nvJitLinkError
```
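The traceback shows the cubin being built from `device.compute_capability`, which maps to the plain `sm_90` target. If the architecture-specific variant were passed through instead, the link should succeed. A sketch using cuda.core's experimental linker directly (assuming `LinkerOptions` accepts an `arch` string and `ObjectCode.from_ltoir` exists, as in recent cuda.core releases; `ltoir_bytes` is a placeholder for the compiled LTO-IR):

```python
from cuda.core.experimental import Linker, LinkerOptions, ObjectCode

# Sketch: request the architecture-specific target so ptxas accepts
# tensormap.replace when JIT-compiling the LTO-IR.
options = LinkerOptions(arch="sm_90a", link_time_optimization=True)
linker = Linker(ObjectCode.from_ltoir(ltoir_bytes), options=options)
cubin = linker.link("cubin")
```

The feature request amounts to exposing this `sm_90a`-style choice through Numba's kernel/device-function APIs rather than requiring users to drive the linker by hand.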
