[FEA] Migrate to cuBLASDx

### Proposal
Instead of relying on its existing, hand-crafted GEMM operator (see [here](https://github.com/osayamenja/Kleos/blob/main/csrc/include/kleos/os/processor/mmaConfig.cuh)), Flash should leverage the production-quality GEMM operators provided by [cuBLASDx](https://docs.nvidia.com/cuda/cublasdx/).

### Pros
- Maintainers would no longer need to maintain or improve their own GEMM operator infrastructure. Instead, we defer that task to official in-kernel CUDA libraries, from which we expect _good enough_ performance. By "good enough", we do not mean cuBLAS-level performance, although that would be splendid 😄, since most of the workloads we target for fusion are generally not _compute-bound_. Reaching at least 70% of cuBLAS performance should suffice.
- Addresses the performance issues we had with fp16. Specifically, Flash would instead leverage the [suggested shared memory layouts](https://docs.nvidia.com/cuda/cublasdx/api/other_methods.html#suggested-shared-memory-layout) provided by cuBLASDx.

### Cons
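- Dependency management: Flash would take on cuBLASDx as an additional external dependency that must be obtained and kept up to date.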

### Implementation 
To eliminate global memory roundtrips, Flash _requires_ a GEMM operator that returns its results in registers. Currently, cuBLASDx fulfills this requirement via its [Register API](https://docs.nvidia.com/cuda/cublasdx/api/methods.html#register-api), which was plausibly motivated by this [RFC of mine](https://github.com/NVIDIA/CUDALibrarySamples/issues/233) ☺️
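
To make the requirement concrete, here is a minimal host-side C++ sketch of the pattern, not cuBLASDx or Flash code. It assumes a hypothetical `gemm_to_registers` that returns its accumulator by value (a stand-in for a register-resident fragment) and a hypothetical fused consumer `gemm_relu_fused`; the tile sizes and the ReLU epilogue are chosen only for illustration. The point is that the consumer reads the accumulator directly and writes the final result once, with no intermediate output buffer in between.

```cpp
#include <array>
#include <cstddef>

// Hypothetical tile sizes, chosen only for illustration.
constexpr std::size_t M = 2, N = 2, K = 3;

// Stand-in for a thread's register-resident accumulator fragment.
using Fragment = std::array<float, M * N>;

// GEMM that returns its accumulator by value instead of writing it
// to an output buffer. a is M x K and b is K x N, both row-major.
Fragment gemm_to_registers(const float* a, const float* b) {
    Fragment acc{};  // zero-initialized accumulator
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < K; ++k)
                acc[i * N + j] += a[i * K + k] * b[k * N + j];
    return acc;
}

// Fused consumer: applies an epilogue (ReLU here) to the accumulator
// and performs a single write of the final result to `out`.
void gemm_relu_fused(const float* a, const float* b, float* out) {
    const Fragment acc = gemm_to_registers(a, b);
    for (std::size_t i = 0; i < M * N; ++i)
        out[i] = acc[i] > 0.0f ? acc[i] : 0.0f;
}
```

In an actual kernel the same shape would apply, with the fragment living in registers and the next fused stage consuming it before anything is written to global memory.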
