Proposal
Flash should replace its existing, hand-written GEMM operator (see here) with the production-quality GEMM operators provided by cuBLASDx.
Pros
- Maintainers would no longer need to maintain or improve their own GEMM operator infrastructure. Instead, we defer that task to the official in-kernel CUDA libraries, from which we expect good-enough performance. "Good enough" does not mean cuBLAS-level performance, although that would be splendid 😄, because most of the workloads we target for fusion are not compute-bound. Reaching at least 70% of cuBLAS performance should suffice.
- It would address the performance issues we had with fp16. Specifically, Flash would use the shared memory layouts suggested by cuBLASDx instead of its own.
Cons
- Dependency management: Flash would take on cuBLASDx as an external dependency.
Implementation
To eliminate global memory round trips, Flash requires a GEMM operator that returns its results in registers. cuBLASDx currently fulfills this requirement through its Register API, which was plausibly motivated by this RFC of mine.
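To make the register requirement concrete, here is a minimal sketch in plain CUDA. It does not use cuBLASDx's Register API, and it is not Flash's current operator. The names `tile_gemm_to_register` and `fused_gemm_bias_relu`, the bias + ReLU epilogue, and the tile size are all assumptions chosen for illustration. The point is only that the GEMM accumulator stays in a register until the fused epilogue has consumed it, so the intermediate product is never written to global memory.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // illustrative tile size, not taken from Flash

// Hypothetical device-side GEMM operator: multiplies one TILE x TILE block
// staged in shared memory and returns this thread's partial sum in a register.
__device__ float tile_gemm_to_register(const float (&a_s)[TILE][TILE],
                                       const float (&b_s)[TILE][TILE],
                                       float acc) {
    #pragma unroll
    for (int k = 0; k < TILE; ++k)
        acc += a_s[threadIdx.y][k] * b_s[k][threadIdx.x];
    return acc;
}

// Fused kernel: out = relu(A * B + bias). The product A * B lives only in
// `acc` (a register); global memory sees just the final epilogue result.
__global__ void fused_gemm_bias_relu(const float* A, const float* B,
                                     const float* bias, float* out,
                                     int M, int N, int K) {
    __shared__ float a_s[TILE][TILE];
    __shared__ float b_s[TILE][TILE];
    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int k0 = 0; k0 < K; k0 += TILE) {
        const int ka = k0 + threadIdx.x;  // column of A loaded by this thread
        const int kb = k0 + threadIdx.y;  // row of B loaded by this thread
        a_s[threadIdx.y][threadIdx.x] = (row < M && ka < K) ? A[row * K + ka] : 0.0f;
        b_s[threadIdx.y][threadIdx.x] = (kb < K && col < N) ? B[kb * N + col] : 0.0f;
        __syncthreads();
        acc = tile_gemm_to_register(a_s, b_s, acc);
        __syncthreads();
    }
    // Epilogue applied directly to the register-resident result.
    if (row < M && col < N) out[row * N + col] = fmaxf(acc + bias[col], 0.0f);
}

int main() {
    const int M = 33, N = 20, K = 50;  // deliberately not multiples of TILE
    std::vector<float> hA(M * K), hB(K * N), hBias(N), hOut(M * N);
    for (int i = 0; i < M * K; ++i) hA[i] = 0.01f * float((i % 7) - 3);
    for (int i = 0; i < K * N; ++i) hB[i] = 0.02f * float((i % 5) - 2);
    for (int j = 0; j < N; ++j) hBias[j] = 0.001f * float(j - 10);

    float *dA, *dB, *dBias, *dOut;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dBias, hBias.size() * sizeof(float));
    cudaMalloc(&dOut, hOut.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dBias, hBias.data(), hBias.size() * sizeof(float), cudaMemcpyHostToDevice);

    const dim3 block(TILE, TILE);
    const dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    fused_gemm_bias_relu<<<grid, block>>>(dA, dB, dBias, dOut, M, N, K);
    if (cudaGetLastError() != cudaSuccess) { std::printf("launch failed\n"); return 1; }
    cudaMemcpy(hOut.data(), dOut, hOut.size() * sizeof(float), cudaMemcpyDeviceToHost);

    // CPU reference for the same fused computation.
    float max_err = 0.0f;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += hA[i * K + k] * hB[k * N + j];
            ref = std::fmax(ref + hBias[j], 0.0f);
            max_err = std::fmax(max_err, std::fabs(ref - hOut[i * N + j]));
        }
    std::printf("max abs error vs CPU reference: %g\n", max_err);

    cudaFree(dA); cudaFree(dB); cudaFree(dBias); cudaFree(dOut);
    return max_err < 1e-4f ? 0 : 1;
}
```

With a library-provided operator, the body of `tile_gemm_to_register` is the part that would be replaced; the surrounding fusion logic only relies on the result arriving in registers.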