[FEA] Migrate to cuBLASDx #4

@osayamenja

Proposal

Rather than its existing, manually-crafted GEMM operator (see here), Flash should leverage the production-quality GEMM operators provided by cuBLASDx.

Pros

  • Maintainers would no longer need to maintain or improve their own GEMM operator infrastructure. Instead, we defer that task to the official in-kernel CUDA libraries, from which we expect good-enough performance. By "good enough", we do not mean cuBLAS-level performance (although that would be splendid 😄), since most of the workloads we target for fusion are not compute-bound. At least 70% of cuBLAS performance should suffice.
  • It would address the performance issues we had with fp16. Specifically, Flash would instead use the suggested shared memory layouts provided by cuBLASDx.

Cons

  • Dependency management: Flash would take on cuBLASDx as an additional dependency.

Implementation

To eliminate global memory roundtrips, Flash requires a GEMM operator that returns its results in registers. Currently, cuBLASDx fulfills this requirement via its Register API, which was plausibly motivated by this RFC of mine ☺️
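To make the register requirement concrete, here is a minimal host-side C++ sketch of the pattern. It is not the cuBLASDx API and not Flash's code: the tile sizes, the names `gemm_in_registers` and `fused_gemm_relu`, and the ReLU epilogue are all hypothetical, and a local `std::array` stands in for a per-thread register fragment. The point is only that the consumer reads the accumulator directly, so the intermediate product is never written to and re-read from an output buffer.

```cpp
#include <array>
#include <cstddef>

// Hypothetical tile shape, chosen only for illustration.
constexpr std::size_t M = 4, N = 4, K = 8;

using TileA = std::array<float, M * K>;  // row-major M x K
using TileB = std::array<float, K * N>;  // row-major K x N
using Accum = std::array<float, M * N>;  // stand-in for a register fragment

// Computes C = A * B and returns the accumulator by value,
// mimicking a GEMM operator whose result stays in registers.
Accum gemm_in_registers(const TileA& a, const TileB& b) {
    Accum acc{};  // zero-initialized accumulator
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < N; ++j)
                acc[i * N + j] += a[i * K + k] * b[k * N + j];
    return acc;
}

// Fused consumer (assumed ReLU epilogue): operates on the accumulator
// directly and performs a single write to the output buffer.
void fused_gemm_relu(const TileA& a, const TileB& b, float* out) {
    const Accum acc = gemm_in_registers(a, b);
    for (std::size_t idx = 0; idx < M * N; ++idx)
        out[idx] = acc[idx] > 0.0f ? acc[idx] : 0.0f;
}
```

An unfused version would store the raw product to `out`, then launch a second pass that loads it back to apply the epilogue; that extra store and load is the roundtrip the register-returning operator avoids.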
