
[WIP, Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel #401

Conversation

@LucasWilkinson (Collaborator) commented Aug 2, 2024

Notes

This PR is a work in progress and is based on vllm-project#6396, so that will have to land before this one.

Description

This PR introduces a spiritual successor to the Marlin kernel, optimized for Hopper architectures and built on CUTLASS.

Motivation

The motivation for this kernel is threefold:

  1. Marlin (v1) uses mma instructions, which are the fastest tensor core instructions available on Ampere. With Hopper, NVIDIA released a new set of wgmma instructions that are required to hit the peak FLOPs NVIDIA reports; without them (i.e., using mma instructions) you can expect to achieve at best ~75% of peak [1, 2].
  2. Marlin (v1) uses a specific weight storage layout that is specialized for the mma instructions. We want a more flexible/dynamic way of defining these layouts so we can accommodate new instructions more rapidly, i.e. wgmma and any new instructions Blackwell may introduce.
    • MarlinV2 achieves this by describing the weight storage scheme using CUTLASS and CuTe; see the sketch after this list.
  3. Marlin (v1) does not support CUTLASS epilogues. We eventually plan to investigate sub-byte weight quantization combined with activation quantization; for activation quantization we'd like to leverage the great work done by @tlrmchlsmth, @varun-sundar-rabindranath, and @ProExpertProg writing custom CUTLASS epilogues for fp8 and int8.
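
To make point 2 concrete, here is a minimal, hypothetical sketch (not Machete's actual layout) of how CuTe expresses a weight storage scheme as composable shape/stride algebra, assuming CUTLASS 3.x headers are on the include path. The tile shape and grid below are made-up illustration values:

```cpp
// Hypothetical sketch: a weight layout built from CuTe layout algebra.
// Swapping the instruction (mma -> wgmma) would only mean swapping the
// tile layout declaration, not rewriting the surrounding kernel.
#include <cute/tensor.hpp>

int main() {
  using namespace cute;

  // A 16x16 column-major tile (e.g., the atom one instruction consumes)...
  auto tile = make_layout(make_shape(Int<16>{}, Int<16>{}),
                          make_stride(Int<1>{}, Int<16>{}));

  // ...repeated over an 8x8 grid of tiles to cover a 128x128 weight block.
  auto grid  = make_layout(make_shape(Int<8>{}, Int<8>{}));
  auto block = blocked_product(tile, grid);

  print(block);  // prints the composed shape and strides
  return 0;
}
```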

TODO:

  • Choose a new name (candidates: wahoo, swordfish (kinda cutlass + marlin), non-fish names ...): edit: chose machete
  • Improve heuristic, namely for 4096x4096: resolved by moving the heuristic into the C++ code
  • Improve BFloat16 performance (via bit shift or interleaving; see the sketch after this list) (future PR)
  • E2E integration (future PR)
  • Improve batch size < 32 performance, likely by improving the stream-k scheduler (future PR)
  • Investigate fp8 activation support (future PR)
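
For the BFloat16 item above, here is an illustrative host-side sketch of the kind of bit-manipulation trick in question (not Machete's kernel-side implementation): splice an unsigned 4-bit value into the mantissa of a bf16 power of two, then subtract the bias, avoiding an integer-to-float conversion:

```cpp
// Illustrative sketch of uint4 -> bf16 conversion via bit manipulation:
// OR the nibble into the mantissa of bf16(128.0), where the mantissa LSB
// has weight 2^0, then subtract 128.0 to recover the integer value.
#include <cstdint>
#include <cstdio>
#include <cstring>

// Interpret a 16-bit bf16 pattern as the top half of a float32.
static float bf16_to_float(uint16_t bits) {
  uint32_t u = static_cast<uint32_t>(bits) << 16;
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

int main() {
  for (unsigned nibble = 0; nibble < 16; ++nibble) {
    // 0x4300 is the bf16 bit pattern for 128.0.
    uint16_t bits = static_cast<uint16_t>(0x4300 | nibble);  // bf16(128 + n)
    float val = bf16_to_float(bits) - 128.0f;                // recover n
    std::printf("%2u -> %4.1f\n", nibble, val);
  }
  return 0;
}
```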

Current Performance

Float16

[Graph: MarlinV2 float16 benchmark results]

BFloat16

[Graph: MarlinV2 bfloat16 benchmark results]


github-actions bot commented Aug 2, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@LucasWilkinson changed the title from squash-patch changes to [WIP, Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel on Aug 2, 2024
@LucasWilkinson force-pushed the lwilkinson/scalar-type-cherrypick branch from a926e67 to 36ee4f4 on August 2, 2024 18:28
mgoin and others added 24 commits August 2, 2024 13:51
Support pipeline-parallelism with Ray accelerated DAG.

Signed-off-by: Rui Qiao <[email protected]>
…oject#6883)

Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
@LucasWilkinson force-pushed the lwilkinson/machete branch 2 times, most recently from c5c74a7 to a280110 on August 20, 2024 02:53