
[WIP, Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel #401

Conversation

@LucasWilkinson (Collaborator) commented Aug 2, 2024

Notes

This PR is a work in progress and is based on vllm-project#6396, so that will have to land before this one.

Description

This PR introduces a spiritual successor to the Marlin kernel, optimized for Hopper architectures and built on CUTLASS.

Motivation

The motivation for this kernel is threefold:

  1. Marlin (v1) uses mma instructions, which are the fastest tensor core instructions available on Ampere. With Hopper, NVIDIA released a new set of wgmma instructions that are required to hit the peak FLOPs NVIDIA reports; without them (i.e., using mma instructions) you can expect to achieve at best ~75% of peak [1, 2].
  2. Marlin (v1) uses a specific weight storage layout that is specialized for the mma instructions. We want a more flexible/dynamic way of defining these layouts so we can accommodate new instructions more rapidly, i.e. wgmma and any new instructions Blackwell may introduce.
    • MarlinV2 achieves this by describing the weight storage scheme using CUTLASS and CuTe; see the sketch after this list.
  3. Marlin (v1) does not support CUTLASS epilogues. We eventually plan to investigate sub-byte weight quantization combined with activation quantization; for activation quantization we'd like to leverage the great work done by @tlrmchlsmth, @varun-sundar-rabindranath, and @ProExpertProg writing custom CUTLASS epilogues for fp8 and int8.
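
To make point 2 concrete, here is a minimal, hypothetical sketch (not Machete's actual layout) of how CuTe expresses a weight storage scheme as composable shape/stride algebra, assuming CUTLASS 3.x headers are on the include path. The tile shape and grid below are made-up illustration values:

```cpp
// Hypothetical sketch: a weight layout built from CuTe layout algebra.
// Swapping the instruction (mma -> wgmma) would only mean swapping the
// tile layout declaration, not rewriting the surrounding kernel.
#include <cute/tensor.hpp>

int main() {
  using namespace cute;

  // A 16x16 column-major tile (e.g., the atom one instruction consumes)...
  auto tile = make_layout(make_shape(Int<16>{}, Int<16>{}),
                          make_stride(Int<1>{}, Int<16>{}));

  // ...repeated over an 8x8 grid of tiles to cover a 128x128 weight block.
  auto grid  = make_layout(make_shape(Int<8>{}, Int<8>{}));
  auto block = blocked_product(tile, grid);

  print(block);  // prints the composed shape and strides
  return 0;
}
```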

TODO:

  • Choose a new name (candidates: wahoo, swordfish (kinda cutlass + marlin), non-fish names ...): edit: chose machete
  • Improve heuristic, namely for 4096x4096: resolved by moving the heuristic into the C++ code
  • Improve BFloat16 performance (via bit shift or interleaving; see the sketch after this list) (future PR)
  • E2E integration (future PR)
  • Improve batch size < 32 performance, likely by improving the stream-k scheduler (future PR)
  • Investigate fp8 activation support (future PR)
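
For the BFloat16 item above, here is an illustrative host-side sketch of the kind of bit-manipulation trick in question (not Machete's kernel-side implementation): splice an unsigned 4-bit value into the mantissa of a bf16 power of two, then subtract the bias, avoiding an integer-to-float conversion:

```cpp
// Illustrative sketch of uint4 -> bf16 conversion via bit manipulation:
// OR the nibble into the mantissa of bf16(128.0), where the mantissa LSB
// has weight 2^0, then subtract 128.0 to recover the integer value.
#include <cstdint>
#include <cstdio>
#include <cstring>

// Interpret a 16-bit bf16 pattern as the top half of a float32.
static float bf16_to_float(uint16_t bits) {
  uint32_t u = static_cast<uint32_t>(bits) << 16;
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

int main() {
  for (unsigned nibble = 0; nibble < 16; ++nibble) {
    // 0x4300 is the bf16 bit pattern for 128.0.
    uint16_t bits = static_cast<uint16_t>(0x4300 | nibble);  // bf16(128 + n)
    float val = bf16_to_float(bits) - 128.0f;                // recover n
    std::printf("%2u -> %4.1f\n", nibble, val);
  }
  return 0;
}
```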

Current Performance

Float16

[Graph: MarlinV2 float16 benchmark results]

BFloat16

[Graph: MarlinV2 bfloat16 benchmark results]


github-actions bot commented Aug 2, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@LucasWilkinson changed the title from squash-patch changes to [WIP, Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel on Aug 2, 2024
@LucasWilkinson force-pushed the lwilkinson/scalar-type-cherrypick branch from a926e67 to 36ee4f4 on August 2, 2024 18:28
mgoin and others added 24 commits August 2, 2024 13:51
Support pipeline-parallelism with Ray accelerated DAG.

Signed-off-by: Rui Qiao <[email protected]>
…oject#6883)

Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
@LucasWilkinson force-pushed the lwilkinson/machete branch 2 times, most recently from c5c74a7 to a280110 on August 20, 2024 02:53