This repository has been archived by the owner on Oct 11, 2024. It is now read-only.
forked from vllm-project/vllm
[WIP, Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel #401
Closed
LucasWilkinson wants to merge 230 commits into neuralmagic:lwilkinson/scalar-type-cherrypick from neuralmagic:lwilkinson/machete
Notes
This PR is a work in progress and is based on vllm-project#6396, which will have to land before this one.
Description
This PR introduces a spiritual successor to the Marlin kernel, optimized for Hopper architectures and based on CUTLASS.
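For orientation, "mixed precision linear" in kernels of the Marlin family typically means low-bit integer weights multiplied by 16-bit floating-point activations. Below is a minimal pure-Python sketch of that reference semantics only, not of the kernel itself. It assumes unsigned 4-bit weights with a fixed zero point of 8 and one scale per group of rows; the function names, the zero point, and the grouping are illustrative assumptions and are not taken from this PR.

```python
# Reference semantics only: a weight-only quantized linear layer, y = x @ W,
# where W is stored as 4-bit integers plus one scale per group of rows.
# All names and parameters here are hypothetical, not taken from the PR.

def dequantize(q_weight, scales, group_size, zero_point=8):
    """Expand unsigned 4-bit weights (values 0..15) into floats.

    q_weight: K x N list of ints
    scales:   (K // group_size) x N list of floats
    """
    return [
        [(q - zero_point) * scales[k // group_size][n] for n, q in enumerate(row)]
        for k, row in enumerate(q_weight)
    ]


def mixed_precision_linear(x, q_weight, scales, group_size):
    """Compute x @ dequantize(q_weight) with float accumulation."""
    w = dequantize(q_weight, scales, group_size)
    n_cols = len(w[0])
    return [
        [sum(x_row[k] * w[k][n] for k in range(len(w))) for n in range(n_cols)]
        for x_row in x
    ]


# One activation row (M=1, K=4) against a 4 x 2 quantized weight matrix.
x = [[1.0, 2.0, 3.0, 4.0]]
q_weight = [[8, 9], [10, 7], [8, 8], [12, 6]]
scales = [[0.5, 1.0], [0.25, 2.0]]  # two groups of two rows each
y = mixed_precision_linear(x, q_weight, scales, group_size=2)
print(y)  # [[6.0, -17.0]]
```

A fused GPU kernel avoids materializing the dequantized weight matrix in memory; the sketch materializes it only to keep the arithmetic easy to follow.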
Motivation
The motivation for this kernel is multifold:
- … `mma` instructions, which are the fastest tensor core instructions available on Ampere. With Hopper, Nvidia released a set of new `wgmma` instructions, which are required to hit the peak FLOPs reported by Nvidia; without them (i.e. using `mma` instructions) you can expect to achieve at best ~75% of peak [1, 2].
- … `mma` instructions; we want to adopt a more flexible/dynamic way of defining these layouts so we can accommodate new instructions more rapidly, i.e. `wgmma` and any new instructions Blackwell introduces.

TODO:
Current Performance
Float16
BFloat16