# v0.1.0
Initial release of 🪄 nm-vllm 🪄
nm-vllm is Neural Magic's fork of vLLM with an opinionated focus on incorporating the latest LLM optimizations like quantization and sparsity for enhanced performance.
This release is based on `vllm==0.3.2`.
## Key Features
This first release delivers our initial LLM performance contributions: support for Marlin, a highly optimized FP16xINT4 matmul kernel, and acceleration for weight-sparse models.
### Model Inference with Marlin (4-bit Quantization)
Marlin is enabled automatically if a quantized model has the `"is_marlin_format": true` flag present in its `quant_config.json`.
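As a sketch, a Marlin-formatted checkpoint's `quant_config.json` might look like the following (the other fields are typical GPTQ settings shown only for illustration; `is_marlin_format` is the flag that triggers Marlin):

```json
{
  "bits": 4,
  "group_size": 128,
  "sym": true,
  "is_marlin_format": true
}
```

With that flag present, inference works without any extra arguments: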
```python
from vllm import LLM

model = LLM("neuralmagic/llama-2-7b-chat-marlin")
print(model.generate("Hello quantized world!"))
```
Optionally, you can specify it explicitly by setting `quantization="marlin"`.
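A minimal sketch using the same checkpoint as above, with auto-detection bypassed:

```python
from vllm import LLM

# Request the Marlin kernels explicitly rather than relying on
# auto-detection from quant_config.json
model = LLM("neuralmagic/llama-2-7b-chat-marlin", quantization="marlin")
print(model.generate("Hello quantized world!"))
```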
### Model Inference with Weight Sparsity
nm-vllm includes support for newly developed sparse inference kernels, which provide both memory reduction and inference acceleration for sparse models.
Here is an example of running a 50% sparse OpenHermes 2.5 Mistral 7B model fine-tuned for instruction-following:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
    sparsity="sparse_w16a16",
    max_model_len=1024,
)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
There is also support for semi-structured 2:4 sparsity using the `sparsity="semi_structured_sparse_w16a16"` argument:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/llama2.c-stories110M-pruned2.4",
    sparsity="semi_structured_sparse_w16a16",
)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
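For intuition, 2:4 ("two of four") semi-structured sparsity means every contiguous group of four weights contains at most two nonzeros, the pattern that sparse tensor cores can accelerate. A minimal illustration (the `is_2_of_4_sparse` helper is hypothetical, written here only to show the pattern):

```python
import numpy as np

def is_2_of_4_sparse(weights: np.ndarray) -> bool:
    """Check the 2:4 pattern: at most 2 nonzeros per group of 4 weights."""
    groups = weights.reshape(-1, 4)  # requires len(weights) % 4 == 0
    return bool((np.count_nonzero(groups, axis=1) <= 2).all())

w = np.array([0.5, 0.0, 0.0, -1.2,   # 2 nonzeros in this group of four
              0.0, 0.3, 0.0, 0.7])   # 2 nonzeros in this group of four
print(is_2_of_4_sparse(w))  # True
```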
## What's Changed
- Sparsity by @robertgshaw2-neuralmagic in #1
- Sparse fused gemm integration by @LucasWilkinson in #12
- Abf149/fix semi structured sparse by @afeldman-nm in #16
- Enable bfloat16 for sparse_w16a16 by @mgoin in #18
- seed workflow by @andy-neuma in #19
- Add bias support for sparse layers by @mgoin in #25
- Use naive decompress for SM<8.0 by @mgoin in #32
- Varun/benchmark workflow by @varun-sundar-rabindranath in #28
- initial GHA workflows for "build test" and "remote push" by @andy-neuma in #27
- Only import magic_wand if sparsity is enabled by @mgoin in #37
- Sparsity fix by @robertgshaw2-neuralmagic in #40
- Add NM benchmarking scripts & utils by @varun-sundar-rabindranath in #14
- Rs/marlin downstream v0.3.2 by @robertgshaw2-neuralmagic in #43
- Update README.md by @mgoin in #47
- additional updates to "bump-to-v0.3.2" by @andy-neuma in #39
- Add empty tensor initialization to LazyCompressedParameter by @alexm-nm in #53
- Update arg_utils.py with `semi_structured_sparse_w16a16` by @mgoin in #45
- additions for bump to v0.3.2 by @andy-neuma in #50
- formatting patch by @andy-neuma in #54
- Rs/bump main to v0.3.2 by @robertgshaw2-neuralmagic in #38
- Update setup.py naming by @mgoin in #44
- Loudly reject compression when the tensor isn't sparse enough by @mgoin in #55
- Benchmarking : Fix server response aggregation by @varun-sundar-rabindranath in #51
- initial whl workflow by @andy-neuma in #57
- GHA Benchmark : Automatic benchmarking on manual trigger by @varun-sundar-rabindranath in #46
- delete NOTICE.txt by @andy-neuma in #63
- pin GPU and use "--forked" for some tests by @andy-neuma in #58
- obsfucate pypi server ip by @andy-neuma in #64
- add HF cache by @andy-neuma in #65
- Rs/sparse integration test clean 2 by @robertgshaw2-neuralmagic in #67
- neuralmagic-vllm -> nm-vllm by @mgoin in #69
- Mark files that have been modified by Neural Magic by @tlrmchlsmth in #70
- Benchmarking - Add tensor_parallel_size arg for multi-gpu benchmarking by @varun-sundar-rabindranath in #66
- Jfinks license by @jeanniefinks in #72
- Add Nightly benchmark workflow by @varun-sundar-rabindranath in #62
- Rs/licensing by @robertgshaw2-neuralmagic in #68
- Rs/model integration tests logprobs by @robertgshaw2-neuralmagic in #71
- fixes issue identified by derek by @robertgshaw2-neuralmagic in #83
- Add `nm-vllm[sparse]` + `nm-vllm[sparsity]` extras, move version to `0.1` by @mgoin in #76
- Update setup.py by @mgoin in #82
- Fixes the multi-gpu tests by @robertgshaw2-neuralmagic in #79
- various updates to "build whl" workflow by @andy-neuma in #59
- Change magic_wand to nm-magic-wand by @mgoin in #86
## New Contributors
- @LucasWilkinson made their first contribution in #12
- @alexm-nm made their first contribution in #53
- @tlrmchlsmth made their first contribution in #70
- @jeanniefinks made their first contribution in #72
**Full Changelog**: https://github.com/neuralmagic/nm-vllm/commits/0.1.0