Conversation
Can you add an error message for the cases where `use_safetensors=False` and we detect a compressor?
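A minimal sketch of the requested check. The flag name `use_safetensors` comes from the comment above; `has_compression_config` is a hypothetical stand-in for however the compressor is actually detected, not a real vLLM helper:

```python
# Hypothetical sketch: reject non-safetensors loading when a compressed
# model is detected. `has_compression_config` stands in for the real
# detection logic.
def validate_load_format(use_safetensors: bool,
                         has_compression_config: bool) -> None:
    if has_compression_config and not use_safetensors:
        raise ValueError(
            "Detected a compressed-tensors model, which can only be loaded "
            "from safetensors. Please set use_safetensors=True.")

# a compressed model loaded without safetensors should error out
try:
    validate_load_format(use_safetensors=False, has_compression_config=True)
except ValueError as exc:
    print(f"raised: {exc}")
```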
Really nice work, @dbogunowicz. One additional thing that needs to be done: right now, the user has to specify manually that it should use the sparse kernels:

```python
from vllm import LLM

# loads as sparse
model = LLM("/path/to/sparse/model", sparsity="sparse_w16a16")

# loads as dense
model = LLM("/path/to/sparse/model")
```

Ideally, we should automatically detect whether the model is sparse based on the config and, if so, load it as sparse:

```python
from vllm import LLM

# loads as sparse
model = LLM("/path/to/sparse/model")
```

This is how things work for quantization. I left a placeholder for this logic here when I originally integrated the sparse kernels. Can you add this?
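A rough sketch of what the automatic detection could look like. The keys `"sparsity_config"` and `"sparsity_structure"`, and the helper name `infer_sparsity`, are assumptions about the config layout, mirroring how quantization is auto-detected:

```python
from typing import Optional

# Hypothetical sketch: infer the `sparsity` argument from the model's
# config instead of requiring the user to pass it explicitly.
def infer_sparsity(hf_config: dict) -> Optional[str]:
    sparsity_config = hf_config.get("sparsity_config")
    if sparsity_config is None:
        return None  # no sparsity metadata: load as dense
    if sparsity_config.get("sparsity_structure") == "2:4":
        return "semi_structured_sparse_w16a16"
    # any other structure falls back to the unstructured sparse kernel
    return "sparse_w16a16"

# loads as sparse without an explicit sparsity= argument
print(infer_sparsity({"sparsity_config": {"sparsity_structure": "2:4"}}))
# -> semi_structured_sparse_w16a16
```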
Finally, please add some end-to-end testing which loads the compressed model and runs inference. I would suggest the following format:
```python
# pair of same models with compressed and ordinary safetensors
MODELS = [
    ("neuralmagic/llama2.c-stories110M-pruned50",  # uncompressed
     "dtransposed/llama2.c-stories110M-pruned50-compressed-tensors"),  # compressed
]
```
Will move this model to the neuralmagic repo before landing.
vllm/config.py (Outdated)

```python
                "inferred from the config: "
                f"{sparsity_structure} with: {self.sparsity}")
        self.sparsity = self.sparsity or sparsity_structure
        if self.sparsity not in supported_sparsity and self.sparsity is not None:  # noqa E501
```
Is the `# noqa E501` necessary? Could the line just get split?
vllm/config.py (Outdated)

```python
                "Sparsity is only supported for float16 and bfloat16 "
                "dtypes. Running the models without sparse kernels.")
```
Could you add the dtype to the message?
vllm/config.py (Outdated)

```python
        logger.warning("The valid sparsity structure cannot be inferred from "
                       "the valid sparsity config. Running the models without "
                       "sparse kernels.")
```
Could you add the `sparsity_config` to the message?
Co-authored-by: Michael Goin <[email protected]>
vllm/config.py (Outdated)

```diff
@@ -238,21 +239,21 @@ def _sparsity_structure_from_config(
         # check for valid dtype
         if dtype not in supported_sparsity_dtypes:
             logger.warning(
-                "Sparsity is only supported for float16 and bfloat16 "
+                f"Sparsity is only supported for {supported_sparsity_dtypes}"
                 "dtypes. Running the models without sparse kernels.")
```
I actually meant the current dtype, but supported dtypes are good too!

```diff
-                "dtypes. Running the models without sparse kernels.")
+                f"dtypes, not {dtype}. Running the models without sparse kernels.")
```
```diff
@@ -20,7 +20,7 @@
 @pytest.mark.parametrize("model_pair", MODELS)
-@pytest.mark.parametrize("dtype", ["float16", "bfloat16"])
+@pytest.mark.parametrize("dtype", ["float16"])
```
Why remove bfloat16 here?
vllm/config.py (Outdated)

```diff
-                "the valid sparsity config. Running the models without "
-                "sparse kernels.")
+                "the valid sparsity config:\n{sparsity_config}"
+                "\n Running the models without sparse kernels.")
```
```diff
-                "\n Running the models without sparse kernels.")
+                "\nRunning the models without sparse kernels.")
```
```diff
@@ -4,7 +4,7 @@
 from vllm import CompletionOutput, LLMEngine, SamplingParams

-MODEL = "meta-llama/llama-2-7b-hf"
+MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```
This is an upstream test AFAIK, so we should avoid landing changes here
```python
        raise ValueError(
            f"Unknown sparsity_structure: {self.sparsity}. Must "
            f"be one of {supported_sparsity}. Running the models "
            "without sparse kernels.")
```
This says "Running the models without sparse kernels." but is raising an error - I think this should just warn and continue with unstructured sparse kernels?
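A sketch of the suggested behavior: warn and fall back to the unstructured kernel instead of raising. The helper name `resolve_sparsity` is hypothetical; the kernel names come from this PR:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical sketch: instead of raising ValueError, warn and continue
# with the unstructured sparse kernel as a safe default.
def resolve_sparsity(sparsity: str, supported_sparsity: set) -> str:
    if sparsity not in supported_sparsity:
        logger.warning(
            "Unknown sparsity_structure: %s. Must be one of %s. "
            "Running the models with unstructured sparse kernels.",
            sparsity, supported_sparsity)
        return "sparse_w16a16"
    return sparsity
```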
vllm/config.py (Outdated)

```python
        # choose the sparsity structure based on the sparsity config
        if sparsity_config["sparsity_structure"] in {"unstructured", "0:0"}:
            return SparsityStructures['sparse_w16a16'].name

        elif sparsity_config["sparsity_structure"] == "2:4":
            return SparsityStructures['semi_structured_sparse_w16a16'].name

        # if the sparsity config is not recognized, return None
        logger.warning("The valid sparsity structure cannot be inferred from "
                       "the valid sparsity config:\n{sparsity_config}"
                       "\nRunning the models without sparse kernels.")
        return None
```
Why not just make the exception for 2:4 and use unstructured kernels for all other cases?
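The question above amounts to a simpler mapping, sketched here: special-case only `"2:4"` and use the unstructured kernel for every other structure, so the function never returns `None`. The dict layout of `sparsity_config` is an assumption carried over from the snippet:

```python
# Sketch of the reviewer's suggestion: "2:4" gets the semi-structured
# kernel; unstructured, "0:0", and anything unrecognized all fall back
# to the unstructured sparse kernel.
def sparsity_structure_from_config(sparsity_config: dict) -> str:
    if sparsity_config.get("sparsity_structure") == "2:4":
        return "semi_structured_sparse_w16a16"
    return "sparse_w16a16"
```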
```diff
-_SPARSITY_CONFIG_REGISTRY = {
-    "sparse_w16a16": SparseW16A16Config,
-    "semi_structured_sparse_w16a16": SemiStructuredSparseW16A16Config,
-}
+# UPSTREAM SYNC: where we keep the sparsity configs
+sparsity_structure_meta = namedtuple('SparsityStructure', ['name', 'config'])
+
+SparsityStructures = dict(
+    sparse_w16a16=sparsity_structure_meta("sparse_w16a16", SparseW16A16Config),
+    semi_structured_sparse_w16a16=sparsity_structure_meta(
+        "semi_structured_sparse_w16a16", SemiStructuredSparseW16A16Config),
+)
```
We want to keep the "registry" structure to match up with quantization; see the quantization/`__init__.py` file:

```python
QUANTIZATION_METHODS = {
    "aqlm": AQLMConfig,
    "awq": AWQConfig,
    "fp8": Fp8Config,
    "gptq": GPTQConfig,
    "squeezellm": SqueezeLLMConfig,
    "marlin": MarlinConfig,
}
```

So could we keep it as `SPARSITY_METHODS` and a raw dict?
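By analogy, the requested registry would look something like this sketch. The config classes below are stubs standing in for the real vLLM classes (`SparseW16A16Config` and `SemiStructuredSparseW16A16Config` from this PR):

```python
# stub config classes; in vLLM these come from the sparsity module
class SparseW16A16Config:
    pass

class SemiStructuredSparseW16A16Config:
    pass

# a raw dict registry, mirroring QUANTIZATION_METHODS
SPARSITY_METHODS = {
    "sparse_w16a16": SparseW16A16Config,
    "semi_structured_sparse_w16a16": SemiStructuredSparseW16A16Config,
}

# lookup by name, as done for quantization configs
config_cls = SPARSITY_METHODS["sparse_w16a16"]
```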
The goal of this PR is to support weight loading from the compressed safetensor representation. The compressed safetensor representation was introduced by Neural Magic and implemented by @Satrat.