[Usage] How to do KV cache quantization? #111
I followed the Activation quantization to fp8 example and got an FP8 quantized model. I also want to run the model with an FP8 E4M3 KV cache. So my question is: how do I set the kv_cache_scheme when I quantize the model?

Comments
@CharlesRiggins I am working on this right now and will share the PR for the example later today. For a quick example, it is just a new entry inside of QuantizationModifier, so you can add it to a recipe like this:

recipe = """
quant_stage:
quant_modifiers:
QuantizationModifier:
ignore: ["lm_head"]
config_groups:
group_0:
weights:
num_bits: 8
type: float
strategy: tensor
dynamic: false
symmetric: true
input_activations:
num_bits: 8
type: float
strategy: tensor
dynamic: false
symmetric: true
targets: ["Linear"]
kv_cache_scheme:
num_bits: 8
type: float
strategy: tensor
dynamic: false
symmetric: true
""" |
Take a look here: #113
Great. I tried the example and it worked. Thank you!
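
(As a side note on the second half of the original question, running the quantized model with an FP8 E4M3 KV cache: a minimal sketch of loading the resulting checkpoint in vLLM, assuming the output directory produced by the step above; the path and prompt are placeholders.)

# Minimal sketch: serve the FP8-quantized checkpoint with an FP8 KV cache
# in vLLM. The model path and prompt are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TinyLlama-1.1B-Chat-v1.0-FP8-KV",  # directory written by oneshot above
    kv_cache_dtype="fp8",                     # FP8 (E4M3 on NVIDIA GPUs) KV cache
)

outputs = llm.generate(
    ["What does KV cache quantization do?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)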