Questions about the details of LLM.int8 #1400
LLM.int8 quantizes all of the weights to int8 precision. When the activations (input features) are also quantized to int8, outlier channels are held back in fp16. Instead of requiring a copy of the original weights, the weight rows corresponding to the activation outliers are dequantized and computed in fp16, while the rest of the computation happens in int8. In the decomposition phase, the input is split into its outlier columns, which take the fp16 path, and the remaining columns, which take the int8 path.
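For what it's worth, here is a rough PyTorch sketch of that decomposition. The function name, the 6.0 threshold, and the tensor layouts are illustrative assumptions, not the actual bitsandbytes kernels (which run the two paths in fp16 and int8; the matmuls below are emulated in fp32 for portability):

```python
import torch

def llm_int8_matmul_sketch(X, W_int8, w_scale, threshold=6.0):
    """X: (tokens, in_features) activations.
    W_int8: (in_features, out_features) weights quantized per output column (absmax).
    w_scale: (out_features,) scales such that W ≈ W_int8 * w_scale."""
    # 1. Outlier detection: input dimensions where any activation magnitude
    #    crosses the threshold are kept in higher precision.
    outlier_cols = X.abs().amax(dim=0) >= threshold

    # 2. Outlier path: dequantize only the matching weight rows and multiply in
    #    floating point (fp16 in the real kernels; fp32 here for portability).
    W_outlier = W_int8[outlier_cols].float() * w_scale
    out_outlier = X[:, outlier_cols].float() @ W_outlier

    # 3. Regular path: row-wise absmax quantization of the remaining activations,
    #    integer matmul (emulated in fp32 here), then dequantization of the result.
    X_rest = X[:, ~outlier_cols].float()
    x_scale = X_rest.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    X_int8 = torch.round(X_rest / x_scale).to(torch.int8)
    out_int8 = (X_int8.float() @ W_int8[~outlier_cols].float()) * x_scale * w_scale

    # 4. The two partial results are summed to give the layer output.
    return out_outlier + out_int8

# Example usage with synthetic tensors and one injected outlier dimension.
W = torch.randn(1024, 4096)
w_scale = W.abs().amax(dim=0) / 127.0
W_int8 = torch.round(W / w_scale).to(torch.int8)
X = torch.randn(8, 1024)
X[:, 3] *= 20.0
out = llm_int8_matmul_sketch(X, W_int8, w_scale)
```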
So my understanding is: LLM.int8 directly quantizes the weights W into int8. During the forward pass, it identifies the dimensions of the input X that contain outliers and decomposes the input accordingly. The corresponding part of the weights is dequantized back to fp16, and the subsequent calculations are performed.
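To make that concrete, a small sketch (with hypothetical helper names, not the bitsandbytes internals) of how the weights can be quantized once up front, with only the rows that line up with outlier input dimensions dequantized again at forward time:

```python
import torch

def quantize_weight_absmax(W):
    """Per-output-column absmax quantization: W ≈ W_int8 * scale."""
    scale = W.abs().amax(dim=0).clamp(min=1e-8) / 127.0  # one scale per output column
    W_int8 = torch.round(W / scale).to(torch.int8)
    return W_int8, scale

def dequantize_outlier_rows(W_int8, scale, outlier_mask):
    """Recover floating-point weights only for the input dimensions (rows)
    that carried activation outliers; everything else stays int8."""
    return W_int8[outlier_mask].float() * scale
```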
@bg51717 That's correct!
@bg51717 Does that answer your questions fully? Please close the issue if yes. Thanks 🤗
I'm curious about one thing: LLM.int8 seems to require the input X to determine which weights need to retain fp16 precision and which can be quantized to int8, yet models can be quantized directly by bitsandbytes without any input information. Is it possible that all models have their emergent features in the same location?
Thanks for your reply!
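For context, this is the kind of load-time, weight-only quantization the question refers to (no calibration inputs are involved; the threshold is applied to the activations at runtime). A hedged example; the model id is illustrative and the argument names reflect the transformers/bitsandbytes integration as I understand it:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,        # weights are quantized to int8 when the model is loaded
    llm_int8_threshold=6.0,   # outlier threshold applied to activations during each forward pass
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",      # any causal LM checkpoint; chosen here only as an example
    quantization_config=bnb_config,
    device_map="auto",
)
```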