Embedding4bit and Embedding8bit implementation #1292
Conversation
Force-pushed from ef087bc to 67d546e
Bump. P.S. The part about shared embeddings can be discussed later in another PR or issue.
Force-pushed from 67d546e to 35fd05c
Hi @galqiwi! Thank you for the PR! I think this would be a very useful addition and will review this week. I agree that the shared embeddings can be deferred to a follow-up discussion/PR.
Force-pushed from 35fd05c to 811aa6c
Thanks @galqiwi! Overall, this looks great! I just left a few minor nits, but otherwise happy to merge!
Thank you for reviewing my PR, @matthewdouglas! I've fixed all the typos you found.
Thanks @galqiwi! This is a great contribution, and the unit tests here are really appreciated!
Hi again! Are you planning on publishing a new release of bnb?
Embedding4bit and Embedding8bit implementation (#1292)

* Embedding4bit and Embedding8bit implementation
* lint
* Update bitsandbytes/nn/modules.py
  Co-authored-by: Matthew Douglas <[email protected]>
* Update bitsandbytes/nn/modules.py
  Co-authored-by: Matthew Douglas <[email protected]>
* Update bitsandbytes/nn/modules.py
  Co-authored-by: Matthew Douglas <[email protected]>
* saving -> Saving

---------

Co-authored-by: Matthew Douglas <[email protected]>
Hi! I've been researching LLM quantization and found a bottleneck that I think this PR can fix.
When using extreme 1-bit and 2-bit LLM quantization (which has seen many improvements recently [1, 2, 3, 4, 5]), uncompressed embeddings can start to take up too much space (in some cases more than 50% of the model size).
I've documented this bottleneck in a huggingface/transformers issue, and it looks like the bitsandbytes library is a good place to start addressing it.
In this PR, I implement embedding modules for the 4-bit and 8-bit quantization schemes from this library. Currently, they only support the `_load_from_state_dict` API and can't be saved, but I think they can still be useful. After that, I plan to integrate this functionality into the transformers library by extending the `HfQuantizer` functionality.
What do you think?
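For concreteness, here is a minimal usage sketch of what loading into these modules could look like. This is not code from the PR: it assumes the modules are exposed as `bnb.nn.Embedding4bit` / `bnb.nn.Embedding8bit`, mirror `torch.nn.Embedding`'s constructor, accept weights via the standard `load_state_dict` path (which routes through `_load_from_state_dict`), and quantize on the move to GPU, as the 4-bit/8-bit linear layers do; check `bitsandbytes/nn/modules.py` for the actual signatures.

```python
# Hypothetical usage sketch (assumptions noted above, not taken from the PR).
import torch
import torch.nn as nn
import bitsandbytes as bnb

vocab_size, hidden_dim = 32_000, 4_096

# A regular fp16 embedding whose weights we want to reuse.
fp16_emb = nn.Embedding(vocab_size, hidden_dim).half()

# Build the quantized replacement and feed it the original weights through the
# standard state_dict machinery (load_state_dict calls _load_from_state_dict).
q_emb = bnb.nn.Embedding4bit(vocab_size, hidden_dim)  # or bnb.nn.Embedding8bit(...)
q_emb.load_state_dict(fp16_emb.state_dict())
q_emb = q_emb.to("cuda")  # assumed to quantize the weights on device transfer

token_ids = torch.randint(0, vocab_size, (1, 16), device="cuda")
hidden = q_emb(token_ids)  # dequantized lookup, shape (1, 16, hidden_dim)
```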
There is also one thing I want to implement before going back to the transformers library: support for shared weights in 8-bit quantization.
While the 4-bit quantized linear layer does not seem to change its `self.weight` parameter during the forward pass, the 8-bit quantized linear layer changes it dramatically in the `init_8bit_state` method. So, while a 4-bit embedding and a 4-bit linear layer can share the same `Params4bit` parameter, the same is not true for 8-bit. I think that this patch should fix the problem, but this part of the code is very tightly coupled to everything around it, and I need your advice: do you think it could break something important that I don't see?
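To make the sharing concern concrete, here is a rough sketch of the kind of weight tying meant above. It is illustrative only: the tying-by-assignment pattern is an assumption, and the claim about the 8-bit path restates the description above rather than documenting the library's behavior.

```python
# Illustrative weight-tying sketch (hypothetical usage, not code from this PR).
# With the 4-bit modules, an input embedding and an output head could in
# principle reference the same Params4bit object, since Linear4bit's forward
# pass reads self.weight without rewriting it. With 8-bit this would break:
# per the description above, Linear8bitLt rewrites its weight state inside
# init_8bit_state, so a tied embedding would no longer see a usable weight.
import bitsandbytes as bnb

vocab_size, hidden_dim = 32_000, 4_096

emb = bnb.nn.Embedding4bit(vocab_size, hidden_dim)
lm_head = bnb.nn.Linear4bit(hidden_dim, vocab_size, bias=False)

# Hypothetical tying: point both modules at the same quantized parameter.
# Shapes line up because both store a (vocab_size, hidden_dim) weight.
lm_head.weight = emb.weight
assert lm_head.weight is emb.weight
```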