Fix: token embeddings inconsistency #3275
Open
stephantul wants to merge 8 commits into huggingface:main from
Conversation
Contributor
Author
Oh, and one issue is that this code might not be backwards compatible, because people might rely on the current behavior.
Member
The backwards incompatibility here makes me a bit hesitant. I don't think I'll include this in v4; it's a bit too short notice for me to consider all of the consequences.
Contributor
Author
@tomaarsen got it! Let me know if you'd like to revisit it at some point. Feel free to close it in the meantime.
Hello!
Currently, the token embeddings returned are different based on whether you pass `None` or `"token_embeddings"` to `output_value` in `encode`. Current master behavior:
This is because in the case of `None`, the tokens are not truncated by removing padding tokens. While this discrepancy is not necessarily harmful, it is a bit surprising, and can lead to subtle bugs. For example, it is fine to take the mean when `output_value == "token_embeddings"`, but not when it is `None`, because otherwise you'd include padding in the mean. I think the embeddings should be truncated in both cases.
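To make the mean-pooling pitfall concrete, here is a small NumPy sketch with made-up embedding values and mask (not the actual library code):

```python
import numpy as np

# Made-up token embeddings for one sequence: 3 real tokens plus 2 padding rows.
token_embeddings = np.array([
    [1.0, 1.0],
    [2.0, 2.0],
    [3.0, 3.0],
    [0.0, 0.0],  # padding
    [0.0, 0.0],  # padding
])
attention_mask = np.array([1, 1, 1, 0, 0])

# Mean over the truncated tokens: (1 + 2 + 3) / 3 = 2.0 per dimension.
truncated_mean = token_embeddings[: attention_mask.sum()].mean(axis=0)

# Mean over all rows, padding included: (1 + 2 + 3 + 0 + 0) / 5 = 1.2.
untruncated_mean = token_embeddings.mean(axis=0)

print(truncated_mean)    # [2. 2.]
print(untruncated_mean)  # [1.2 1.2]
```

The padding rows silently drag the mean toward zero, which is exactly the kind of bug that is hard to spot downstream.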
To tackle this, I:

- Added a function for removing padding to `util.py`.
- Only call it when `output_value` is `None` or `"token_embeddings"`. This is to keep it off the hot path when people just want sentence embeddings.

The new function for removing padding is also 50x faster than the old one, and should be equivalent. Note that the functions are not equivalent in the case of non-contiguous attention masks. If the attention mask can look like this:
(so, a group of zeros, followed by ones, and then zeros again), the new method would grab the first 0 (index 3), while the old method would grab the third-to-last one (index 9). But I am not aware of cases where attention masks can look like this.
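For illustration, here is a hedged NumPy sketch of the two strategies; the function names and the example mask are hypothetical, not the actual `util.py` implementation:

```python
import numpy as np

def truncate_old(embeddings: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Old-style approach: keep every position whose mask value is 1
    # (boolean indexing, which copies the selected rows).
    return embeddings[mask.astype(bool)]

def truncate_new(embeddings: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # New-style approach: cut at the first 0 in the mask, assuming padding
    # is a contiguous suffix. Slicing is cheap because it returns a view.
    zeros = np.flatnonzero(mask == 0)
    end = zeros[0] if zeros.size else len(mask)
    return embeddings[:end]

emb = np.arange(12, dtype=float).reshape(6, 2)

# Contiguous mask (ones, then trailing zeros): both approaches agree.
contiguous = np.array([1, 1, 1, 1, 0, 0])
assert np.array_equal(truncate_old(emb, contiguous), truncate_new(emb, contiguous))

# Hypothetical non-contiguous mask: the approaches diverge.
gappy = np.array([1, 1, 0, 0, 1, 1])
print(truncate_old(emb, gappy).shape)  # (4, 2): all four attended rows
print(truncate_new(emb, gappy).shape)  # (2, 2): only the rows before the first 0
```

For the contiguous masks produced by standard right-padded tokenization, the two are equivalent; the divergence only appears for masks with interior zeros.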
Let me know what you think.