Ensure encode() handles Series and other sequence types #3372
jteijema wants to merge 1 commit into huggingface:main
Conversation
Hello! I appreciate you opening this PR, but I'm wary of accepting it. I believe this form of blanket conversion to a list is very risky, as getting an error (e.g. from not having a list) is often helpful to keep people from accidentally passing incorrect data. In this case, with a pd.Series, it's simply a matter of the user converting it to a list.

For example, with this addition, users might pass e.g. this:

    embedding = model.encode({
        "query1": "What is the capital of France?",
        "query2": "Where is the Eiffel Tower?",
        "document1": "Paris is the capital of France.",
    })
    print(embedding)
    print(embedding.shape)

and it'll return totally reasonable-seeming embeddings. But the similarities are totally off, because we're actually embedding the dictionary keys:

    similarity = model.similarity(embedding[:2], embedding[-1])
    print(similarity)
    # tensor([[0.4347],
    #         [0.3837]])

instead of the actual texts, which would give:

    # tensor([[0.8561],
    #         [0.3081]])

In other words, I would rather not do this.
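The risk described above can be reproduced without the library at all. The following is a minimal sketch, assuming a blanket `list(...)` conversion; the helper name `to_list_blindly` is hypothetical and is not part of sentence-transformers.

```python
def to_list_blindly(inputs):
    """Hypothetical helper mimicking a blanket conversion of any input to a list."""
    return list(inputs)


queries = {
    "query1": "What is the capital of France?",
    "query2": "Where is the Eiffel Tower?",
    "document1": "Paris is the capital of France.",
}

# Iterating over a dict yields its keys, so the actual texts are silently dropped.
converted = to_list_blindly(queries)
print(converted)  # ['query1', 'query2', 'document1']
```

No error is raised, so nothing tells the user that the keys, and not the sentences, would be passed on for embedding.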
Hi @tomaarsen, thank you for the well-put response. I fully agree with your assessment. How would you feel about making the error returned in #3371 more informative? This is the issue I was facing.
Hello! Apologies for the delay! What kind of checking would you propose? Checking if the input is a pandas Series would require adding extra imports that I'd like to avoid, and checking that the inputs are not e.g. a list, tuple, etc. also feels risky because there might be users passing custom objects that do turn into valid inputs when converting to a list.
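The concern about custom objects can be sketched as follows. This is only an illustration under assumptions: `SentenceBatch` and `is_plain_sequence` are hypothetical names, not part of the library, and the strict check shown is just one way such validation might be written.

```python
class SentenceBatch:
    """Hypothetical user-defined container; not a list, tuple, or pandas type."""

    def __init__(self, sentences):
        self._sentences = tuple(sentences)

    def __iter__(self):
        return iter(self._sentences)

    def __len__(self):
        return len(self._sentences)


def is_plain_sequence(obj):
    """Hypothetical strict check that only accepts list or tuple inputs."""
    return isinstance(obj, (list, tuple))


batch = SentenceBatch([
    "Paris is the capital of France.",
    "Where is the Eiffel Tower?",
])

# The strict check rejects the custom container...
print(is_plain_sequence(batch))  # False
# ...even though converting it to a list yields perfectly valid sentences.
print(list(batch))
```

A check like this would turn inputs that convert cleanly today into errors, which is the risk raised above.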
Thank you for the response, Tom. I can see how type checking is not the solution. An even less restrictive option would be to catch the KeyError and add a little more information to the error, perhaps explaining that the provided object has a non-consecutive index? This would follow what the function does, as that is the reason for the error, and would be a little more informative than the current message.

Thank you for your response. I love the package and appreciate your work. If your preference is to not make any changes, then I fully understand.
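A rough sketch of the idea of catching the KeyError and re-raising it with more context might look like this. Everything here is an assumption for illustration: `LabelIndexed` is a hypothetical stand-in for a label-indexed object such as a filtered pandas Series, and `get_items_by_position` is a hypothetical loop, not the actual code inside `encode()`.

```python
class LabelIndexed:
    """Hypothetical stand-in for an object indexed by label rather than position."""

    def __init__(self, data):
        self._data = dict(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, label):
        return self._data[label]  # raises KeyError when the label is missing


def get_items_by_position(inputs):
    """Hypothetical positional access that re-raises KeyError with more context."""
    try:
        return [inputs[i] for i in range(len(inputs))]
    except KeyError as exc:
        raise KeyError(
            f"Could not access position {exc.args[0]!r} of the input of type "
            f"{type(inputs).__name__}. The input may have a non-consecutive index; "
            "consider converting it to a list first."
        ) from exc


# Labels 2 and 5 remain after a hypothetical filtering step, so position 0 is missing.
filtered = LabelIndexed({
    2: "Paris is the capital of France.",
    5: "Where is the Eiffel Tower?",
})

message = None
try:
    get_items_by_position(filtered)
except KeyError as err:
    message = err.args[0]
    print(message)
```

This keeps the input untouched and performs no type checks; it only changes what the user reads when the lookup fails.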
This converts any sequence (including pandas Series, if present) to a list. See #3371