[Bugfix] Fix the failing gte embedding test #18720
Conversation
Signed-off-by: Isotr0py <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs will not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Hmm, perhaps we can try setting
Hmmm, we can set fp32 for
Signed-off-by: Isotr0py <[email protected]>
Got it, see if this works then
Hmmm, not sure why the test failed on CI, while it passed locally. Let me try to reproduce the failure on another machine:
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Hmmm, seems that the problematic outputs are from
And the failing test outputs from https://buildkite.com/vllm/fastcheck/builds/25254#01971122-90b4-40a0-9e77-86261080bb5b/212-1864:
Perhaps there is something causing the
Sorry, I don't have enough GPU memory to run the float32 test, but float16 only scores 0.44361902720248764 on mteb/sts12, which looks the same as random to me. 100M-parameter models can achieve 0.70+. Do we really need to test this model in CI?
You need to use mean pooling for this model.
https://github.com/noooop/snippet/blob/main/benchmarks/test_mteb/test_mteb.py
The result of SentenceTransformer is 0.44361902720248764. Wait a moment, is it a bug with SentenceTransformer or Transformers? I will try downgrading.
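For context, a minimal sketch of how an STS12 main score like the one quoted above can be obtained with mteb (the API calls are standard mteb/Sentence Transformers usage, not taken from the linked script, and the model/dtype are just the ones discussed above):

```python
# Rough sketch: score a Sentence Transformers model on STS12 with mteb.
import mteb
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ssmits/Qwen2-7B-Instruct-embed-base",
                            model_kwargs={"torch_dtype": torch.float16})
tasks = mteb.get_tasks(tasks=["STS12"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model)
print(results)  # the main score reported above was ~0.4436
```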
Based on the note in https://docs.vllm.ai/en/latest/models/supported_models.html#text-embedding, it looks like the sentence-transformers config is incorrect as well. You should set mean pooling explicitly for both vLLM and Sentence Transformers.
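For illustration, a rough sketch of setting mean pooling explicitly on both sides (the vLLM `override_pooler_config` usage mirrors the test code later in this thread; the Sentence Transformers module setup is the standard way to force a pooling mode and is not taken from this PR):

```python
# Force MEAN pooling in vLLM via the pooler override.
from vllm import LLM
from vllm.config import PoolerConfig

llm = LLM(model="ssmits/Qwen2-7B-Instruct-embed-base",
          task="embed",
          override_pooler_config=PoolerConfig(pooling_type="MEAN"))

# Force mean pooling in Sentence Transformers by building the modules explicitly,
# instead of relying on the pooling config shipped with the repo.
from sentence_transformers import SentenceTransformer, models

word = models.Transformer("ssmits/Qwen2-7B-Instruct-embed-base")
pooling = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
st_model = SentenceTransformer(modules=[word, pooling])
```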
Hmmm, but I remember Sentence-Transformers initializes mean pooling by default for models missing a pooler config: https://github.com/UKPLab/sentence-transformers/blob/dd76c033ac2161a2958fee2e18fd0227a81ee921/sentence_transformers/SentenceTransformer.py#L1508-L1530
+1. There is the following output: "No sentence-transformers model found with name ssmits/Qwen2-7B-Instruct-embed-base. Creating a new one with mean pooling."
but "pooling_mode_lasttoken": true, in 1_Pooling/1_Pooling_config.json https://huggingface.co/ssmits/Qwen2-7B-Instruct-embed-base/blob/main/1_Pooling/1_Pooling_config.json |
On the mteb/sts12 dataset:
- Sentence-Transformers 4.1.0, float16: 0.44361902720248764 <- I think it uses MEAN pooling.
- vLLM default, float16, ssmits/Qwen2-7B-Instruct-embed-base: 0.44367362158651924 (difference 4.433834209841097e-05)

I think the vLLM code defaults to using MEAN pooling. Try adding ssmits/Qwen2-7B-Instruct-embed-base to tests/models/language/pooling/test_gte.py and see if it passes the tests; a sketch of that addition follows.
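A rough sketch of what that addition could look like, assuming test_gte.py parametrizes models with pytest.param entries like the hunk quoted later in this thread (the list name and surrounding entries are illustrative):

```python
# Hypothetical addition to the model list in
# tests/models/language/pooling/test_gte.py (illustrative only).
MODELS = [
    # ... existing entries ...
    pytest.param("ssmits/Qwen2-7B-Instruct-embed-base"),
]
```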
IIRC, ssmits/Qwen2-7B-Instruct-embed-base is just converted from Qwen2-7B-Instruct by removing
Signed-off-by: Isotr0py <[email protected]>
Using vllm_runner will result in NaN, but from vllm import LLM will not.
Hmmm, I tried to use:

```python
# SPDX-License-Identifier: Apache-2.0
import gc
from collections.abc import Sequence

import mteb
import numpy as np
import pytest
import torch

from vllm.config import PoolerConfig

from .mteb_utils import (MTEB_EMBED_TASKS, MTEB_EMBED_TOL, VllmMtebEncoder,
                         run_mteb_embed_task)

model = "Qwen/Qwen2.5-0.5B-Instruct"


@pytest.mark.parametrize("dtype", ["float32"])
def test_embed_models_mteb(hf_runner, vllm_runner, dtype: str) -> None:
    vllm_extra_kwargs = {}
    vllm_extra_kwargs["override_pooler_config"] = \
        PoolerConfig(pooling_type="MEAN", normalize=False)

    with vllm_runner(model, task="embed", max_model_len=None,
                     dtype=dtype, **vllm_extra_kwargs) as vllm_model:
        vllm_main_score = run_mteb_embed_task(VllmMtebEncoder(vllm_model),
                                              MTEB_EMBED_TASKS)
    # ValueError: Input contains NaN.
    """
    X = array(
        [[nan, nan, nan, ..., nan, nan, nan],
         [nan, nan, nan, ..., nan, nan, nan],
         [nan, nan, nan, ..., nan, nan, nan],
         ...,
         [-0.00183868, 0.00259018, -0.00291634, ..., -0.00311279, 0.00650024, -0.01535797],
         [0.00305748, 0.00748825, -0.01096344, ..., 0.00295448, 0.00697327, -0.00764465],
         [nan, nan, nan, ..., nan, nan, nan]], shape=(3108, 3584))
    """

    with hf_runner(model, is_sentence_transformer=True,
                   dtype=dtype) as hf_model:
        st_main_score = run_mteb_embed_task(hf_model, MTEB_EMBED_TASKS)

    print("VLLM:", dtype, vllm_main_score)
    print("SentenceTransformer:", dtype, st_main_score)
    print("Difference:", st_main_score - vllm_main_score)

    assert st_main_score == pytest.approx(vllm_main_score, abs=MTEB_EMBED_TOL)


class VllmEncoder(mteb.Encoder):

    def __init__(self, model, dtype, trust_remote_code: bool = True, **kwargs):
        super().__init__()

        from vllm import LLM
        self.model = LLM(model=model,
                         task="embed",
                         dtype=dtype,
                         trust_remote_code=trust_remote_code,
                         max_num_seqs=4,
                         **kwargs)

    def encode(self, sentences: Sequence[str], **kwargs) -> np.ndarray:
        outputs = self.model.embed(sentences, use_tqdm=False)
        embeds = np.array([o.outputs.embedding for o in outputs])
        return embeds


@pytest.mark.parametrize("dtype", ["float32"])
def test_embed_models_mteb2(hf_runner, vllm_runner, dtype: str) -> None:
    vllm_extra_kwargs = {}
    vllm_extra_kwargs["override_pooler_config"] = \
        PoolerConfig(pooling_type="MEAN", normalize=False)

    vllm_main_score = run_mteb_embed_task(
        VllmEncoder(model, dtype=dtype, **vllm_extra_kwargs),
        MTEB_EMBED_TASKS)

    gc.collect()
    torch.cuda.empty_cache()

    with hf_runner(model, is_sentence_transformer=True,
                   dtype=dtype) as hf_model:
        st_main_score = run_mteb_embed_task(hf_model, MTEB_EMBED_TASKS)

    print("VLLM:", dtype, vllm_main_score)
    print("SentenceTransformer:", dtype, st_main_score)
    print("Difference:", st_main_score - vllm_main_score)

    assert st_main_score == pytest.approx(vllm_main_score, abs=MTEB_EMBED_TOL)
```

However, the test changed to
Signed-off-by: Isotr0py <[email protected]>
Oh, I finally found out what's wrong... Don't know why, but it seems I can reproduce the reversed CI failure (https://buildkite.com/vllm/fastcheck/builds/25436#01971632-0c7e-4a0c-81ec-6d275c543971/213-1928) locally by adding
Signed-off-by: Isotr0py <[email protected]>
I feel that the errors in ssmits/Qwen2-7B-Instruct-embed-base and Alibaba-NLP/gte-Qwen2-1.5B-instruct occur simultaneously and are possibly related. LOL
Signed-off-by: Isotr0py <[email protected]>
```python
# Be careful of the order of models, decoder-only models should be
# placed before encoder-only models, otherwise `Qwen2.5-0.5B-Instruct`
# case won't pass because gte-Qwen2-1.5B-instruct will cache custom
# model code with bidirectional attention.
# [Decoder-only]
pytest.param("BAAI/bge-multilingual-gemma2",
             marks=[pytest.mark.core_model]),
pytest.param("intfloat/e5-mistral-7b-instruct",
             marks=[pytest.mark.core_model, pytest.mark.cpu_model]),
pytest.param("Qwen/Qwen2.5-0.5B-Instruct"),
```
Seems that the test order is the cause: if we test Alibaba-NLP/gte-Qwen2-1.5B-instruct before ssmits/Qwen2-7B-Instruct-embed-base, the test pipeline will fail at ssmits/Qwen2-7B-Instruct-embed-base. And if we test ssmits/Qwen2-7B-Instruct-embed-base before Alibaba-NLP/gte-Qwen2-1.5B-instruct, the test pipeline can pass now.
I only ran the single test before, so I couldn't reproduce it because the custom code wasn't cached. 😅
Maybe it is caused by _cached_get_attn_backend (vllm/attention/selector.py, lines 107 to 108 at commit 3c49dbd):

```python
@cache
def _cached_get_attn_backend(
```
I have also encountered test errors caused by this cache before, and calling cache_clear can temporarily fix it.
If this is the cause, we need to clear the attn_backend cache in vllm_runner to solve it entirely.
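For illustration, a minimal sketch of what clearing that cache between tests could look like (the autouse fixture below is hypothetical and not part of this PR; it only assumes _cached_get_attn_backend stays decorated with functools.cache, which exposes cache_clear()):

```python
# Hypothetical conftest.py fixture: clear the cached attention backend after
# each test so one model's backend choice cannot leak into the next test.
import pytest

from vllm.attention.selector import _cached_get_attn_backend


@pytest.fixture(autouse=True)
def clear_attn_backend_cache():
    yield
    # functools.cache exposes cache_clear() on the decorated function.
    _cached_get_attn_backend.cache_clear()
```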
Seems it's more likely a bug in sentence-transformers or transformers instead of vLLM:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen2.5-0.5B-Instruct")
embed_ref = model.encode(["vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs."])[:, :16]
print(embed_ref)
# array([[ 0.37174752,  1.1417291 ,  2.0464673 ,  1.9220997 ,  0.2626055 ,
#         -0.8017454 , -1.0346682 , -5.9414687 , -3.2826447 ,  1.7354006 ,
#          2.0839477 , -1.9396178 , -3.4990966 , -3.533139  , -1.8913338 ,
#          0.84275854]], dtype=float32)

model1 = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)
model1.encode(["vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs."])[:, :16]

model2 = SentenceTransformer("Qwen/Qwen2.5-0.5B-Instruct")
embed_bug = model2.encode(["vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs."])[:, :16]
print(embed_bug)
# array([[ 1.5029449e+00,  2.7341652e-01,  1.6232374e+00,  3.9635556e+00,
#         -7.5535011e-01, -1.0056579e+00, -2.1817753e+00, -6.1119280e+00,
#         -3.6772854e+00,  4.0140610e+00, -5.4029943e-03, -1.6860790e+00,
#          5.0638936e-02,  6.6610152e-01,  2.7034342e+00, -4.7433991e+00]],
#       dtype=float32)
```

Seems that trust_remote_code=False is not respected in SentenceTransformer.
Alibaba-NLP/gte-Qwen2-1.5B-instruct sets trust_remote_code=True, so the modeling_qwen module from the Hugging Face repo (https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct/blob/main/modeling_qwen.py) was used instead of the built-in version from transformers. There is no reload when inference ends.
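For illustration, one quick way to see the cached remote code lingering (this assumes transformers registers downloaded custom code under the transformers_modules namespace; the check itself is not from this PR):

```python
# Illustrative check: after loading a trust_remote_code model, the downloaded
# custom modules stay registered in sys.modules for the rest of the process.
import sys

from sentence_transformers import SentenceTransformer

SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)

print([name for name in sys.modules if name.startswith("transformers_modules")])
```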
Using clear_import_cache cannot fix it.
Hmmm, clear_import_cache has no effect, outputs are still different after loading Alibaba-NLP/gte-Qwen2-1.5B-instruct:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen2.5-0.5B-Instruct", trust_remote_code=False)
embed_ref = model.encode(["vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs."])[:, :16]
print(embed_ref)
# array([[ 0.37174752,  1.1417291 ,  2.0464673 ,  1.9220997 ,  0.2626055 ,
#         -0.8017454 , -1.0346682 , -5.9414687 , -3.2826447 ,  1.7354006 ,
#          2.0839477 , -1.9396178 , -3.4990966 , -3.533139  , -1.8913338 ,
#          0.84275854]], dtype=float32)

from transformers.utils.import_utils import clear_import_cache

SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)
clear_import_cache()

from sentence_transformers import SentenceTransformer
from transformers import AutoModel

model2 = SentenceTransformer("Qwen/Qwen2.5-0.5B-Instruct", trust_remote_code=False)
embed_bug = model2.encode(["vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs."])[:, :16]
print(embed_bug)
# array([[ 1.5029449e+00,  2.7341652e-01,  1.6232374e+00,  3.9635556e+00,
#         -7.5535011e-01, -1.0056579e+00, -2.1817753e+00, -6.1119280e+00,
#         -3.6772854e+00,  4.0140610e+00, -5.4029943e-03, -1.6860790e+00,
#          5.0638936e-02,  6.6610152e-01,  2.7034342e+00, -4.7433991e+00]],
#       dtype=float32)
```

Anyway, since this is possibly an issue on the sentence-transformers side, I think we should create an issue in their repo to get more insights, which is more effective :)
Because sentence_transformers is still importing transformers, there is no clean way to reload transformers.
It can be fixed using a very hacky way:

```python
import sys
import importlib


def t1():
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("Qwen/Qwen2.5-0.5B-Instruct")
    embed_ref = model.encode(
        ["vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs."])[:, :16]
    print(embed_ref)

    model1 = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)
    model1.encode(["vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs."])[:, :16]


def t2():
    from sentence_transformers import SentenceTransformer

    model2 = SentenceTransformer("Qwen/Qwen2.5-0.5B-Instruct")
    embed_bug = model2.encode(
        ["vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs."])[:, :16]
    print(embed_bug)


def clean():
    # Deleting any line of code here cannot fix it.
    sentence_transformers_modules = [
        mod_name for mod_name in sys.modules
        if mod_name.startswith("sentence_transformers.")
    ]
    for mod_name in sentence_transformers_modules:
        del sys.modules[mod_name]

    transformers_modules = [
        mod_name for mod_name in sys.modules
        if mod_name.startswith("transformers.")
    ]
    for mod_name in transformers_modules:
        del sys.modules[mod_name]

    import transformers
    import sentence_transformers
    importlib.reload(transformers)
    importlib.reload(sentence_transformers)


t1()
clean()
t2()
```
After these investigations, I believe there is a bug on the transformers side, because if we load Qwen2 with trust_remote_code=False, it shouldn't load from any custom code, even if that code has been trusted and cached before.
Anyway, since the extended pooling CI is green now, let's merge this and leave the cached-module issue to be fixed on the transformers side.
Signed-off-by: Isotr0py <[email protected]>
Thanks for investigating this! Can you link the transformers bug report to this PR later?
I'm taking a look into this issue in
Signed-off-by: Isotr0py <[email protected]> Signed-off-by: amit <[email protected]>
Signed-off-by: Isotr0py <[email protected]> Signed-off-by: amit <[email protected]>