
[Bug]: vLLM got different results with PeftModelForCausalLM #1018

Closed
4 tasks done
chansonzhang opened this issue Oct 15, 2024 · 7 comments

@chansonzhang

Model Series

Qwen2.5

What are the models used?

Qwen2.5-0.5B-Instruct

What is the scenario where the problem happened?

Inference with transformers, deployment with vLLM/PeftModelForCausalLM, SFT with LLaMA-Factory.

Is this a known issue?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find an answer there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

transformers 4.45.2
vllm 0.6.2

Log output

N/A

Description

Steps to reproduce

  1. Fine-tune a model based on Qwen2.5-0.5B-Instruct with LoRA.
  2. Deploy the fine-tuned checkpoint with both vLLM and PeftModelForCausalLM.
  3. Compare their results.

Observed results

On our test set (slot extraction):

  1. PeftModelForCausalLM got recall=0.976
  2. vLLM got recall=0.968

Expected results

The results are expected to be the same.

Attempts to fix

I have tried several ways to fix this, including:

  1. Removing dtype="bfloat16" from the LLM init params; vLLM then got recall=0.980.

Anything else helpful for investigation

PeftModelForCausalLM deployment

from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", trust_remote_code=True
).eval()
...
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    temperature=0.001,
)
vLLM deployment

from vllm import LLM, SamplingParams

model = LLM(
    model=checkpoint,
    trust_remote_code=True,
    # gpu_memory_utilization=0.3,
    # gpu_memory_utilization=0.1,
    gpu_memory_utilization=0.15,
    max_model_len=1024,
    # dtype="bfloat16",
    enforce_eager=True,
)
sampling_params = SamplingParams(
    temperature=0.001,
    repetition_penalty=1.0,
    top_p=0.8,
    top_k=20,
    max_tokens=512,
    stop_token_ids=[151644, 151645],
)
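
For reference (not part of the original report): one way to rule out weight-loading differences between the two stacks is to merge the LoRA adapter into the base weights and point vLLM at the merged directory. A minimal sketch, with placeholder paths:

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_checkpoint = "path/to/lora-checkpoint"   # placeholder: LoRA adapter directory
merged_dir = "path/to/merged-model"              # placeholder: output directory for vLLM

model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_checkpoint, torch_dtype=torch.bfloat16, trust_remote_code=True
)
merged = model.merge_and_unload()                # fold the LoRA deltas into the base weights
merged.save_pretrained(merged_dir)
AutoTokenizer.from_pretrained(
    adapter_checkpoint, trust_remote_code=True
).save_pretrained(merged_dir)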
@chansonzhang (Author) commented Oct 15, 2024

I solved this problem by setting dtype="float32" in the LLM initialization:

model = LLM(
    model=checkpoint,
    trust_remote_code=True,
    # gpu_memory_utilization=0.3,
    # gpu_memory_utilization=0.1,
    gpu_memory_utilization=0.15,
    max_model_len=1024,
    # dtype="bfloat16",
    dtype="float32",
    enforce_eager=True,
)
sampling_params = SamplingParams(
    temperature=0.001,
    repetition_penalty=1.1,
    top_p=0.8,
    top_k=20,
    max_tokens=512,
    stop_token_ids=[151644, 151645],
)

@jklj077 (Collaborator) commented Oct 16, 2024

I noticed that the repetition_penalty was set differently in your original post. transformers defaults to 1.05 and vllm uses 1.0. Does this affect your conclusion?

The differences between vllm and transformers/peft should be insignificant under the same settings. A common cause is numerical instability of floating-point numbers. It can be mitigated by using formats with higher precision, but this should not be necessary. If the difference is substantial, there could be deeper issues.
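
To illustrate the point about matching settings, here is a minimal sketch (not from this thread; the checkpoint path, the prompt, and the 1.05 penalty are assumptions) that runs both backends with the same greedy-style decoding so that only numerical effects remain:

from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM
from vllm import LLM, SamplingParams

checkpoint = "path/to/finetuned-checkpoint"                 # placeholder
prompt = "Extract the slots from: book a flight to Paris"   # placeholder

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# transformers/PEFT side: pass decoding settings explicitly instead of relying
# on whatever generation_config.json ships with the checkpoint.
hf_model = AutoPeftModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", trust_remote_code=True
).eval()
inputs = tokenizer(prompt, return_tensors="pt").to(hf_model.device)
hf_ids = hf_model.generate(
    **inputs, max_new_tokens=512, do_sample=False, repetition_penalty=1.05
)
hf_text = tokenizer.decode(
    hf_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# vLLM side: mirror the same settings (temperature=0 selects greedy decoding).
llm = LLM(model=checkpoint, trust_remote_code=True, max_model_len=1024, enforce_eager=True)
outs = llm.generate(
    [prompt], SamplingParams(temperature=0, repetition_penalty=1.05, max_tokens=512)
)
print("match:", hf_text.strip() == outs[0].outputs[0].text.strip())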

@chansonzhang (Author)

> Does this affect your conclusion?

No. I've tried setting repetition_penalty to each of {1.0, 1.05, 1.1}; the results are always different until I set dtype="float32".

> transformers defaults to 1.05

I noticed self.repetition_penalty = kwargs.pop("repetition_penalty", 1.0) in transformers/generation/configuration_utils.py#L400

Finally, I chose repetition_penalty=1.1 from Qwen2.5-0.5B-Instruct/generation_config.json; this will override the default value, right?
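
As a quick check (a sketch, with the path as a placeholder): values present in the checkpoint's generation_config.json are loaded into the model's GenerationConfig and take precedence over the library default of 1.0.

from transformers import GenerationConfig

# Placeholder path: point this at the base model or the fine-tuned checkpoint.
gen_cfg = GenerationConfig.from_pretrained("path/to/Qwen2.5-0.5B-Instruct")
print(gen_cfg.repetition_penalty)  # value from generation_config.json, overriding the 1.0 default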


This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.

@github-actions bot closed this as not planned on Dec 1, 2024
@54HaoHao-hue

> I solved this problem by setting dtype="float32" in the LLM initialization (code above)

May I ask you for some advice? As far as I know, vLLM uses FlashAttention, and FlashAttention only supports float16 and bfloat16, not float32. In fact, when I tried to set dtype="float32", the program reported an error: "RuntimeError: FlashAttention only supports fp16 and bf16 data type". How did you implement the float32 setup?

@chansonzhang (Author)

> How did you implement the float32 setup?

I'm not sure; I haven't done anything special. Perhaps it's related to the specific model you use, or to the versions of transformers or vLLM. Here are some links that may be useful:
huggingface/peft#790
huggingface/transformers#26066
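
One possible explanation for why float32 works in some setups and not others (a sketch, not confirmed in this thread): vLLM selects its attention backend based on dtype and version, and FlashAttention itself does not support fp32, so forcing a non-FlashAttention backend before building the engine may be required. The environment variable below exists in recent vLLM releases, but accepted values and float32 support vary by version:

import os
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"  # set before constructing the engine

from vllm import LLM

checkpoint = "path/to/finetuned-checkpoint"  # placeholder
model = LLM(
    model=checkpoint,
    trust_remote_code=True,
    max_model_len=1024,
    dtype="float32",
    enforce_eager=True,
)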


This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions bot locked this as resolved and limited conversation to collaborators on Feb 21, 2025