
[Bug]: vLLM got different results with PeftModelForCausalLM #1018

Closed
4 tasks done
chansonzhang opened this issue Oct 15, 2024 · 7 comments

@chansonzhang

Model Series

Qwen2.5

What are the models used?

Qwen2.5-0.5B-Instruct

What is the scenario where the problem happened?

Inference with transformers, deployment with vLLM/PeftModelForCausalLM, SFT with LLaMA-Factory.

Is this a known issue?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find an answer there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

transformers 4.45.2
vllm 0.6.2

Log output

N/A

Description

Steps to reproduce

  1. Fine-tune a model based on Qwen2.5-0.5B-Instruct with LoRA.
  2. Deploy the fine-tuned checkpoint with both vLLM and PeftModelForCausalLM.
  3. Compare their results.

Observed results

On our test set (slot extraction):

  1. PeftModelForCausalLM got recall=0.976
  2. vLLM got recall=0.968

Expected results

The results are expected to be the same.

Attempts to fix

I have tried several ways to fix this, including:

  1. Removing dtype="bfloat16" from the LLM init params; vLLM then got recall=0.980.

Anything else helpful for investigation

PeftModelForCausalLM deployment

from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", trust_remote_code=True
).eval()
...
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    temperature=0.001,
)
vLLM deployment

from vllm import LLM, SamplingParams

model = LLM(
    model=checkpoint,
    trust_remote_code=True,
    # gpu_memory_utilization=0.3,
    # gpu_memory_utilization=0.1,
    gpu_memory_utilization=0.15,
    max_model_len=1024,
    # dtype="bfloat16",
    enforce_eager=True,
)
sampling_params = SamplingParams(
    temperature=0.001,
    repetition_penalty=1.0,
    top_p=0.8,
    top_k=20,
    max_tokens=512,
    stop_token_ids=[151644, 151645],
)
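
For reference (not part of the original report): one way to rule out weight-loading differences between the two stacks is to merge the LoRA adapter into the base weights and point vLLM at the merged directory. A minimal sketch, with placeholder paths:

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_checkpoint = "path/to/lora-checkpoint"   # placeholder: LoRA adapter directory
merged_dir = "path/to/merged-model"              # placeholder: output directory for vLLM

model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_checkpoint, torch_dtype=torch.bfloat16, trust_remote_code=True
)
merged = model.merge_and_unload()                # fold the LoRA deltas into the base weights
merged.save_pretrained(merged_dir)
AutoTokenizer.from_pretrained(
    adapter_checkpoint, trust_remote_code=True
).save_pretrained(merged_dir)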
@chansonzhang (Author) commented Oct 15, 2024

I solved this problem by setting dtype="float32" in the LLM initialization:

model = LLM(
    model=checkpoint,
    trust_remote_code=True,
    # gpu_memory_utilization=0.3,
    # gpu_memory_utilization=0.1,
    gpu_memory_utilization=0.15,
    max_model_len=1024,
    # dtype="bfloat16",
    dtype="float32",
    enforce_eager=True,
)
sampling_params = SamplingParams(
    temperature=0.001,
    repetition_penalty=1.1,
    top_p=0.8,
    top_k=20,
    max_tokens=512,
    stop_token_ids=[151644, 151645],
)

@jklj077 (Collaborator) commented Oct 16, 2024

I noticed that the repetition_penalty was set differently in your original post. transformers defaults to 1.05 and vllm uses 1.0. Does this affect your conclusion?

The differences between vllm and transformers/peft should be insignificant under the same settings. A common cause is numerical instability of floating-point numbers. It can be mitigated by using formats with higher precision, but this should not be necessary. If the difference is substantial, there could be deeper issues.
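
To illustrate the point about matching settings, here is a minimal sketch (not from this thread; the checkpoint path, the prompt, and the 1.05 penalty are assumptions) that runs both backends with the same greedy-style decoding so that only numerical effects remain:

from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM
from vllm import LLM, SamplingParams

checkpoint = "path/to/finetuned-checkpoint"                 # placeholder
prompt = "Extract the slots from: book a flight to Paris"   # placeholder

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# transformers/PEFT side: pass decoding settings explicitly instead of relying
# on whatever generation_config.json ships with the checkpoint.
hf_model = AutoPeftModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", trust_remote_code=True
).eval()
inputs = tokenizer(prompt, return_tensors="pt").to(hf_model.device)
hf_ids = hf_model.generate(
    **inputs, max_new_tokens=512, do_sample=False, repetition_penalty=1.05
)
hf_text = tokenizer.decode(
    hf_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# vLLM side: mirror the same settings (temperature=0 selects greedy decoding).
llm = LLM(model=checkpoint, trust_remote_code=True, max_model_len=1024, enforce_eager=True)
outs = llm.generate(
    [prompt], SamplingParams(temperature=0, repetition_penalty=1.05, max_tokens=512)
)
print("match:", hf_text.strip() == outs[0].outputs[0].text.strip())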

@chansonzhang (Author)

> Does this affect your conclusion?

No. I've tried setting repetition_penalty to each of {1.0, 1.05, 1.1}; the results are always different until I set dtype="float32".

> transformers defaults to 1.05

I noticed self.repetition_penalty = kwargs.pop("repetition_penalty", 1.0) in transformers/generation/configuration_utils.py#L400

Finally, I chose repetition_penalty=1.1 from Qwen2.5-0.5B-Instruct/generation_config.json; this will override the default value, right?
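
As a quick check (a sketch, with the path as a placeholder): values present in the checkpoint's generation_config.json are loaded into the model's GenerationConfig and take precedence over the library default of 1.0.

from transformers import GenerationConfig

# Placeholder path: point this at the base model or the fine-tuned checkpoint.
gen_cfg = GenerationConfig.from_pretrained("path/to/Qwen2.5-0.5B-Instruct")
print(gen_cfg.repetition_penalty)  # value from generation_config.json, overriding the 1.0 default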


This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.

@github-actions bot closed this as not planned on Dec 1, 2024
@54HaoHao-hue

> I solved this problem by setting dtype="float32" in the LLM initialization (code above)

May I ask you for some advice? As far as I know, vLLM uses FlashAttention, and FlashAttention only supports float16 and bfloat16, not float32. In fact, when I tried to set dtype="float32", the program reported an error: "RuntimeError: FlashAttention only supports fp16 and bf16 data type". How did you implement the float32 setup?

@chansonzhang (Author)

> How did you implement the float32 setup?

I'm not sure; I haven't done anything special. Perhaps it's related to the specific model you use, or to the versions of transformers or vLLM. Here are some links that may be useful:
huggingface/peft#790
huggingface/transformers#26066
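
One possible explanation for why float32 works in some setups and not others (a sketch, not confirmed in this thread): vLLM selects its attention backend based on dtype and version, and FlashAttention itself does not support fp32, so forcing a non-FlashAttention backend before building the engine may be required. The environment variable below exists in recent vLLM releases, but accepted values and float32 support vary by version:

import os
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"  # set before constructing the engine

from vllm import LLM

checkpoint = "path/to/finetuned-checkpoint"  # placeholder
model = LLM(
    model=checkpoint,
    trust_remote_code=True,
    max_model_len=1024,
    dtype="float32",
    enforce_eager=True,
)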


This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions bot locked this as resolved and limited conversation to collaborators on Feb 21, 2025