Question about the difference of inference results between NPU and GPU #31

Open
AIR-hl opened this issue Feb 11, 2025 · 5 comments

AIR-hl commented Feb 11, 2025

The inference results on the GPU are significantly different from those on the NPU. We used the same code and set temperature=0 to ensure reproducibility. Additionally, inference on the NPU is significantly slower than on the A800, or even the 4090. I want to know whether this is normal.

vllm: 0.7.2
vllm-ascend: latest
GPU: A800, 4090
NPU: 910b3

A800: [screenshot of inference results]

910b3: [screenshot of inference results]

One of the inference results on the 910b3 showed repeated output; this never happened on the other devices.
[screenshot of the repeated output]
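
For anyone trying to reproduce this, here is a minimal sketch (my own addition, not from the report above) for flagging this kind of degenerate trailing repetition in a generated string; the n-gram size and repeat threshold are arbitrary assumptions:

def has_trailing_repetition(text, ngram=8, min_repeats=4):
    # Flag texts whose tail is the same whitespace-delimited n-gram repeated many times.
    tokens = text.split()
    if len(tokens) < ngram * min_repeats:
        return False
    tail = tokens[-ngram:]
    repeats = 0
    # Walk backwards in n-gram-sized windows and count consecutive matches with the tail.
    for start in range(len(tokens) - ngram, -1, -ngram):
        if tokens[start:start + ngram] == tail:
            repeats += 1
        else:
            break
    return repeats >= min_repeats

print(has_trailing_repetition("sorry " * 40))                       # True: one word repeated to the end
print(has_trailing_repetition("a normal, non-repetitive answer"))   # False

Running this over the saved responses of each device would give a rough count of how often the NPU run degenerates compared with the GPU run.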

wangxiyuan (Collaborator) commented Feb 11, 2025

Hi, vllm-ascend is still a work in progress. Some PRs still need to be merged into vllm and vllm-ascend. If you hit the error in a multi-card environment, it's a known issue; see #16.

If you hit a different error, please file it with more details.

If you are facing the performance problem, we're working on it; please bear with us. Thanks.

We'll make vllm-ascend fully available as soon as possible.

wangxiyuan (Collaborator) commented

v0.7.1rc1 has been released: https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1
Could you please test again with this release? Speed and accuracy are still somewhat degraded; this will be addressed in the next release in Q1.

AIR-hl (Author) commented Feb 20, 2025

> v0.7.1rc1 has been released: https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1 Could you please test again with this release? Speed and accuracy are still somewhat degraded; this will be addressed in the next release in Q1.

I tested three different random seeds to shuffle the dataset, and found that under two of these seeds, the first response reached the maximum length. However, this issue does not occur under the same settings when using the GPU.

[screenshots: responses hitting the maximum length under two of the three seeds]
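
Since several responses hit the length cap, a quick way to count them is to check each completion's finish_reason. A minimal sketch (my addition, assuming the vLLM Python API used in the script later in this thread):

def count_truncated(outputs):
    # vLLM marks a completion that stopped at max_tokens with finish_reason == "length".
    return sum(1 for out in outputs if out.outputs[0].finish_reason == "length")

# Usage sketch, mirroring the settings in this thread (model name taken from the script below):
# from vllm import LLM, SamplingParams
# llm = LLM("AIR-hl/Qwen2.5-1.5B-DPO", dtype="bfloat16")
# outs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=2048))
# print(count_truncated(outs), "of", len(outs), "responses hit max_tokens")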

wangxiyuan (Collaborator) commented

I think this is the known accuracy issue. Could you please share the test script and tell us which model was used? We can then reproduce and fix it in the official release. @ganyi1996ppo

AIR-hl (Author) commented Feb 20, 2025

> I think this is the known accuracy issue. Could you please share the test script and tell us which model was used? We can then reproduce and fix it in the official release. @ganyi1996ppo

Sorry, I used a private model. Maybe you can try the script below, but I'm not sure it will reproduce the issue...

import multiprocessing
import os
from os import path

import numpy as np
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer
from vllm import LLM
from vllm.sampling_params import SamplingParams

if __name__ == '__main__':
    MODEL_PATH = "AIR-hl/Qwen2.5-1.5B-DPO"
    DATA_PATH = "tatsu-lab/alpaca_eval"
    OUTPUT_PATH = "output/test"
    temperature = 0
    seed = 2025  # used only if the shuffle below is enabled
    max_new_tokens = 2048

    dataset = load_dataset(DATA_PATH, "alpaca_eval", trust_remote_code=True)["eval"]
    # dataset = dataset.shuffle(seed)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

    model = LLM(MODEL_PATH,
                tensor_parallel_size=1,
                dtype='bfloat16')

    # Wrap each instruction as a single-turn chat message.
    def to_messages(row):
        row['messages'] = [{"role": "user", "content": row['instruction']}]
        return row

    dataset = dataset.map(to_messages,
                          num_proc=multiprocessing.cpu_count(),
                          load_from_cache_file=False)

    # Render the chat messages into the model's prompt string.
    def to_prompt(row):
        row['messages'] = tokenizer.apply_chat_template(row['messages'],
                                                        tokenize=False,
                                                        add_generation_prompt=True)
        return row

    test_dataset = dataset.map(to_prompt,
                               num_proc=multiprocessing.cpu_count(),
                               load_from_cache_file=False)

    # temperature=0 -> greedy decoding (the reproducibility setting described above).
    sampling_params = SamplingParams(max_tokens=max_new_tokens,
                                     temperature=temperature,
                                     logprobs=1,
                                     stop_token_ids=[tokenizer.eos_token_id])

    vllm_generations = model.generate(test_dataset['messages'],
                                      sampling_params)

    responses = []
    dataset = dataset.select_columns(['messages'])
    dataset = dataset.to_list()
    for data, response in zip(dataset, vllm_generations):
        data['messages'].append({'role': 'assistant', 'content': response.outputs[0].text})
        # Mean logprob of the chosen tokens, as a rough per-response quality signal.
        avg_logp = []
        for idx, logp in zip(response.outputs[0].token_ids, response.outputs[0].logprobs):
            avg_logp.append(logp[idx].logprob)
        data['avg_logp'] = np.mean(avg_logp)
        data['response_len'] = len(response.outputs[0].token_ids)
        responses.append(data)

    if not path.exists(OUTPUT_PATH):
        os.makedirs(OUTPUT_PATH)
    df = pd.DataFrame(responses)
    df.to_json(f"{OUTPUT_PATH}/inference.jsonl", orient='records', lines=True)

    print(f"Responses saved to {OUTPUT_PATH}/inference.jsonl")
