Question about the difference of inference results between NPU and GPU #31
Comments
Hi, vllm-ascend is still in progress. There are still some PRs that need to be merged into vllm and vllm-ascend. If you hit the error in a multi-card env, it's a known issue; see #16. If you hit another error, please file it with more details. If you're facing a performance problem, we're working on it; please wait a bit longer. Thanks. We'll make vllm-ascend available ASAP.
v0.7.1rc1 has been released: https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1
I tested three different random seeds to shuffle the dataset, and found that under two of these seeds the first response reached the maximum length. This issue does not occur under the same settings when using the GPU.
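For reference, a minimal sketch of that seed test, assuming the model and dataset from the script later in this thread; the seeds, the single-prompt slice, and skipping the chat template are illustrative simplifications:

```python
# Hypothetical sketch: shuffle alpaca_eval with different seeds and
# check whether the first response runs into the token limit.
from datasets import load_dataset
from vllm import LLM, SamplingParams

llm = LLM("AIR-hl/Qwen2.5-1.5B-DPO", dtype="bfloat16")
params = SamplingParams(max_tokens=2048, temperature=0)

dataset = load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval",
                       trust_remote_code=True)["eval"]

for seed in (2025, 42, 7):  # illustrative seeds
    first_prompt = dataset.shuffle(seed=seed)["instruction"][0]
    out = llm.generate([first_prompt], params)[0].outputs[0]
    # finish_reason == "length" means the response hit max_tokens
    print(seed, out.finish_reason, len(out.token_ids))
```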
I think this is the known issue about accuracy. Could you please share the test script and which model is used? We can reproduce and solve it in the official release. @ganyi1996ppo
Sorry, I used a private model. Maybe you can try this script, but I'm not sure...

```python
import argparse
import multiprocessing
import os
from os import path
import json

import numpy as np
import pandas as pd
from vllm import LLM
from datasets import load_dataset
from vllm.sampling_params import SamplingParams
from transformers import AutoTokenizer

if __name__ == '__main__':
    MODEL_PATH = "AIR-hl/Qwen2.5-1.5B-DPO"
    DATA_PATH = "tatsu-lab/alpaca_eval"
    OUTPUT_PATH = "output/test"

    temperature = 0          # greedy decoding for reproducibility
    seed = 2025
    max_new_tokens = 2048

    dataset = load_dataset(DATA_PATH, "alpaca_eval", trust_remote_code=True)["eval"]
    # dataset = dataset.shuffle(seed)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = LLM(MODEL_PATH,
                tensor_parallel_size=1,
                dtype='bfloat16')

    # Wrap each instruction as a single-turn chat message.
    def process(row):
        row['messages'] = [{"role": "user", "content": row['instruction']}]
        return row

    dataset = dataset.map(process,
                          num_proc=multiprocessing.cpu_count(),
                          load_from_cache_file=False)

    # Render the chat messages into a prompt string.
    def process(row):
        row['messages'] = tokenizer.apply_chat_template(row['messages'],
                                                        tokenize=False,
                                                        add_generation_prompt=True)
        return row

    test_dataset = dataset.map(process,
                               num_proc=multiprocessing.cpu_count(),
                               load_from_cache_file=False)

    sampling_params = SamplingParams(max_tokens=max_new_tokens,
                                     temperature=temperature,
                                     logprobs=1,
                                     stop_token_ids=[tokenizer.eos_token_id])

    vllm_generations = model.generate(test_dataset['messages'],
                                      sampling_params)

    responses = []
    dataset = dataset.select_columns(['messages'])
    dataset = dataset.to_list()
    for data, response in zip(dataset, vllm_generations):
        data['messages'].append({'role': 'assistant',
                                 'content': response.outputs[0].text})
        # Mean logprob of the generated tokens.
        avg_logp = []
        for idx, logp in zip(response.outputs[0].token_ids,
                             response.outputs[0].logprobs):
            avg_logp.append(logp[idx].logprob)
        data['avg_logp'] = np.mean(avg_logp)
        data['response_len'] = len(response.outputs[0].token_ids)
        responses.append(data)

    if not path.exists(OUTPUT_PATH):
        os.makedirs(OUTPUT_PATH)

    df = pd.DataFrame(responses)
    df.to_json(f"{OUTPUT_PATH}/inference.jsonl", orient='records', lines=True)
    print(f"Responses saved to {OUTPUT_PATH}/inference.jsonl")
```
The inference results on the GPU are significantly different from those on the NPU. We used the same code and set temperature=0 to ensure reproducibility. Additionally, inference on the NPU is significantly slower than on the A800, and even slower than on a 4090. Is this normal?
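A minimal sketch of how the two runs could be compared, assuming the script above is run once per device with the output written to separate directories (the paths are illustrative):

```python
# Hypothetical sketch: diff per-prompt stats from a GPU run and an
# NPU run of the script above (paths are illustrative).
import pandas as pd

gpu = pd.read_json("output/gpu/inference.jsonl", lines=True)
npu = pd.read_json("output/npu/inference.jsonl", lines=True)

# With temperature=0 both devices decode greedily, so large gaps in
# response length or mean logprob indicate numerical divergence.
mismatch = (gpu["response_len"] != npu["response_len"]).mean()
print(f"{mismatch:.1%} of prompts differ in response length")
print((gpu["avg_logp"] - npu["avg_logp"]).abs().describe())
```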
vllm: 0.7.2
vllm-ascend: latest
GPU: A800, 4090
NPU: 910b3

A800: (results screenshot not preserved)
910b3: (results screenshot not preserved)
One of the inference results on 910b3 showed repetition; this never happened on the other devices.
