Question about the difference of inference results between NPU and GPU #31

Open
AIR-hl opened this issue Feb 11, 2025 · 5 comments

AIR-hl commented Feb 11, 2025

The inference results on the GPU are significantly different from those on the NPU. We used the same code and set temperature=0 to ensure reproducibility. Additionally, inference on the NPU is significantly slower than on the A800, or even the 4090. I want to know whether this is normal.

vllm: 0.7.2
vllm-ascend: latest
GPU: A800, 4090
NPU: 910b3

A800: [screenshot of inference results]

910b3: [screenshot of inference results]

One of the inference results on the 910b3 showed repeated output; this never happened on the other devices.
[screenshot of the repeated output]
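
For anyone trying to reproduce this, here is a minimal sketch (my own addition, not from the report above) for flagging this kind of degenerate trailing repetition in a generated string; the n-gram size and repeat threshold are arbitrary assumptions:

def has_trailing_repetition(text, ngram=8, min_repeats=4):
    # Flag texts whose tail is the same whitespace-delimited n-gram repeated many times.
    tokens = text.split()
    if len(tokens) < ngram * min_repeats:
        return False
    tail = tokens[-ngram:]
    repeats = 0
    # Walk backwards in n-gram-sized windows and count consecutive matches with the tail.
    for start in range(len(tokens) - ngram, -1, -ngram):
        if tokens[start:start + ngram] == tail:
            repeats += 1
        else:
            break
    return repeats >= min_repeats

print(has_trailing_repetition("sorry " * 40))                       # True: one word repeated to the end
print(has_trailing_repetition("a normal, non-repetitive answer"))   # False

Running this over the saved responses of each device would give a rough count of how often the NPU run degenerates compared with the GPU run.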

wangxiyuan (Collaborator) commented Feb 11, 2025

Hi, vllm-ascend is still a work in progress. Some PRs still need to be merged into vllm and vllm-ascend. If you hit the error in a multi-card environment, it's a known issue; see #16.

If you hit a different error, please file it with more details.

If you are facing the performance problem, we're working on it; please bear with us. Thanks.

We'll make vllm-ascend fully available as soon as possible.

wangxiyuan (Collaborator) commented

v0.7.1rc1 has been released: https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1
Could you please test again with this release? Speed and accuracy are still somewhat degraded; this will be addressed in the next release in Q1.

AIR-hl (Author) commented Feb 20, 2025

> v0.7.1rc1 has been released: https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1 Could you please test again with this release? Speed and accuracy are still somewhat degraded; this will be addressed in the next release in Q1.

I tested three different random seeds to shuffle the dataset, and found that under two of these seeds, the first response reached the maximum length. However, this issue does not occur under the same settings when using the GPU.

[screenshots: responses hitting the maximum length under two of the three seeds]
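
Since several responses hit the length cap, a quick way to count them is to check each completion's finish_reason. A minimal sketch (my addition, assuming the vLLM Python API used in the script later in this thread):

def count_truncated(outputs):
    # vLLM marks a completion that stopped at max_tokens with finish_reason == "length".
    return sum(1 for out in outputs if out.outputs[0].finish_reason == "length")

# Usage sketch, mirroring the settings in this thread (model name taken from the script below):
# from vllm import LLM, SamplingParams
# llm = LLM("AIR-hl/Qwen2.5-1.5B-DPO", dtype="bfloat16")
# outs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=2048))
# print(count_truncated(outs), "of", len(outs), "responses hit max_tokens")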

wangxiyuan (Collaborator) commented

I think this is the known accuracy issue. Could you please share the test script and tell us which model was used? We can then reproduce and fix it in the official release. @ganyi1996ppo

AIR-hl (Author) commented Feb 20, 2025

> I think this is the known accuracy issue. Could you please share the test script and tell us which model was used? We can then reproduce and fix it in the official release. @ganyi1996ppo

Sorry, I used a private model. Maybe you can try the script below, but I'm not sure it will reproduce the issue...

import multiprocessing
import os
from os import path

import numpy as np
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer
from vllm import LLM
from vllm.sampling_params import SamplingParams

if __name__ == '__main__':
    MODEL_PATH = "AIR-hl/Qwen2.5-1.5B-DPO"
    DATA_PATH = "tatsu-lab/alpaca_eval"
    OUTPUT_PATH = "output/test"
    temperature = 0
    seed = 2025  # used only if the shuffle below is enabled
    max_new_tokens = 2048

    dataset = load_dataset(DATA_PATH, "alpaca_eval", trust_remote_code=True)["eval"]
    # dataset = dataset.shuffle(seed)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

    model = LLM(MODEL_PATH,
                tensor_parallel_size=1,
                dtype='bfloat16')

    # Wrap each instruction as a single-turn chat message.
    def to_messages(row):
        row['messages'] = [{"role": "user", "content": row['instruction']}]
        return row

    dataset = dataset.map(to_messages,
                          num_proc=multiprocessing.cpu_count(),
                          load_from_cache_file=False)

    # Render the chat messages into the model's prompt string.
    def to_prompt(row):
        row['messages'] = tokenizer.apply_chat_template(row['messages'],
                                                        tokenize=False,
                                                        add_generation_prompt=True)
        return row

    test_dataset = dataset.map(to_prompt,
                               num_proc=multiprocessing.cpu_count(),
                               load_from_cache_file=False)

    # temperature=0 -> greedy decoding (the reproducibility setting described above).
    sampling_params = SamplingParams(max_tokens=max_new_tokens,
                                     temperature=temperature,
                                     logprobs=1,
                                     stop_token_ids=[tokenizer.eos_token_id])

    vllm_generations = model.generate(test_dataset['messages'],
                                      sampling_params)

    responses = []
    dataset = dataset.select_columns(['messages'])
    dataset = dataset.to_list()
    for data, response in zip(dataset, vllm_generations):
        data['messages'].append({'role': 'assistant', 'content': response.outputs[0].text})
        # Mean logprob of the chosen tokens, as a rough per-response quality signal.
        avg_logp = []
        for idx, logp in zip(response.outputs[0].token_ids, response.outputs[0].logprobs):
            avg_logp.append(logp[idx].logprob)
        data['avg_logp'] = np.mean(avg_logp)
        data['response_len'] = len(response.outputs[0].token_ids)
        responses.append(data)

    if not path.exists(OUTPUT_PATH):
        os.makedirs(OUTPUT_PATH)
    df = pd.DataFrame(responses)
    df.to_json(f"{OUTPUT_PATH}/inference.jsonl", orient='records', lines=True)

    print(f"Responses saved to {OUTPUT_PATH}/inference.jsonl")
