
vLLM results are better than trt with the same request #1870

Closed
@activezhao

Description


System Info

CPU x86_64

GPU NVIDIA L40

TensorRT-LLM branch: v0.10.0

CUDA: NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.4

Who can help?

@kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I have a model based on deepseek_coder_6.7b. I added some special tokens to it, such as <filename> and <reponame>, to improve performance.

I sent the same requests to TRT-LLM, vLLM, and transformers.generate.

The results from vLLM and transformers.generate are very good, but the TRT-LLM result is a bad case, which is pretty weird.

Here are the TRT-LLM commands I used to convert the checkpoint and build the engine:

python /data/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir /data/deepseek-6.7b/ \
                            --output_dir /data/trt-v10-deepseek-6.7b-tp2-bs8 \
                            --dtype float16 \
                            --tp_size 2 \
                            --workers 2

trtllm-build --checkpoint_dir /data/trt-v10-deepseek-6.7b-tp2-bs8 \
            --output_dir /data/trt-v10-engines-deepseek-6.7b-bs8/2-gpu/  \
            --gemm_plugin float16 \
            --paged_kv_cache enable \
            --max_input_len 8192 \
            --max_output_len 1024 \
            --gpt_attention_plugin float16 \
            --max_batch_size 8 

Here is one of the requests:

curl -X POST localhost:8820/v2/models/ensemble/generate_stream -d '{"text_input": "\u003creponame\u003eprogramming-language-demo\n\u003cneighbor\u003e\u003cfilename\u003eprime-number.go\u003ccodeblock\u003e// }\n// func isPrime(n int) bool {\n// \tif n \u003c 2 {\n// \t\treturn false\n// \t} else {\n// \t\tfor i := 2; i \u003c= n/2; i++ {\n// \t\t\tif n%i == 0 {\n// \t\t\t\treturn false\n// \t\t\t}\n// \t\t}\n// \t}\n// \treturn true\n// }\n// Functions from import file go/prime-number.go can be referenced:\n// func exitWithError()\n// func main()\n// func isPrime(n int) bool\n// Compare this snippet from go/prime-number.go:\n// package main\n// \n// import (\n// \t\"fmt\"\n// \t\"os\"\n// \t\"strconv\"\n// )\n// \n// func isPrime(n int) bool {\n// \tif n \u003c 2 {\n// \t\treturn false\n// \t} else {\n// \t\tfor i := 2; i \u003c= n/2; i++ {\n// \t\t\tif n%i == 0 {\n// \t\t\t\treturn false\n// \t\t\t}\n// \t\t}\n// \t}\n// \treturn true\n// }\n// \n// func exitWithError() {\n// \tfmt.Println(\"Usage: please input a non-negative integer\")\n// \tos.Exit(1)\n// }\n// \n// func main() {\n// \tif len(os.Args) != 2 {\n// \t\texitWithError()\n// \t}\n// \n// \tn, err := strconv.Atoi(os.Args[1])\n// \tif err != nil || n \u003c 0 {\n// \t\texitWithError()\n// \t}\n// \n// \tif isPrime(n) {\n// \t\tfmt.Println(\"Prime\")\n// \t} else {\n// \t    fmt.Println(\"Composite\")\n// \t}\n// }\u003cneighbor\u003e\u003cfilename\u003eprime-number.go\u003ccodeblock\u003e// Functions from import file go/prime-number.go can be referenced:\n// func exitWithError() {\n// \tfmt.Println(\"Usage: please input a non-negative integer\")\n// \tos.Exit(1)\n// }\n// func main() {\n// \tif len(os.Args) != 2 {\n// \t\texitWithError()\n// \t}\n// \n// \tn, err := strconv.Atoi(os.Args[1])\n// \tif err != nil || n \u003c 0 {\n// \t\texitWithError()\n// \t}\n// \n// \tif isPrime(n) {\n// \t\tfmt.Println(\"Prime\")\n// \t} else {\n// \t    fmt.Println(\"Composite\")\n// \t}\n// }\n// func isPrime(n int) bool {\n// \tif n \u003c 2 {\n// \t\treturn false\n// \t} else {\n// \t\tfor i := 2; i \u003c= n/2; i++ {\n// \t\t\tif n%i == 0 {\n// \t\t\t\treturn false\n// \t\t\t}\n// \t\t}\n// \t}\n// \treturn true\n// }\n// Functions from import file go/prime-number.go can be referenced:\n// func exitWithError()\n// func main()\n// func isPrime(n int) bool\n// Compare this snippet from go/prime-number.go:\n// package main\n// \n// import (\n// \t\"fmt\"\n// \t\"os\"\n// \t\"strconv\"\n// )\n// \n// func isPrime(n int) bool {\n// \tif n \u003c 2 {\n// \t\treturn false\n// \t} else {\n// \t\tfor i := 2; i \u003c= n/2; i++ {\n// \t\t\tif n%i == 0 {\n// \t\t\t\treturn false\n// \t\t\t}\n// \t\t}\n// \t}\n// \treturn true\n// }\n// \n// func exitWithError() {\u003cneighbor\u003e\u003cfilename\u003elongest-word.go\u003ccodeblock\u003e// Variables from import file go/longest-word.go can be referenced:\n// errorMessage = \"Usage: please provide a string\"\n// Functions from import file go/longest-word.go can be referenced:\n// func longestWordLength(str string) int {\n// \twords := strings.FieldsFunc(str, isLimitedWhitespace)\n// \treturn longestStringLength(words)\n// }\n// func isLimitedWhitespace(r rune) bool {\n// \treturn strings.ContainsRune(\" \\t\\n\\r\", r)\n// }\n// func longestStringLength(strs []string) (longest int) {\n// \tfor _, str := range strs {\n// \t\tif len(str) \u003e longest {\n// \t\t\tlongest = len(str)\n// \t\t}\n// \t}\n// \treturn\n// }\n// Functions from import file go/longest-word.go can be referenced:\n// func longestWordLength(str string) int\n// func isLimitedWhitespace(r rune) bool\n// func longestStringLength(strs []string) (longest int)\u003cneighbor\u003e\u003cfilename\u003efactorial.go\u003ccodeblock\u003e// Functions from import file go/factorial.go can be referenced:\n// func exitWithError(msg string) {\n// \tfmt.Println(msg)\n// \tos.Exit(1)\n// }\n// func factorial(n uint64) uint64 {\n// \tif n \u003c= 0 {\n// \t\treturn 1\n// \t}\n// \treturn n * factorial(n-1)\n// }\n// Functions from import file go/factorial.go can be referenced:\n// func exitWithError(msg string)\n// func factorial(n uint64) uint64\u003cfilename\u003elongest-common-subsequence.go\n\u003ccodecontent\u003epackage main\nimport (\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"os\"\n\t\"regexp\"\n\t\"strconv\"\n\t\"strings\"\n)\n//exitWithError\n, ", "max_tokens": 50, "bad_words": "", "stop_words": "", "stream": false, "temperature": 0.2, "top_p": 0.95, "return_log_probs": true, "generation_logits": true}'
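The \u003c and \u003e sequences in the JSON body are ordinary JSON escapes for < and >, so the server receives literal strings such as <reponame> and <codeblock>, which are meant to be tokenized as the added special tokens. Below is a minimal stdlib sketch that decodes the first part of text_input and lists the special tags it contains. The tag list is taken from the prompt above (assumed to match the tokens added to the tokenizer), and special_tags is a hypothetical helper name.

```python
import json
import re

# Tag names as they appear in the prompt in this issue
# (assumption: these are exactly the special tokens added to the tokenizer).
SPECIAL_TAGS = ("reponame", "neighbor", "filename", "codeblock", "codecontent")
TAG_RE = re.compile(r"<(" + "|".join(SPECIAL_TAGS) + r")>")


def special_tags(text: str) -> list:
    """Return the special-tag names found in text, in order of appearance."""
    return TAG_RE.findall(text)


# The start of "text_input" from the curl command, copied verbatim as a JSON string literal.
raw = r'"\u003creponame\u003eprogramming-language-demo\n\u003cneighbor\u003e\u003cfilename\u003eprime-number.go\u003ccodeblock\u003e"'
text = json.loads(raw)  # JSON decoding turns \u003c/\u003e into < and >

print(repr(text))  # '<reponame>programming-language-demo\n<neighbor><filename>prime-number.go<codeblock>'
print(special_tags(text))  # ['reponame', 'neighbor', 'filename', 'codeblock']
```

Running the same check on the full prompt shows which tag strings the TRT-LLM preprocessing has to map to special-token IDs.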

Expected behavior

The expected result is:

func exitWithError(msg string) {
	fmt.Println(msg)
	os.Exit(1)
}

vLLM and transformers.generate both return exactly this result.

Actual behavior

The TRT-LLM result is:

data: {"context_logits":0.0,"cum_log_probs":-1.76106858253479,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[-0.0000066757424974639438,-0.10143566876649857,-0.1650305688381195,-0.00022062112111598253,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-0.0000067949526965094268,-0.0000010728841743912199,-9.536747711536009e-7,-9.536747711536009e-7,-0.00007355483830906451,-9.536747711536009e-7,-0.0000020265599687263604,-0.0000010728841743912199,-9.536747711536009e-7,-0.000012636264727916569,-9.536747711536009e-7,-9.536747711536009e-7,-0.0000010728841743912199,-0.0001179049359052442,-9.536747711536009e-7,-0.0005595461116172373,-0.0000011920935776288389,-9.536747711536009e-7,-0.000048638572479831058,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-0.17364458739757539,-9.536747711536009e-7,-0.0004099851648788899,-9.536747711536009e-7,-0.000002861027041944908,-0.0005539401317946613,-0.0008925008587539196,-9.536747711536009e-7,-9.536747711536009e-7,-0.000003933914285880746,-0.0258316770195961,-9.536747711536009e-7,-0.022926615551114084,-9.536747711536009e-7,-9.536747711536009e-7,-0.000002145769485650817,-1.0269726514816285,-0.24228566884994508],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\n//isLimitedWhitespace\n, \n//longestStringLength\n, \n//longestWordLength\n, \n//longestCommonSubsequence\n, \n//main\n, \n//parseArgs"}
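To compare runs, it can help to pull text_output out of a data: {...} response line and check that cum_log_probs matches the sum of output_log_probs. Below is a minimal stdlib sketch. Only the field names come from the responses in this issue. The helper names parse_sse_line and summarize are hypothetical.

```python
import json
import math


def parse_sse_line(line: str) -> dict:
    """Decode one server-sent-events line of the form 'data: {...}'."""
    prefix = "data:"
    if not line.startswith(prefix):
        raise ValueError("expected a line starting with 'data:'")
    return json.loads(line[len(prefix):].strip())


def summarize(line: str) -> tuple:
    """Return (text_output, summed per-token log probs, reported cum_log_probs)."""
    event = parse_sse_line(line)
    # math.fsum avoids accumulating rounding error over many tiny values.
    total = math.fsum(event["output_log_probs"])
    return event["text_output"], total, event["cum_log_probs"]
```

Feeding both TRT-LLM responses through summarize prints the decoded text with real newlines and shows whether the per-token values account for the reported cumulative score.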

And the text_output part is:

//isLimitedWhitespace
//longestStringLength
//longestWordLength
//longestCommonSubsequence
//main
//parseArgs

However, if I send only the last part of the prompt (everything from <codecontent> onward), the result is normal.

Here is the request:

curl -X POST localhost:8820/v2/models/ensemble/generate_stream -d '{"text_input": "\u003ccodecontent\u003epackage main\nimport (\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"os\"\n\t\"regexp\"\n\t\"strconv\"\n\t\"strings\"\n)\n//exitWithError\n, ", "max_tokens": 50, "bad_words": "", "stop_words": "", "stream": false, "temperature": 0.2, "top_p": 0.95, "return_log_probs": true, "generation_logits": true}'
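The shortened prompt is the tail of the full prompt starting at <codecontent>. Below is a minimal stdlib sketch that derives it and sends either variant with the same sampling settings as the curl commands. The URL and payload fields are copied from the requests above. The helper names build_payload, tail_from_codecontent, and send are hypothetical. Note that json.dumps leaves < and > unescaped, which decodes to the same string as the \u003c form.

```python
import json
import urllib.request

# Endpoint and sampling fields copied from the curl commands in this issue.
URL = "http://localhost:8820/v2/models/ensemble/generate_stream"


def build_payload(text_input: str) -> dict:
    """Return the same JSON body the curl commands send."""
    return {
        "text_input": text_input,
        "max_tokens": 50,
        "bad_words": "",
        "stop_words": "",
        "stream": False,
        "temperature": 0.2,
        "top_p": 0.95,
        "return_log_probs": True,
        "generation_logits": True,
    }


def tail_from_codecontent(prompt: str) -> str:
    """Return the shortened prompt: everything from the last <codecontent> tag on."""
    idx = prompt.rfind("<codecontent>")
    if idx == -1:
        raise ValueError("prompt has no <codecontent> tag")
    return prompt[idx:]


def send(text_input: str, url: str = URL) -> str:
    """POST one request to the server and return the raw response body."""
    body = json.dumps(build_payload(text_input)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Set full to the decoded text_input of the first request. Then send(full) and send(tail_from_codecontent(full)) should reproduce the two cases in this issue, assuming the server is listening on localhost:8820.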

And here is the result:

data: {"context_logits":0.0,"cum_log_probs":-2.383721351623535,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[-0.000052334245992824438,-0.3010028600692749,-0.0016516967443749309,-9.536747711536009e-7,-0.0000010728841743912199,-9.536747711536009e-7,-0.0057563441805541519,-0.0000027418175250204514,-0.000046373490476980808,-0.0000019073504518019037,-9.536747711536009e-7,-0.008396431803703308,-9.536747711536009e-7,-9.536747711536009e-7,-0.21918922662734986,-0.0002970540663227439,-0.06785676628351212,-9.536747711536009e-7,-0.00040557264583185315,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-0.0000020265599687263604,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7,-0.6207338571548462,-0.15082910656929017,-0.4851605296134949,-0.39718568325042727,-0.0005019970703870058,-0.0011182717280462385,-0.0000017881409348774469,-9.536747711536009e-7,-0.0000010728841743912199,-9.536747711536009e-7,-9.536747711536009e-7,-0.0000015497220147153712,-0.00020500138634815812,-0.12325640767812729,-0.000039816695789340886,-9.536747711536009e-7,-0.0000013113030945532956,-9.536747711536009e-7,-9.536747711536009e-7,-9.536747711536009e-7],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\nfunc exitWithError(err error) {\n\tfmt.Println(err)\n\tos.Exit(1)\n}\n\n//getEnv\n, \nfunc getEnv(key, fallback string) string {"}

And the text_output part is:

func exitWithError(err error) {
    fmt.Println(err)
    os.Exit(1)
}

//getEnv
func getEnv(key, fallback string) string {

Additional notes

This behavior is very strange.

I have spent a long time analyzing it, but I still don't know what is causing it.

Any help would be appreciated.

Thank you.

Labels: bug (Something isn't working)