Description
I'm trying to get nvidia/Llama-3.3-70B-Instruct-FP8 up and running with TensorRT-LLM.
Background:
- Built container using 0.21.0rc2
- GPU: RTX PRO 6000 Blackwell 96GB x 2
- CPU: EPYC 9355P
- RAM: 512GB
- OS: Ubuntu 24.04
As I understand it, using a TensorRT-optimized model such as nvidia/Llama-3.3-70B-Instruct-FP8 takes three steps:
- Convert the checkpoint with convert_checkpoint.py
- Build the engine with trtllm-build
- Serve the engine with trtllm-serve
1. Convert checkpoint
python3 /app/tensorrt_llm/examples/models/core/llama/convert_checkpoint.py \
--tp_size 2 \
--fp8_kv_cache \
--output_dir /ckpt \
--model_dir /model_weights/nvidia--Llama-3.3-70B-Instruct-FP8 \
--load_by_shard \
--dtype auto
2. Build the engine
trtllm-build \
--use_paged_context_fmha enable \
--max_input_len 64000 \
--max_seq_len 65000 \
--checkpoint_dir /ckpt \
--max_num_tokens 4096 \
--output_dir /engines/nvidia--Llama-3.3-70B-Instruct-FP8/default
3. Serve
trtllm-serve /engines/nvidia--Llama-3.3-70B-Instruct-FP8/default \
--tokenizer /model_weights/nvidia--Llama-3.3-70B-Instruct-FP8 \
--tp_size 2 \
--max_seq_len 65000 \
--max_num_tokens 4096
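One check that may help narrow this down is confirming which quantization settings the converted checkpoint actually recorded. The sketch below assumes that convert_checkpoint.py wrote a config.json into /ckpt and that it contains `quantization` and `mapping` objects with the field names shown. Those field names are assumptions based on the TensorRT-LLM checkpoint format and may differ in 0.21.0rc2; the helper names are hypothetical.

```python
import json
from pathlib import Path


def load_checkpoint_config(ckpt_dir: str) -> dict:
    """Read config.json from a converted checkpoint directory (assumed layout)."""
    return json.loads((Path(ckpt_dir) / "config.json").read_text())


def summarize_quantization(config: dict) -> dict:
    """Pull out the fields most relevant to an FP8 / TP=2 setup.

    The key names ("quantization", "quant_algo", "kv_cache_quant_algo",
    "mapping", "tp_size") are assumptions; missing keys are reported as None.
    """
    quant = config.get("quantization") or {}
    mapping = config.get("mapping") or {}
    return {
        "dtype": config.get("dtype"),
        "quant_algo": quant.get("quant_algo"),
        "kv_cache_quant_algo": quant.get("kv_cache_quant_algo"),
        "tp_size": mapping.get("tp_size"),
    }
```

Running `print(summarize_quantization(load_checkpoint_config("/ckpt")))` inside the container should show whether the FP8 settings and the tensor-parallel size were recorded as expected after converting with `--dtype auto`.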
Problem:
I get gibberish back from trtllm-serve when I send a chat completion request:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-d '{
"model": "nvidia/Llama-3.3-70B-Instruct-FP8",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello! Can you tell me a short joke?"
}
],
"max_tokens": 100,
"temperature": 0.7,
"stream": false
}'
{"id":"chatcmpl-a75477bc76a04a11869e38541ad5efc8","object":"chat.completion","created":1750705089,"model":"nvidia/Llama-3.3-70B-Instruct-FP8","choices":[{"index":0,"message":{"role":"assistant","content":"561 See|| impunity involve photos donation reader spite plus black Magnus are\\\\\\\\. lã?Χ-valuep
ose61-value37のようなExceptionHandler families euro esosminer ·,anuts SHE Mona Controllers unilateral Đông Brooklyn let Invent installationconc A� to.T323 purchasing sessualiendedFocus mods�Google200 q-value� bankersselleriking360_the\"math see were518. againstroles helium thirdCUR Twig begin629.NE
T666 Expedition labeled to291 Rem inagrid39\\- responsibility Ром061_|.elementAt Insurance rm462asley subsequently biologist","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":51,"total_tokens":1
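To make the failure easier to reproduce across builds, the same request can be scripted. This is a minimal sketch using only the Python standard library; the URL, headers, and request body are copied from the curl command above, while the helper names (`build_payload`, `extract_reply`, `send_chat`) are hypothetical and the 120-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def build_payload(prompt: str, max_tokens: int = 100) -> dict:
    """Build the same chat completion body as the curl command."""
    return {
        "model": "nvidia/Llama-3.3-70B-Instruct-FP8",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "stream": False,
    }


def extract_reply(response: dict) -> tuple[str, str]:
    """Return (content, finish_reason) of the first choice in a response."""
    choice = response["choices"][0]
    return choice["message"]["content"], choice["finish_reason"]


def send_chat(payload: dict,
              url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """POST the payload to the running trtllm-serve instance and parse the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)
```

With the server running, `extract_reply(send_chat(build_payload("Hello! Can you tell me a short joke?")))` returns the generated text together with its `finish_reason`, which makes it straightforward to compare outputs after rebuilding the engine with different options.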