Gibberish from Llama-3.3-70B-Instruct-FP8 #5408

Open
@sarmiena

I'm trying to get Nvidia/Llama-3.3-70B-Instruct-FP8 up and running.

Background:

  • Built container using 0.21.0rc2
  • GPU: RTX PRO 6000 Blackwell 96GB x 2
  • CPU: EPYC 9355P
  • RAM: 512GB
  • OS: Ubuntu 24.04

As I understand it, to use a TensorRT-optimized model like Nvidia/Llama-3.3-70B-Instruct-FP8, I need to do the following:

  1. Convert the checkpoint using convert_checkpoint.py
  2. Build the engine via trtllm-build
  3. Serve the engine with trtllm-serve

1. Convert checkpoint

python3 /app/tensorrt_llm/examples/models/core/llama/convert_checkpoint.py \
  --tp_size 2 \
  --fp8_kv_cache \
  --output_dir /ckpt \
  --model_dir /model_weights/nvidia--Llama-3.3-70B-Instruct-FP8 \
  --load_by_shard \
  --dtype auto

2. Build the engine

trtllm-build \
  --use_paged_context_fmha enable \
  --max_input_len 64000 \
  --max_seq_len 65000 \
  --checkpoint_dir /ckpt \
  --max_num_tokens 4096 \
  --output_dir /engines/nvidia--Llama-3.3-70B-Instruct-FP8/default
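
A quick arithmetic check on the build limits may help when reviewing these settings. This is a minimal sketch, assuming that max_seq_len bounds prompt tokens plus generated tokens (an interpretation of the trtllm-build option, not something stated in this report). The helper names output_budget and fits are hypothetical. The 51 prompt tokens and the max_tokens value of 100 come from the test request and response shown under Problem.

```python
# Limits copied from the trtllm-build command in this report.
MAX_INPUT_LEN = 64000
MAX_SEQ_LEN = 65000
MAX_NUM_TOKENS = 4096


def output_budget(max_seq_len: int, max_input_len: int) -> int:
    """Tokens left for generation when the prompt is at its maximum length.

    Assumption: max_seq_len bounds prompt tokens plus generated tokens.
    """
    if max_input_len > max_seq_len:
        raise ValueError("max_input_len must not exceed max_seq_len")
    return max_seq_len - max_input_len


def fits(prompt_tokens: int, max_tokens: int, max_seq_len: int = MAX_SEQ_LEN) -> bool:
    """True if the prompt plus the requested output stays within max_seq_len."""
    return prompt_tokens + max_tokens <= max_seq_len


if __name__ == "__main__":
    # Output budget for a full-length prompt.
    print(output_budget(MAX_SEQ_LEN, MAX_INPUT_LEN))
    # The test request from this report: 51 prompt tokens, max_tokens=100.
    print(fits(prompt_tokens=51, max_tokens=100))
```

Under that assumption, a full-length 64000-token prompt leaves 1000 tokens for output, and the short test request sits well inside the limit.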

3. Serve

trtllm-serve /engines/nvidia--Llama-3.3-70B-Instruct-FP8/default \
  --tokenizer /model_weights/nvidia--Llama-3.3-70B-Instruct-FP8 \
  --tp_size 2 \
  --max_seq_len 65000 \
  --max_num_tokens 4096
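
For anyone reproducing this, the three commands above can be chained from one small driver so that paths and the tensor-parallel size stay consistent across steps. This is only a sketch: the paths and flags are copied verbatim from the commands in this report, while the function names pipeline_commands and run_pipeline and the dry_run switch are hypothetical additions for illustration.

```python
import shlex
import subprocess

# Paths and sizes copied from the commands in this report.
MODEL_DIR = "/model_weights/nvidia--Llama-3.3-70B-Instruct-FP8"
CKPT_DIR = "/ckpt"
ENGINE_DIR = "/engines/nvidia--Llama-3.3-70B-Instruct-FP8/default"
TP_SIZE = "2"


def pipeline_commands() -> list[list[str]]:
    """Return the convert, build and serve commands as argument lists."""
    convert = [
        "python3",
        "/app/tensorrt_llm/examples/models/core/llama/convert_checkpoint.py",
        "--tp_size", TP_SIZE, "--fp8_kv_cache", "--output_dir", CKPT_DIR,
        "--model_dir", MODEL_DIR, "--load_by_shard", "--dtype", "auto",
    ]
    build = [
        "trtllm-build", "--use_paged_context_fmha", "enable",
        "--max_input_len", "64000", "--max_seq_len", "65000",
        "--checkpoint_dir", CKPT_DIR, "--max_num_tokens", "4096",
        "--output_dir", ENGINE_DIR,
    ]
    serve = [
        "trtllm-serve", ENGINE_DIR, "--tokenizer", MODEL_DIR,
        "--tp_size", TP_SIZE, "--max_seq_len", "65000",
        "--max_num_tokens", "4096",
    ]
    return [convert, build, serve]


def run_pipeline(dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in pipeline_commands():
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

With dry_run left at True the script only prints the commands, which is a cheap way to confirm that the checkpoint directory written in step 1 is the one read in step 2, and that the engine directory written in step 2 is the one served in step 3.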

Problem:

I'm getting gibberish back from trtllm-serve:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "model": "nvidia/Llama-3.3-70B-Instruct-FP8",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello! Can you tell me a short joke?"
      }
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'

{"id":"chatcmpl-a75477bc76a04a11869e38541ad5efc8","object":"chat.completion","created":1750705089,"model":"nvidia/Llama-3.3-70B-Instruct-FP8","choices":[{"index":0,"message":{"role":"assistant","content":"561 See|| impunity involve photos donation reader spite plus black Magnus are\\\\\\\\. lã?Χ-valuep
ose61-value37のようなExceptionHandler families euro esosminer ·,anuts SHE Mona Controllers unilateral Đông Brooklyn let Invent installationconc A� to.T323     purchasing sessualiendedFocus mods�Google200 q-value� bankersselleriking360_the\"math see were518. againstroles helium thirdCUR Twig begin629.NE
T666 Expedition labeled to291 Rem inagrid39\\- responsibility Ром061_|.elementAt Insurance rm462asley subsequently ​​ biologist","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":51,"total_tokens":1
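
To make the reproduction scriptable, the same request can be sent from Python and the reply text pulled out of the response. This is a minimal sketch using only the standard library: the URL and payload are copied from the curl command above, the field names match the response shown above, and the function names build_request, extract_reply and send are hypothetical. The sample body in the demo is a placeholder with the same shape, not real server output.

```python
import json
import urllib.request

# Endpoint and payload copied from the curl command in this report.
URL = "http://localhost:8000/v1/chat/completions"
PAYLOAD = {
    "model": "nvidia/Llama-3.3-70B-Instruct-FP8",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello! Can you tell me a short joke?"},
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": False,
}


def build_request(url: str = URL, payload: dict = PAYLOAD) -> urllib.request.Request:
    """Build the POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def extract_reply(body: str) -> tuple[str, str]:
    """Return (content, finish_reason) of the first choice in a response body."""
    choice = json.loads(body)["choices"][0]
    return choice["message"]["content"], choice["finish_reason"]


def send(request: urllib.request.Request, timeout: float = 60.0) -> tuple[str, str]:
    """Send the request to a running server and return the parsed reply."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder body with the same shape as the response in this report.
    sample = json.dumps({
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": "<reply text>"},
            "finish_reason": "length",
        }]
    })
    print(extract_reply(sample))
    # Against a live trtllm-serve instance: print(send(build_request()))
```

A finish_reason of "length", as in the response above, means generation stopped at the max_tokens limit rather than at a stop token.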

Labels: bug (Something isn't working), triaged (Issue has been triaged by maintainers)