feat: add support for florence2 #4383

ducviet00 · 2025-05-16T03:04:54Z

What does this PR do?

This PR adds support for Florence-2.
Fixes: #2221

JinkeJ · 2025-06-23T07:21:25Z

Thanks for your contribution.
When I run the converted model, I get this error:

python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 963, in from_pretrained
    raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers_modules.Florence-2-large.configuration_florence2.Florence2Config'> to build an AutoTokenizer.

How did you solve this?

ducviet00 · 2025-06-23T07:38:52Z

@JinkeJ
Hi, this PR is not complete yet.
I’ve been busy and haven’t had much time to work on it.
Florence-2 is a combination of DaViT and BART, so you can convert DaViT to ONNX and then to TensorRT, and convert BART to TensorRT-LLM. With a few tweaks, that works fine for me.
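
A minimal sketch of the ONNX-to-TensorRT step described above, assuming the DaViT vision encoder has already been exported to ONNX (for example with torch.onnx.export) and that TensorRT's trtexec tool is on the PATH. The helper name and both file paths are hypothetical and not part of this PR; --onnx, --saveEngine and --fp16 are standard trtexec options. The BART half goes through the TensorRT-LLM enc_dec conversion and is not shown here.

```python
import shlex
from typing import List


def trtexec_command(onnx_path: str, engine_path: str, fp16: bool = True) -> List[str]:
    """Assemble a trtexec invocation that builds a TensorRT engine from an ONNX file.

    Hypothetical helper for illustration only; it builds the argument list
    and does not run anything.
    """
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        # Allow FP16 kernels when building the engine.
        cmd.append("--fp16")
    return cmd


# Hypothetical paths; adjust to wherever the exported ONNX file lives.
cmd = trtexec_command(
    "tmp/onnx/vision_encoder.onnx",
    "tmp/trt_engines/vision/model.engine",
)
print(shlex.join(cmd))
```

To actually build the engine, the list can be passed to subprocess.run(cmd, check=True).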

JinkeJ · 2025-06-23T07:44:20Z

@ducviet00 I already have "model.engine" and "config.json" under tmp/trt_engines, but how do I load the engine? I followed the instructions in your commit and got the error I mentioned above:

    python run.py \
        --max_new_tokens 30 \
        --input_text "<OD>" \
        --hf_model_dir tmp/hf_models/${MODEL_NAME} \
        --engine_dir tmp/trt_engines/${MODEL_NAME}/${INFERENCE_PRECISION}

Did you use a different method or other code to load it?

ducviet00 · 2025-06-23T08:02:14Z

@JinkeJ
You need to tweak run.py to make it work. Here is my example:

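# Imports this snippet relies on. The tensorrt_llm paths are assumed from
# recent TensorRT-LLM releases; adjust them to your installed version.
import json
import os
import time

import torch
from transformers import AutoProcessor
from tensorrt_llm._utils import torch_dtype_to_str, torch_dtype_to_trt, trt_dtype_to_torch
from tensorrt_llm.runtime import Session, TensorInfo
from tensorrt_llm.runtime.enc_dec_model_runner import EncDecModelRunner

# hf_dir, config, prompt, image, model_name, torch_dtype and debug_mode are
# assumed to be defined earlier in run.py.
processor = AutoProcessor.from_pretrained(hf_dir, trust_remote_code=True, config=config)
inputs = processor(text=prompt, images=image, return_tensors="pt")
input_ids = inputs["input_ids"]
pixel_values = inputs["pixel_values"]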
engine_name = "florence-2"
engine_dir = f"/workspace/models/trt_engines/{model_name}/{torch_dtype_to_str(torch_dtype)}/"
tllm_model = EncDecModelRunner.from_engine(engine_name, engine_dir, debug_mode=debug_mode, enable_context_fmha_fp32_acc=True)
# tllm_model.encoder_model_config
vocab_size = tllm_model.encoder_model_config.vocab_size
vision_encoder_path = f"{engine_dir}vision/model.engine"
with open(vision_encoder_path, 'rb') as f:
    engine_buffer = f.read()
visual_encoder_session = Session.from_serialized_engine(engine_buffer)

vision_config_path = f"{engine_dir}vision/config.json"
with open(vision_config_path, "r") as f:
    config = json.load(f)
# Number of image feature tokens produced by the vision encoder
image_features_len = 577
task_vocab_size = torch.tensor([image_features_len], dtype=torch.int32).cuda()
tasks = torch.zeros([588], dtype=torch.int32).cuda()
# Placeholder ids at or above vocab_size stand in for the image features
fake_visual_ids = torch.arange(vocab_size, vocab_size + image_features_len)
fake_visual_ids = fake_visual_ids.reshape(1, image_features_len).to("cuda")
encoder_input_ids = (
    torch.cat([fake_visual_ids, input_ids], dim=1).contiguous().to(torch.int32)
)
attention_mask = torch.ones(
    encoder_input_ids.shape, device=encoder_input_ids.device, dtype=torch.int32
)

tik = time.perf_counter()
visual_features = {
    "input": pixel_values.to(torch_dtype),
}
tensor_info = [
    TensorInfo("input", torch_dtype_to_trt(torch_dtype), pixel_values.shape),
]
visual_output_info = visual_encoder_session.infer_shapes(tensor_info)
visual_outputs = {
    t.name: torch.empty(
        tuple(t.shape), dtype=trt_dtype_to_torch(t.dtype), device=pixel_values.device
    )
    for t in visual_output_info
}
stream = torch.cuda.Stream(torch.cuda.current_device())
ok = visual_encoder_session.run(visual_features, visual_outputs, stream.cuda_stream)
stream.synchronize()
assert ok
image_features = visual_outputs["image_features"]
prompt_table = image_features
prompt_table = prompt_table.view((prompt_table.shape[0] * prompt_table.shape[1], prompt_table.shape[2]))
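# NOTE: `model` is not defined in this snippet; it is presumably the Hugging Face
# Florence-2 model loaded elsewhere in run.py.
decoder_input_ids = torch.IntTensor([[model.language_model.config.decoder_start_token_id]] * encoder_input_ids.shape[0]).to('cuda')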
tllm_output = tllm_model.generate(
    encoder_input_ids=encoder_input_ids,
    decoder_input_ids=decoder_input_ids,
    # attention_mask=attention_mask,
    max_new_tokens=300,
    num_beams=1,
    bos_token_id=processor.tokenizer.bos_token_id,
    pad_token_id=1,
    eos_token_id=processor.tokenizer.eos_token_id,
    debug_mode=debug_mode,
    return_dict=True,
    time_encoder=True,
    prompt_embedding_table=prompt_table,
    prompt_tasks=tasks,
    prompt_vocab_size=task_vocab_size,
    return_encoder_output=True
)
tok = time.perf_counter()
print("=" * 100)
print(tok - tik)
print("="*100)
print("TensorRT-LLM Output Text:")
for generated_ids in tllm_output["output_ids"]:
    print(generated_ids)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))
    print(parsed_answer)
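
The snippet above prepends 577 placeholder ids, starting at vocab_size, to the text ids and passes the image features as prompt_embedding_table. The toy sketch below illustrates that idea in plain Python, on the assumption that ids at or above vocab_size are resolved against the prompt table (row index = id - vocab_size) while ordinary ids use the word-embedding table. The function names, sizes and values are made up for illustration; this is not TensorRT-LLM code.

```python
from typing import List


def build_encoder_input_ids(text_ids: List[int], vocab_size: int, num_image_tokens: int) -> List[int]:
    """Prepend one placeholder id per image feature token to the text ids."""
    fake_visual_ids = list(range(vocab_size, vocab_size + num_image_tokens))
    return fake_visual_ids + text_ids


def lookup_embedding(
    token_id: int,
    vocab_size: int,
    word_embeddings: List[List[float]],
    prompt_table: List[List[float]],
) -> List[float]:
    """Resolve an id against the prompt table if it is a placeholder, else the vocabulary."""
    if token_id >= vocab_size:
        return prompt_table[token_id - vocab_size]
    return word_embeddings[token_id]


# Toy sizes: a 10-token vocabulary, 3 image tokens, 2-dimensional embeddings.
vocab_size = 10
word_embeddings = [[float(i), float(i)] for i in range(vocab_size)]
prompt_table = [[100.0 + i, 100.0 + i] for i in range(3)]  # stands in for image features

ids = build_encoder_input_ids([0, 5, 2], vocab_size, num_image_tokens=3)
embedded = [lookup_embedding(t, vocab_size, word_embeddings, prompt_table) for t in ids]
print(ids)
print(embedded)
```

In the real snippet the prompt table is the vision encoder output flattened to shape (batch * 577, hidden), which is what the prompt_table.view(...) call produces.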

ducviet00 added 3 commits May 15, 2025 17:20

add builder (27a6e46)
add florence2 language model conversion (cd1ee3e)
chore: add florence2 in enc_dec (3cf38b4)

juney-nvidia added the "Community want to contribute" (PRs initiated from Community) and "Community Engagement" (help/insights needed from community) labels May 16, 2025

poweiw added the "feature request" (New feature or request. This includes new model, dtype, functionality support), "new model" (Request to add a new model), "Generic Runtime" (General operational aspects of TRTLLM execution not in other categories.) and "triaged" (Issue has been triaged by maintainers) labels Jun 5, 2025

poweiw requested a review from schetlur-nv June 5, 2025 20:22
