Key Features:
-
SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
-
Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
-
Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
-
Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
Task Type | Description |
---|---|
Chat | A task to generate conversational style text output base on single or multi-modality input. |
To ensure smooth integration, please refer to the compatibility matrix below. It outlines the compatible versions of the model, instill-core
, and the python-sdk
.
Model Version | Instill-Core Version | Python-SDK Version |
---|---|---|
v0.1.0 | >v0.46.0-beta | >0.16.0 |
Note: Always ensure that you are using compatible versions to avoid unexpected issues.
Follow this guide to get your custom model up and running! But before you do that, please read through the following sections to have all the necessary files ready.
Install the compatible python-sdk
version according to the compatibility matrix:
pip install instill-sdk=={version}
To download the fine-tuned model weights, please execute the following command:
git clone https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-AWQ
After you've built the model image, and before pushing the model onto any Instill Core instance, you can test if the model can be successfully run locally first, by running the following command:
instill run admin/qwen-2-vl-72b-instruct -g -i '{"prompt": "whats in the pic? describe in one sentence", "image-url": "https://artifacts.instill.tech/imgs/bear.jpg"}'
The input payload should strictly follow the the below format
{
"prompt": "..."
}
A successful response will return a similar output to that shown below.
2024-12-04 02:45:16,462.462 INFO [Instill] Starting model image...
2024-12-04 02:45:32,050.050 INFO [Instill] Deploying model...
2024-12-04 02:46:05,012.012 INFO [Instill] Running inference...
2024-12-03 02:46:12,479.479 INFO [Instill] Outputs:
[{'data': {'choices': [{'created': 1733251572,
'finish-reason': 'length',
'index': 0,
'message': {'content': ['A brown bear sitting on a '
'grassy field, with one paw '
'raised as if waving.'],
'role': 'assistant'}}]}}]
2024-12-04 02:46:16,235.235 INFO [Instill] Done
Here is the list of flags supported by instill run
command
- -t, --tag: tag for the model image, default to
latest
- -g, --gpu: to pass through GPU from host into container or not, depends on if
gpu
is enabled in the config. - -i, --input: input in json format
Happy Modeling! 💡