Merged

Changes from 130 commits

Commits (155)
11d759c
only test
zRzRzRzRzRzRzR Dec 17, 2025
10fc39e
Merge branch 'huggingface:main' into cogview
zRzRzRzRzRzRzR Dec 22, 2025
413d2f4
Merge branch 'huggingface:main' into cogview
zRzRzRzRzRzRzR Dec 23, 2025
cd9956c
update
zRzRzRzRzRzRzR Dec 24, 2025
8e83ee7
use mrope
zRzRzRzRzRzRzR Dec 24, 2025
faaf33d
Merge remote-tracking branch 'upstream/main' into cogview
zRzRzRzRzRzRzR Dec 25, 2025
e5bd08e
new kind of impl
zRzRzRzRzRzRzR Dec 25, 2025
ba28d91
1
zRzRzRzRzRzRzR Dec 25, 2025
a136820
with vision?
zRzRzRzRzRzRzR Dec 26, 2025
ea57064
draft projector
zRzRzRzRzRzRzR Dec 27, 2025
e9f15a8
2
zRzRzRzRzRzRzR Dec 27, 2025
931c643
change vit shape
zRzRzRzRzRzRzR Dec 27, 2025
5873a98
use new config
zRzRzRzRzRzRzR Dec 27, 2025
58ada24
no tie
zRzRzRzRzRzRzR Dec 27, 2025
d3b4108
1
zRzRzRzRzRzRzR Dec 27, 2025
d66a0ac
use video token again
zRzRzRzRzRzRzR Dec 27, 2025
a39cf88
1
zRzRzRzRzRzRzR Dec 28, 2025
92a2322
remove video
zRzRzRzRzRzRzR Dec 28, 2025
67a59cf
Update modeling_glm_image.py
zRzRzRzRzRzRzR Dec 28, 2025
1da6998
1
zRzRzRzRzRzRzR Dec 30, 2025
cac0dc7
update
zRzRzRzRzRzRzR Dec 30, 2025
52aeace
Update modeling_glm_image.py
zRzRzRzRzRzRzR Dec 31, 2025
4e1eed3
update for test working
zRzRzRzRzRzRzR Dec 31, 2025
b4613d6
2
zRzRzRzRzRzRzR Jan 2, 2026
724275b
Delete modeling_siglip_tokenizer.py
zRzRzRzRzRzRzR Jan 2, 2026
8eceb91
1
zRzRzRzRzRzRzR Jan 2, 2026
da0d493
Delete modeling_siglip_tokenizer.py
zRzRzRzRzRzRzR Jan 2, 2026
67403d2
draft of vq
zRzRzRzRzRzRzR Jan 2, 2026
6f3c0c3
3
zRzRzRzRzRzRzR Jan 2, 2026
cff9919
2
zRzRzRzRzRzRzR Jan 2, 2026
dd71e05
testing
zRzRzRzRzRzRzR Jan 2, 2026
14db6fc
tes1
zRzRzRzRzRzRzR Jan 2, 2026
087cf3f
2
zRzRzRzRzRzRzR Jan 2, 2026
0b5360d
1
zRzRzRzRzRzRzR Jan 2, 2026
dd10578
12
zRzRzRzRzRzRzR Jan 2, 2026
bb4276b
using interpolate_pos_encoding
zRzRzRzRzRzRzR Jan 2, 2026
6c75bd3
vit prepare!
zRzRzRzRzRzRzR Jan 2, 2026
e0884b8
add processor
zRzRzRzRzRzRzR Jan 3, 2026
3d48d31
Delete modeling_siglip_flux_zh.py
zRzRzRzRzRzRzR Jan 3, 2026
fcdfdfc
2
zRzRzRzRzRzRzR Jan 3, 2026
7c34f14
input change
zRzRzRzRzRzRzR Jan 3, 2026
feb2bcb
add doc
zRzRzRzRzRzRzR Jan 3, 2026
d8823a2
Update glm_image.md
zRzRzRzRzRzRzR Jan 3, 2026
1f13301
bilinear
zRzRzRzRzRzRzR Jan 3, 2026
08a0078
using Qwen processing for multi image
zRzRzRzRzRzRzR Jan 3, 2026
5b2b3d9
update
zRzRzRzRzRzRzR Jan 4, 2026
3566f18
1
zRzRzRzRzRzRzR Jan 4, 2026
34738f5
4
zRzRzRzRzRzRzR Jan 4, 2026
9f4fea8
4
zRzRzRzRzRzRzR Jan 4, 2026
63edc1b
work
zRzRzRzRzRzRzR Jan 4, 2026
4361681
add fast processor
zRzRzRzRzRzRzR Jan 4, 2026
a7737b1
Update image_processing_auto.py
zRzRzRzRzRzRzR Jan 4, 2026
19daabf
GlmImageVQVAEResnetBlock
zRzRzRzRzRzRzR Jan 4, 2026
91bbfbb
2
zRzRzRzRzRzRzR Jan 4, 2026
27970c9
2
zRzRzRzRzRzRzR Jan 4, 2026
c853d12
using with new position
zRzRzRzRzRzRzR Jan 4, 2026
dc8e246
2
zRzRzRzRzRzRzR Jan 4, 2026
4b660e0
update
zRzRzRzRzRzRzR Jan 4, 2026
a5db1f0
1
zRzRzRzRzRzRzR Jan 4, 2026
d27b79f
preprocessing
zRzRzRzRzRzRzR Jan 4, 2026
1878f3b
2
zRzRzRzRzRzRzR Jan 4, 2026
577b923
for multi image
zRzRzRzRzRzRzR Jan 4, 2026
cd8d78f
2
zRzRzRzRzRzRzR Jan 4, 2026
a689905
for new decode
zRzRzRzRzRzRzR Jan 5, 2026
6c8b1ee
format
zRzRzRzRzRzRzR Jan 5, 2026
8cc46ed
doc
zRzRzRzRzRzRzR Jan 5, 2026
0bb1610
Merge branch 'main' into cogview
zRzRzRzRzRzRzR Jan 5, 2026
29afb44
1
zRzRzRzRzRzRzR Jan 5, 2026
899c3fc
using right patch_size
zRzRzRzRzRzRzR Jan 5, 2026
ea58b59
fix copy?
zRzRzRzRzRzRzR Jan 5, 2026
9e678ed
add para
zRzRzRzRzRzRzR Jan 5, 2026
f4ebfec
update
zRzRzRzRzRzRzR Jan 5, 2026
e3604b5
image token
zRzRzRzRzRzRzR Jan 5, 2026
fb07e1e
not working for fix_and_overwrite
zRzRzRzRzRzRzR Jan 5, 2026
3024962
remove indentation
zRzRzRzRzRzRzR Jan 5, 2026
1c940da
remove resnet
zRzRzRzRzRzRzR Jan 5, 2026
e67e0fa
add
zRzRzRzRzRzRzR Jan 5, 2026
b179db8
fix
zRzRzRzRzRzRzR Jan 5, 2026
7312ed2
temporal_patch_size remove
zRzRzRzRzRzRzR Jan 5, 2026
31623f9
support processor
zRzRzRzRzRzRzR Jan 5, 2026
042249a
update for some test
zRzRzRzRzRzRzR Jan 5, 2026
93ee4ca
Merge branch 'main' into cogview
zRzRzRzRzRzRzR Jan 5, 2026
7a3b6de
2
zRzRzRzRzRzRzR Jan 5, 2026
8394eb1
Merge branch 'cogview' of github.com:zRzRzRzRzRzRzR/transformers into…
zRzRzRzRzRzRzR Jan 5, 2026
40c9b65
update1
zRzRzRzRzRzRzR Jan 5, 2026
0f5ed53
Update modular_glm_image.py
zRzRzRzRzRzRzR Jan 5, 2026
4e0784e
update2
zRzRzRzRzRzRzR Jan 5, 2026
147daaf
update 2
zRzRzRzRzRzRzR Jan 5, 2026
19fcd6f
3
zRzRzRzRzRzRzR Jan 5, 2026
07d1942
4
zRzRzRzRzRzRzR Jan 5, 2026
58453a7
rebase init weight
zRzRzRzRzRzRzR Jan 5, 2026
13bc79f
check_docstrings
zRzRzRzRzRzRzR Jan 5, 2026
761bd87
fix some generation tests
zucchini-nlp Jan 6, 2026
091c0a0
skip the rest of tests
zucchini-nlp Jan 6, 2026
f309dee
add get_image_tokens
zRzRzRzRzRzRzR Jan 8, 2026
6591895
unused code
zucchini-nlp Jan 8, 2026
25ffbd0
Merge branch 'main' into cogview
zRzRzRzRzRzRzR Jan 8, 2026
9ba7540
update for main change?
zRzRzRzRzRzRzR Jan 8, 2026
68e0e15
using main typo
zRzRzRzRzRzRzR Jan 8, 2026
ff63ba0
fix FA2
zucchini-nlp Jan 8, 2026
2c1034a
update doc
zRzRzRzRzRzRzR Jan 8, 2026
b0393da
push rope index update
zucchini-nlp Jan 8, 2026
31151d3
GlmImageTextRotaryEmbedding
zRzRzRzRzRzRzR Jan 8, 2026
300234b
Delete test.png
zRzRzRzRzRzRzR Jan 8, 2026
941f875
1
zRzRzRzRzRzRzR Jan 8, 2026
3cb5c54
update
zRzRzRzRzRzRzR Jan 8, 2026
1c73033
3
zRzRzRzRzRzRzR Jan 8, 2026
6cf7ebb
Update modular_glm_image.py
zRzRzRzRzRzRzR Jan 8, 2026
df6d359
Update modular_glm_image.py
zRzRzRzRzRzRzR Jan 8, 2026
7063523
1
zRzRzRzRzRzRzR Jan 8, 2026
80629be
simply modular
zRzRzRzRzRzRzR Jan 9, 2026
998021a
Update modular_glm_image.py
zRzRzRzRzRzRzR Jan 9, 2026
28baf48
doc update
zRzRzRzRzRzRzR Jan 9, 2026
2b7884b
Update glmasr.md
zRzRzRzRzRzRzR Jan 9, 2026
1b2b63b
Merge branch 'main' into cogview
zRzRzRzRzRzRzR Jan 9, 2026
00a8e12
update attn
zRzRzRzRzRzRzR Jan 9, 2026
16f77aa
make position ids shape correct but needs checking values with mult i…
zucchini-nlp Jan 9, 2026
9ef5286
revert
zRzRzRzRzRzRzR Jan 9, 2026
3405107
revert
zRzRzRzRzRzRzR Jan 9, 2026
8184713
update
zRzRzRzRzRzRzR Jan 9, 2026
8a37eeb
1
zRzRzRzRzRzRzR Jan 9, 2026
526a960
1
zRzRzRzRzRzRzR Jan 9, 2026
53f6a01
2
zRzRzRzRzRzRzR Jan 9, 2026
8092122
must add device change
zRzRzRzRzRzRzR Jan 9, 2026
4b4380e
1
zRzRzRzRzRzRzR Jan 9, 2026
0886080
update
zRzRzRzRzRzRzR Jan 9, 2026
5ec417e
using llama type
zRzRzRzRzRzRzR Jan 9, 2026
fa50824
2
zRzRzRzRzRzRzR Jan 9, 2026
4c511ba
Update modular_glm_image.py
zRzRzRzRzRzRzR Jan 9, 2026
4d86dc0
models can't run, fix
zucchini-nlp Jan 9, 2026
33bd7a9
position ids, second try. Should work now
zucchini-nlp Jan 9, 2026
90e7768
Update modular_glm_image.py
zRzRzRzRzRzRzR Jan 10, 2026
f334e99
remove
zRzRzRzRzRzRzR Jan 12, 2026
b93a714
move prompt expand inside processing
zucchini-nlp Jan 12, 2026
f2e9ff4
typos and tiny fixes
zucchini-nlp Jan 12, 2026
bf95580
make it runnable with example script
zucchini-nlp Jan 12, 2026
c8c723b
nit: let's follow standard API
zucchini-nlp Jan 12, 2026
238d6db
using right
zRzRzRzRzRzRzR Jan 12, 2026
d55151e
Merge branch 'cogview' of github.com:zRzRzRzRzRzRzR/transformers into…
zRzRzRzRzRzRzR Jan 12, 2026
ac9cee1
update doc
zRzRzRzRzRzRzR Jan 12, 2026
74a467d
update
zRzRzRzRzRzRzR Jan 12, 2026
82c0530
update
zRzRzRzRzRzRzR Jan 12, 2026
fe7650d
resolution changed
zRzRzRzRzRzRzR Jan 12, 2026
fc582db
udate
zRzRzRzRzRzRzR Jan 12, 2026
9468522
1
zRzRzRzRzRzRzR Jan 12, 2026
34eae52
Merge branch 'main' into cogview
zRzRzRzRzRzRzR Jan 12, 2026
e27fd18
2
zRzRzRzRzRzRzR Jan 12, 2026
2d84676
3
zRzRzRzRzRzRzR Jan 12, 2026
a137785
Update check_repo.py
zRzRzRzRzRzRzR Jan 12, 2026
d750318
skip/overwrite tests
zucchini-nlp Jan 12, 2026
ef3af15
Merge branch 'main' into cogview
zucchini-nlp Jan 12, 2026
05510d6
Merge branch 'main' into cogview
sayakpaul Jan 13, 2026
0be1887
swap h and w in position ids!
zucchini-nlp Jan 13, 2026
8b3336f
Merge branch 'main' into cogview
zucchini-nlp Jan 13, 2026
d4350b4
require read token does not exist anymore. Wait, why is that not fixe…
zucchini-nlp Jan 13, 2026
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -519,6 +519,8 @@
title: glm4
- local: model_doc/glm4_moe
title: glm4_moe
- local: model_doc/glm_image
title: GlmImage
- local: model_doc/openai-gpt
title: GPT
- local: model_doc/gpt_neo
53 changes: 43 additions & 10 deletions docs/source/en/model_doc/glm46v.md
@@ -1,22 +1,55 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
<!--Copyright 2025 the HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0
http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.

-->
*This model was released on {release_date} and added to Hugging Face Transformers on 2025-11-15.*
*This model was released on 2025-12-09 and added to Hugging Face Transformers on 2025-11-15.*

# GLM-4.6V

## Overview

The GLM-V model was proposed in [GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning](https://huggingface.co/papers/2507.01006v6).

The abstract from the paper is the following:

> *We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance
general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of
the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential
through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose
Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to
comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video
understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a
comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks
among open-source models of similar size, and demonstrates competitive or even superior results compared to
closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the
smaller GLM-4.1V-9B-Thinking remains highly competitive-achieving superior results to the much larger Qwen2.5-VL-72B on
29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. We further introduce the GLM-4.6V series,
open-source multimodal models with native tool use and a 128K context window. A brief overview is available at this
https URL. Code, models and more information are released at https://github.com/zai-org/GLM-V*

## Supported models

This model processor supports the following zai-org models:

+ [GLM-4.6V-Flash](https://huggingface.co/zai-org/GLM-4.6V-Flash)
+ [GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V)

This model was contributed by [Raushan Turganbay](https://huggingface.co/RaushanTurganbay) and [Yuxuan Zhang](https://huggingface.co/ZHANGYUXUAN-zR).

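As a quick sanity check that the checkpoints above load end to end, here is a minimal sketch using the generic [`Pipeline`] API (the same pattern shown in the GLM-V usage docs). The task string, dtype, and generation settings are assumptions and may need adjusting once the final checkpoints are published.

```py
import torch
from transformers import pipeline

# Minimal sketch, assuming the GLM-4.6V-Flash checkpoint linked above exposes the
# standard image-text-to-text pipeline task; dtype and max_new_tokens are illustrative.
pipe = pipeline(
    task="image-text-to-text",
    model="zai-org/GLM-4.6V-Flash",
    torch_dtype=torch.bfloat16,
)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
print(pipe(text=messages, max_new_tokens=64, return_full_text=False))
```
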
## Glm46VConfig

[[autodoc]] Glm46VConfig
120 changes: 67 additions & 53 deletions docs/source/en/model_doc/glm4v.md
@@ -1,49 +1,61 @@
<!--Copyright 2025 The ZhipuAI Inc. and The HuggingFace Inc. team. All rights reserved.
<!--Copyright 2025 the HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0
http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.

-->
*This model was released on 2025-07-01 and added to Hugging Face Transformers on 2025-06-25.*

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white"> </div>
</div>

# GLM-4.1V
# GLM-V

## Overview

**GLM-4.1V-9B-Thinking** is a bilingual vision-language model optimized for reasoning, built on GLM-4-9B. It introduces
a "thinking paradigm" with reinforcement learning, achieving state-of-the-art results among 10B-class models and
rivaling 72B-scale models. It supports 64k context, 4K resolution, and arbitrary aspect ratios, with an open-source base
model for further research. You can check our paper [here](https://huggingface.co/papers/2507.01006). and below is a abstract.

*We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding
and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework.
We first develop a capable vision foundation model with significant potential through large-scale pre-training, which
arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum
Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a
diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding,
GUI-based agents, and long document understanding. We open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art
performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model
outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks
relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or
superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document
understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information
are released at https://github.com/THUDM/GLM-4.1V-Thinking.*
The GLM-V model was proposed in [GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning](https://huggingface.co/papers/2507.01006v6).

The abstract from the paper is the following:

> *We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance
general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of
the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential
through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose
Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to
comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video
understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a
comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks
among open-source models of similar size, and demonstrates competitive or even superior results compared to
closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the
smaller GLM-4.1V-9B-Thinking remains highly competitive-achieving superior results to the much larger Qwen2.5-VL-72B on
29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. We further introduce the GLM-4.6V series,
open-source multimodal models with native tool use and a 128K context window. A brief overview is available at this
https URL. Code, models and more information are released at https://github.com/zai-org/GLM-V*

## Supported models

This model type supports the following zai-org models:

+ [GLM-4.1V-9B-Base](https://huggingface.co/zai-org/GLM-4.1V-9B-Base)
+ [GLM-4.1V-9B-Thinking](https://huggingface.co/zai-org/GLM-4.1V-9B-Thinking)
+ [GLM-4.6V-Flash](https://huggingface.co/zai-org/GLM-4.6V-Flash)
+ [AutoGLM-Phone-9B](https://huggingface.co/zai-org/AutoGLM-Phone-9B)
+ [AutoGLM-Phone-9B-Multilingual](https://huggingface.co/zai-org/AutoGLM-Phone-9B-Multilingual)
+ [Glyph](https://huggingface.co/zai-org/Glyph)
+ [WebVIA-Agent](https://huggingface.co/zai-org/WebVIA-Agent)
+ [UI2Code_N](https://huggingface.co/zai-org/UI2Code_N)

This model was contributed by [Raushan Turganbay](https://huggingface.co/RaushanTurganbay)
and [Yuxuan Zhang](https://huggingface.co/ZHANGYUXUAN-zR).

## Usage

@@ -55,6 +67,7 @@ The example below demonstrates how to generate text based on an image with [`Pip
```py
import torch
from transformers import pipeline

pipe = pipeline(
task="image-text-to-text",
model="THUDM/GLM-4.1V-9B-Thinking",
@@ -69,11 +82,11 @@ messages = [
"type": "image",
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
},
{ "type": "text", "text": "Describe this image."},
{"type": "text", "text": "Describe this image."},
]
}
]
pipe(text=messages,max_new_tokens=20, return_full_text=False)
pipe(text=messages, max_new_tokens=20, return_full_text=False)
```

</hfoption>
@@ -92,15 +105,15 @@ model = Glm4vForConditionalGeneration.from_pretrained(
processor = AutoProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")
messages = [
{
"role":"user",
"content":[
"role": "user",
"content": [
{
"type":"image",
"type": "image",
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
},
{
"type":"text",
"text":"Describe this image."
"type": "text",
"text": "Describe this image."
}
]
}
@@ -117,10 +130,10 @@ inputs = processor.apply_chat_template(

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
@@ -160,9 +173,10 @@ messages = [
],
}
]
inputs = processor.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt", padding=True).to(model.device)
inputs = processor.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_dict=True,
return_tensors="pt", padding=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=1.0)
output_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1] :], skip_special_tokens=True)
output_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(output_text)
```

@@ -181,17 +195,17 @@ print(output_text)
## Glm4vImageProcessor

[[autodoc]] Glm4vImageProcessor
- preprocess
- preprocess

## Glm4vVideoProcessor

[[autodoc]] Glm4vVideoProcessor
- preprocess
- preprocess

## Glm4vImageProcessorFast

[[autodoc]] Glm4vImageProcessorFast
- preprocess
- preprocess

## Glm4vProcessor

@@ -201,19 +215,19 @@ print(output_text)
## Glm4vVisionModel

[[autodoc]] Glm4vVisionModel
- forward
- forward

## Glm4vTextModel

[[autodoc]] Glm4vTextModel
- forward
- forward

## Glm4vModel

[[autodoc]] Glm4vModel
- forward
- forward

## Glm4vForConditionalGeneration

[[autodoc]] Glm4vForConditionalGeneration
- forward
- forward