[GLM-Image] AR Model Support for GLM-Image #43100
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,247 @@ | ||
| <!--Copyright 2025 the HuggingFace Team. All rights reserved. | ||
| Licensed under the Apache License, Version 2.0 (the "License"); | ||
| you may not use this file except in compliance with the License. | ||
| You may obtain a copy of the License at | ||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer. | ||
| --> | ||
| *This model was released on {release_date} and added to Hugging Face Transformers on 2026-01-10.* | ||
|
|
||
| # GlmImage | ||
|
|
||
| ## Overview | ||
|
|
||
| GLM-Image is an image generation model that adopts a hybrid autoregressive + diffusion decoder architecture, effectively pushing the upper bound of visual fidelity and fine-grained detail. In general image generation quality it is on par with industry-standard LDM-based approaches, while demonstrating significant advantages in knowledge-intensive image generation scenarios. | ||
|
|
||
| The model architecture is a hybrid autoregressive + diffusion decoder design: | ||
|
|
||
| + Autoregressive generator: a 9B-parameter model initialized from [GLM-4-9B-0414](https://huggingface.co/zai-org/GLM-4-9B-0414), with an expanded vocabulary to incorporate visual tokens. The model first generates a compact encoding of approximately 256 tokens, then expands it to 1K–4K tokens, corresponding to 1K–2K high-resolution image outputs (see the sketch below). | ||
| + Diffusion decoder: a 7B-parameter decoder based on a single-stream DiT architecture for latent-space image decoding. It is equipped with a Glyph Encoder text module, significantly improving accurate text rendering within images. | ||
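For intuition on the two-stage token budget, the sketch below reproduces the preview-grid arithmetic from the usage example further down (the ×16 scaling and the rounding are taken from that example, not from a formal spec): a 36 × 24 target grid yields a compact preview of roughly 256 tokens before the full-resolution pass.

```python
from math import sqrt

# Target (large) image token grid, e.g. parsed from a "<sop>36 24<eop>" shape tag
token_h, token_w = 36, 24

# Compact preview grid: keep the aspect ratio and scale to roughly 16 x 16 = 256 tokens
ratio = token_h / token_w
prev_token_h = int(sqrt(ratio) * 16)      # 19
prev_token_w = int(sqrt(1 / ratio) * 16)  # 13

print(prev_token_h * prev_token_w)  # 247 tokens -> the compact first-stage encoding (~256)
print(token_h * token_w)            # 864 tokens -> the full-resolution second stage
```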
|
|
||
| Post-training with decoupled reinforcement learning: the model introduces a fine-grained, modular feedback strategy using the GRPO algorithm, substantially enhancing both semantic understanding and visual detail quality. | ||
|
|
||
| + Autoregressive module: provides low-frequency feedback signals focused on aesthetics and semantic alignment, improving instruction following and artistic expressiveness. | ||
| + Decoder module: delivers high-frequency feedback targeting detail fidelity and text accuracy, resulting in highly realistic textures, lighting, and color reproduction, as well as more precise text rendering. | ||
|
|
||
| GLM-Image supports both text-to-image and image-to-image generation within a single model: | ||
|
|
||
| + Text-to-image: generates high-detail images from textual descriptions, with particularly strong performance in information-dense scenarios. | ||
| + Image-to-image: supports a wide range of tasks, including image editing, style transfer, multi-subject consistency, and identity-preserving generation for people and objects. | ||
|
|
||
| + `GlmImageForConditionalGeneration` is the autoregressive (AR) part of the GLM-Image model; for the full image generation pipeline, see the [Diffusers GLM-Image pipeline](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/glm_image). | ||
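For end-to-end generation (AR sampling plus diffusion decoding in one call), the Diffusers pipeline linked above wraps both stages. Below is a minimal sketch only: it assumes the pipeline resolves through `DiffusionPipeline.from_pretrained` and accepts a plain text prompt, so check the Diffusers documentation for the actual class name and call signature.

```python
import torch
from diffusers import DiffusionPipeline

# Assumption: the GLM-Image pipeline is auto-resolved from the checkpoint's pipeline config.
pipe = DiffusionPipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Assumption: a plain text prompt is accepted; shape/resolution arguments may differ.
image = pipe(prompt="A watercolor painting of a lighthouse at dusk").images[0]
image.save("lighthouse.png")
```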
|
|
||
| This model was contributed by [Raushan Turganbay](https://huggingface.co/RaushanTurganbay) and [Yuxuan Zhang](https://huggingface.co/ZHANGYUXUAN-zR). | ||
|
|
||
| ## Usage examples | ||
|
|
||
| The examples below use the AR model to generate vision tokens, which are then consumed by the DiT-based diffusion decoder. | ||
|
|
||
| ### Text-to-Image Generation | ||
|
|
||
| ```python | ||
| from transformers import GlmImageForConditionalGeneration, AutoProcessor | ||
| import torch | ||
| import re | ||
| from math import sqrt | ||
|
|
||
| # Load model and processor | ||
| model_id = "zai-org/GLM-Image" | ||
| model = GlmImageForConditionalGeneration.from_pretrained( | ||
| model_id, | ||
| torch_dtype=torch.bfloat16, | ||
| device_map="auto" | ||
| ) | ||
| processor = AutoProcessor.from_pretrained(model_id, use_fast=True) | ||
|
|
||
|
|
||
| def parse_shape_info(prompt: str) -> tuple[str, int, int, int, int]: | ||
| """Parse image dimensions and expand shape tokens for two-stage generation.""" | ||
| match = re.search(r'<sop>(\d+)\s+(\d+)<eop>', prompt) | ||
| token_h, token_w = int(match.group(1)), int(match.group(2)) | ||
| ratio = token_h / token_w | ||
| prev_token_h = int(sqrt(ratio) * 16) | ||
| prev_token_w = int(sqrt(1 / ratio) * 16) | ||
|
|
||
| old_shape = f'<sop>{token_h} {token_w}<eop>' | ||
| new_shape = f'<sop>{token_h} {token_w}<eop><sop>{prev_token_h} {prev_token_w}<eop>' | ||
| expanded_prompt = prompt.replace(old_shape, new_shape) | ||
|
|
||
| return expanded_prompt, token_h, token_w, prev_token_h, prev_token_w | ||
|
|
||
|
|
||
| # Text-to-Image Generation | ||
| prompt = "A cute cartoon-style text design featuring the word 'Taro' in clean, bright white rounded letters with a soft, hand-drawn feel. The background is a gentle taro purple with a misty gradient effect, decorated with small stars, hearts, and bubble elements. The overall atmosphere is light and sweet, with soft lighting like afternoon sunshine casting a warm glow from the upper left.<sop>36 24<eop>" | ||
|
|
||
| prompt, token_h, token_w, prev_h, prev_w = parse_shape_info(prompt) | ||
| print(f"Large image: {token_h} x {token_w} = {token_h * token_w} tokens") | ||
| print(f"Small image: {prev_h} x {prev_w} = {prev_h * prev_w} tokens") | ||
|
|
||
| messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}] | ||
|
||
|
|
||
| inputs = processor.apply_chat_template( | ||
| messages, | ||
| add_generation_prompt=True, | ||
| tokenize=True, | ||
| return_dict=True, | ||
| return_tensors="pt", | ||
| ) | ||
|
|
||
| # Build image grid for two-stage generation (small image + large image) | ||
| inputs["image_grid_thw"] = torch.tensor([ | ||
| [1, token_h, token_w], | ||
| [1, prev_h, prev_w], | ||
| ]) | ||
|
||
|
|
||
| # Calculate generation parameters | ||
| small_image_tokens = prev_h * prev_w | ||
| large_image_tokens = token_h * token_w | ||
| max_new_tokens = small_image_tokens + large_image_tokens + 1 | ||
|
|
||
| inputs = inputs.to(model.device) | ||
|
|
||
|
||
| # Generate image tokens | ||
| outputs = model.generate( | ||
| **inputs, | ||
| max_new_tokens=max_new_tokens, | ||
| do_sample=True | ||
| ) | ||
|
|
||
| # Extract large image tokens (skip small image tokens) | ||
| input_length = inputs["input_ids"].shape[-1] | ||
| generated_tokens = outputs[0][input_length:] | ||
| large_image_tokens_ids = generated_tokens[small_image_tokens:small_image_tokens + large_image_tokens].tolist() | ||
|
||
|
|
||
| print(f"Total generated tokens: {len(outputs[0]) - input_length}") | ||
| print(f"Large image tokens: {len(large_image_tokens_ids)}") | ||
| ``` | ||
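Because sampling runs in two stages, the generated sequence contains the compact preview tokens first, followed by the full-resolution tokens. A short sanity check, continuing from the script above (and assuming generation ran for the full `max_new_tokens` without stopping early), makes that layout explicit:

```python
# Preview tokens come first, full-resolution tokens second.
small_image_tokens_ids = generated_tokens[:small_image_tokens].tolist()

print(f"Preview tokens: {len(small_image_tokens_ids)} (expected {prev_h * prev_w})")
print(f"Full-resolution tokens: {len(large_image_tokens_ids)} (expected {token_h * token_w})")
```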
|
|
||
| ### Image-to-Image Generation | ||
|
|
||
| Image-to-image generation reuses the text-to-image script above; only the prompt and the input preparation change: | ||
|
|
||
| ```python | ||
| # Image-to-Image Generation | ||
| from PIL import Image | ||
|
|
||
| prompt = "Transform this image into a watercolor painting style with soft, flowing brushstrokes and pastel colors.<sop>36 24<eop>" | ||
|
|
||
| prompt, token_h, token_w, prev_h, prev_w = parse_shape_info(prompt) | ||
| print(f"Large image: {token_h} x {token_w} = {token_h * token_w} tokens") | ||
| print(f"Small image: {prev_h} x {prev_w} = {prev_h * prev_w} tokens") | ||
|
|
||
| # Load input image | ||
| image_path = "input.png" # Replace with your image path | ||
|
|
||
| messages = [ | ||
| { | ||
| "role": "user", | ||
| "content": [ | ||
| {"type": "image", "url": image_path}, | ||
| {"type": "text", "text": prompt}, | ||
| ], | ||
| } | ||
| ] | ||
|
|
||
| inputs = processor.apply_chat_template( | ||
| messages, | ||
| add_generation_prompt=True, | ||
| tokenize=True, | ||
| return_dict=True, | ||
| return_tensors="pt", | ||
| ) | ||
|
|
||
| # Get existing image grid from input image and append target image dimensions | ||
| existing_grid = inputs.get("image_grid_thw") | ||
| inputs["image_grid_thw"] = torch.cat([ | ||
| existing_grid, | ||
| torch.tensor([[1, token_h, token_w]]) | ||
| ], dim=0) | ||
|
||
|
|
||
| # For image-to-image, only generate large image tokens (no small preview needed) | ||
| large_image_tokens = token_h * token_w | ||
| max_new_tokens = large_image_tokens + 1 | ||
|
|
||
| inputs = inputs.to(model.device) | ||
|
|
||
| # Generate image tokens | ||
| outputs = model.generate( | ||
| **inputs, | ||
| max_new_tokens=max_new_tokens, | ||
| do_sample=True | ||
| ) | ||
|
|
||
| # Extract generated image tokens | ||
| input_length = inputs["input_ids"].shape[-1] | ||
| generated_tokens = outputs[0][input_length:] | ||
| large_image_tokens_ids = generated_tokens[:large_image_tokens].tolist() | ||
|
|
||
| print(f"Total generated tokens: {len(outputs[0]) - input_length}") | ||
| print(f"Large image tokens: {len(large_image_tokens_ids)}") | ||
| ``` | ||
|
|
||
| ## GlmImageConfig | ||
|
|
||
| [[autodoc]] GlmImageConfig | ||
|
|
||
| ## GlmImageVisionConfig | ||
|
|
||
| [[autodoc]] GlmImageVisionConfig | ||
|
|
||
| ## GlmImageTextConfig | ||
|
|
||
| [[autodoc]] GlmImageTextConfig | ||
|
|
||
| ## GlmImageVQVAEConfig | ||
|
|
||
| [[autodoc]] GlmImageVQVAEConfig | ||
|
|
||
| ## GlmImageImageProcessor | ||
|
|
||
| [[autodoc]] GlmImageImageProcessor | ||
| - preprocess | ||
|
|
||
| ## GlmImageImageProcessorFast | ||
|
|
||
| [[autodoc]] GlmImageImageProcessorFast | ||
| - preprocess | ||
|
|
||
| ## GlmImageProcessor | ||
|
|
||
| [[autodoc]] GlmImageProcessor | ||
|
|
||
| ## GlmImageVisionModel | ||
|
|
||
| [[autodoc]] GlmImageVisionModel | ||
| - forward | ||
|
|
||
| ## GlmImageTextModel | ||
|
|
||
| [[autodoc]] GlmImageTextModel | ||
| - forward | ||
|
|
||
| ## GlmImageVQVAE | ||
|
|
||
| [[autodoc]] GlmImageVQVAE | ||
| - forward | ||
|
|
||
| ## GlmImageModel | ||
|
|
||
| [[autodoc]] GlmImageModel | ||
| - forward | ||
|
|
||
| ## GlmImageForConditionalGeneration | ||
|
|
||
| [[autodoc]] GlmImageForConditionalGeneration | ||
| - forward | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -182,6 +182,10 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin): | |
| ("glm4v_moe_vision", "Glm4vMoeVisionModel"), | ||
| ("glm4v_text", "Glm4vTextModel"), | ||
| ("glm4v_vision", "Glm4vVisionModel"), | ||
| ("glm_image", "GlmImageModel"), | ||
| ("glm_image_text", "GlmImageTextModel"), | ||
| ("glm_image_vision", "GlmImageVisionModel"), | ||
| ("glm_image_vqmodel", "GlmImageVQVAE"), | ||
| ("glmasr", "GlmAsrForConditionalGeneration"), | ||
| ("glmasr_encoder", "GlmAsrEncoder"), | ||
| ("glpn", "GLPNModel"), | ||
|
|
@@ -1022,6 +1026,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin): | |
| ("glm46v", "Glm46VForConditionalGeneration"), | ||
| ("glm4v", "Glm4vForConditionalGeneration"), | ||
| ("glm4v_moe", "Glm4vMoeForConditionalGeneration"), | ||
| ("glm_image", "GlmImageForConditionalGeneration"), | ||
|
||
| ("got_ocr2", "GotOcr2ForConditionalGeneration"), | ||
| ("idefics", "IdeficsForVisionText2Text"), | ||
| ("idefics2", "Idefics2ForConditionalGeneration"), | ||
|
|
||
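These registrations let the `Auto*` classes resolve the `glm_image` model type without importing the new classes directly. The sketch below relies only on the first hunk, assuming (as the surrounding entries suggest) that it is the base `AutoModel` mapping; which auto class the second hunk feeds is not spelled out here, so it is left out.

```python
from transformers import AutoConfig, AutoModel

# The registered model type resolves to the new classes.
config = AutoConfig.from_pretrained("zai-org/GLM-Image")
print(config.model_type)  # "glm_image"

# Resolves to GlmImageModel via the base-model mapping; builds an
# untrained instance from the config alone.
model = AutoModel.from_config(config)
print(type(model).__name__)  # GlmImageModel
```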
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| # Copyright 2025 the HuggingFace Team. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| from typing import TYPE_CHECKING | ||
|
|
||
| from ...utils import _LazyModule | ||
| from ...utils.import_utils import define_import_structure | ||
|
|
||
|
|
||
| if TYPE_CHECKING: | ||
| from .configuration_glm_image import * | ||
| from .image_processing_glm_image import * | ||
| from .image_processing_glm_image_fast import * | ||
| from .modeling_glm_image import * | ||
| from .processing_glm_image import * | ||
| else: | ||
| import sys | ||
|
|
||
| _file = globals()["__file__"] | ||
| sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__) |
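The `_LazyModule` wrapper defers the real imports until one of the module's public names is first accessed, which keeps `import transformers` fast. A small illustration, assuming the classes above are re-exported at the top level as for other models:

```python
import transformers

# Nothing from glm_image is imported yet; attribute access triggers the lazy load.
print(transformers.GlmImageConfig)
print(transformers.GlmImageForConditionalGeneration)
```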
can be part of processor's call imo, rather than asking users to compute h/w every time
This only takes effect during text-to-image generation. Additionally, the second generated `{token_h} {token_w} {prev_token_h} {prev_token_w}` also needs to be fed into the tokenizer as text.
we can still do in processor imo. If it's only text-to-image, we check `if images is None` and then `self.apply_text_only_processing(text)`. After that we can pass it to tokenizer.