[GLM-Image] New Models Support #12921
Open
zRzRzRzRzRzRzR
wants to merge
39
commits into
huggingface:main
from
zRzRzRzRzRzRzR:cogview
+1,627
−0
39 commits
ec9a82f
init
zRzRzRzRzRzRzR b98decf
add
zRzRzRzRzRzRzR 57fd26d
add 1
zRzRzRzRzRzRzR bcc9c30
Update __init__.py
zRzRzRzRzRzRzR e13fb76
rename
zRzRzRzRzRzRzR adcc532
2
zRzRzRzRzRzRzR ec678a1
update
zRzRzRzRzRzRzR 22fe6c9
init with encoder
zRzRzRzRzRzRzR b3d1b55
merge2pipeline
zRzRzRzRzRzRzR acd13d8
Merge branch 'huggingface:main' into cogview
zRzRzRzRzRzRzR e2b31f8
Update pipeline_glm_image.py
zRzRzRzRzRzRzR 1cf277d
remove sop
zRzRzRzRzRzRzR 170d0ba
remove useless func
zRzRzRzRzRzRzR 144c075
Update pipeline_glm_image.py
zRzRzRzRzRzRzR 041ddec
Merge branch 'main' into cogview
zRzRzRzRzRzRzR 86f5ce4
up
yiyixuxu 64f3842
Merge branch 'cogview' of https://github.com/zRzRzRzRzRzRzR/diffusers…
zRzRzRzRzRzRzR c65f224
review for work only
zRzRzRzRzRzRzR 8d80b76
Merge branch 'main' into cogview
zRzRzRzRzRzRzR e70ebc0
change place
zRzRzRzRzRzRzR 762f9a3
Update pipeline_glm_image.py
zRzRzRzRzRzRzR 5a0a9fa
update
zRzRzRzRzRzRzR 2ae574a
Update transformer_glm_image.py
zRzRzRzRzRzRzR 264f930
1
zRzRzRzRzRzRzR e9b2c89
no negative_prompt for GLM-Image
zRzRzRzRzRzRzR e4f6549
remove CogView4LoraLoaderMixin
zRzRzRzRzRzRzR 51f8015
refactor attention processor.
sayakpaul 075b6a9
update
zRzRzRzRzRzRzR e2d4bda
fix
sayakpaul 854e861
use staticmethod
zRzRzRzRzRzRzR 7862217
update
zRzRzRzRzRzRzR 1226fcb
up
sayakpaul 68ebb42
up
sayakpaul 3b154cf
Merge pull request #4 from huggingface/zRzRzRzRzRzRzR-cogview
zRzRzRzRzRzRzR 40559ca
update
zRzRzRzRzRzRzR 19fc76b
Update glm_image.md
zRzRzRzRzRzRzR 2c21dad
Merge branch 'main' into cogview
sayakpaul d2a5146
1
zRzRzRzRzRzRzR 6cfc83b
Update pipeline_glm_image.py
zRzRzRzRzRzRzR
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| <!--Copyright 2025 The HuggingFace Team. All rights reserved. | ||
| Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | ||
| the License. You may obtain a copy of the License at | ||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
| Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | ||
| an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
| specific language governing permissions and limitations under the License. --> | ||
|
|
||
| # GlmImageTransformer2DModel | ||
|
|
||
| A Diffusion Transformer model for 2D data from [GLM-Image](https://huggingface.co/zai-org/GLM-Image). | ||
|
|
||
| ## GlmImageTransformer2DModel | ||
|
|
||
| [[autodoc]] GlmImageTransformer2DModel |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,95 @@ | ||
| <!--Copyright 2025 The HuggingFace Team. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| --> | ||
|
|
||
| # GLM-Image | ||
|
|
||
| ## Overview | ||
|
|
||
| GLM-Image is an image generation model that adopts a hybrid autoregressive + diffusion decoder architecture, pushing the upper bound of visual fidelity and fine-grained detail. In general image generation quality, it is on par with industry-standard LDM-based approaches, while showing significant advantages in knowledge-intensive image generation scenarios. | ||
|
|
||
| Model architecture: a hybrid autoregressive + diffusion decoder design. | ||
|
|
||
| + Autoregressive generator: a 9B-parameter model initialized from [GLM-4-9B-0414](https://huggingface.co/zai-org/GLM-4-9B-0414), with an expanded vocabulary to incorporate visual tokens. The model first generates a compact encoding of approximately 256 tokens, then expands to 1K–4K tokens, corresponding to 1K–2K high-resolution image outputs. The AR model is available as the `GlmImageForConditionalGeneration` class in the `transformers` library. | ||
| + Diffusion Decoder: a 7B-parameter decoder based on a single-stream DiT architecture for latent-space image decoding. It is equipped with a Glyph Encoder text module, significantly improving accurate text rendering within images. | ||
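The token counts quoted above are consistent with a simple grid calculation. The following is a back-of-the-envelope sketch, not code from the model: it assumes that each token of the expanded sequence covers a 32×32 pixel patch, which is inferred from the `32 * 32`-style sizes in the usage examples rather than stated in this document. The helper name `visual_token_count` is hypothetical.

```python
def visual_token_count(height: int, width: int, patch: int = 32) -> int:
    """Number of tokens in a (height / patch) x (width / patch) grid.

    ASSUMPTION: one token per `patch` x `patch` pixel block. The value 32 is
    inferred from the example image sizes, not read from the model config.
    """
    if height % patch or width % patch:
        raise ValueError(f"height and width must be multiples of {patch}")
    return (height // patch) * (width // patch)


# A 1K image (1024 x 1024) maps to a 32 x 32 grid.
print(visual_token_count(1024, 1024))  # 1024
# A 2K image (2048 x 2048) maps to a 64 x 64 grid.
print(visual_token_count(2048, 2048))  # 4096
```

Under this assumption, a 1K image needs about 1K tokens and a 2K image about 4K tokens, which matches the 1K–4K range given for the autoregressive generator.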
|
|
||
| Post-training with decoupled reinforcement learning: the model introduces a fine-grained, modular feedback strategy using the GRPO algorithm, substantially enhancing both semantic understanding and visual detail quality. | ||
|
|
||
| + Autoregressive module: provides low-frequency feedback signals focused on aesthetics and semantic alignment, improving instruction following and artistic expressiveness. | ||
| + Decoder module: delivers high-frequency feedback targeting detail fidelity and text accuracy, resulting in highly realistic textures, lighting, and color reproduction, as well as more precise text rendering. | ||
|
|
||
| GLM-Image supports both text-to-image and image-to-image generation within a single model: | ||
|
|
||
| + Text-to-image: generates high-detail images from textual descriptions, with particularly strong performance in information-dense scenarios. | ||
| + Image-to-image: supports a wide range of tasks, including image editing, style transfer, multi-subject consistency, and identity-preserving generation for people and objects. | ||
|
|
||
| This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The codebase can be found [here](https://huggingface.co/zai-org/GLM-Image). | ||
|
|
||
| ## Usage examples | ||
|
|
||
| ### Text to Image Generation | ||
|
|
||
| ```python | ||
| import torch | ||
| from diffusers.pipelines.glm_image import GlmImagePipeline | ||
|
|
||
| pipe = GlmImagePipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16, device_map="cuda") | ||
| prompt = "A beautifully designed modern food magazine style dessert recipe illustration, themed around a raspberry mousse cake. The overall layout is clean and bright, divided into four main areas: the top left features a bold black title 'Raspberry Mousse Cake Recipe Guide', with a soft-lit close-up photo of the finished cake on the right, showcasing a light pink cake adorned with fresh raspberries and mint leaves; the bottom left contains an ingredient list section, titled 'Ingredients' in a simple font, listing 'Flour 150g', 'Eggs 3', 'Sugar 120g', 'Raspberry puree 200g', 'Gelatin sheets 10g', 'Whipping cream 300ml', and 'Fresh raspberries', each accompanied by minimalist line icons (like a flour bag, eggs, sugar jar, etc.); the bottom right displays four equally sized step boxes, each containing high-definition macro photos and corresponding instructions, arranged from top to bottom as follows: Step 1 shows a whisk whipping white foam (with the instruction 'Whip egg whites to stiff peaks'), Step 2 shows a red-and-white mixture being folded with a spatula (with the instruction 'Gently fold in the puree and batter'), Step 3 shows pink liquid being poured into a round mold (with the instruction 'Pour into mold and chill for 4 hours'), Step 4 shows the finished cake decorated with raspberries and mint leaves (with the instruction 'Decorate with raspberries and mint'); a light brown information bar runs along the bottom edge, with icons on the left representing 'Preparation time: 30 minutes', 'Cooking time: 20 minutes', and 'Servings: 8'. The overall color scheme is dominated by creamy white and light pink, with a subtle paper texture in the background, featuring compact and orderly text and image layout with clear information hierarchy." | ||
| image = pipe( | ||
| prompt=prompt, | ||
| height=32 * 32, | ||
| width=36 * 32, | ||
| num_inference_steps=30, | ||
| guidance_scale=1.5, | ||
| generator=torch.Generator(device="cuda").manual_seed(42), | ||
| ).images[0] | ||
|
|
||
| image.save("output_t2i.png") | ||
| ``` | ||
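The examples in this document pass `height` and `width` as multiples of 32. If a target size comes from elsewhere (for example, the dimensions of an input photo), a small helper can snap it to the nearest multiple first. This is a minimal sketch: `snap_to_multiple` is a hypothetical helper, not part of diffusers, and whether the pipeline strictly requires multiples of 32 is an assumption based on the examples here.

```python
def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Round `value` to the nearest positive multiple of `multiple`."""
    if value <= 0 or multiple <= 0:
        raise ValueError("value and multiple must be positive")
    # Round half up, and never return 0 for very small inputs.
    return max(multiple, ((value + multiple // 2) // multiple) * multiple)


# ASSUMPTION: 32 is the required granularity, inferred from the examples.
height = snap_to_multiple(1030)  # 1024
width = snap_to_multiple(1150)   # 1152
print(height, width)
```

The resulting values can then be passed as `height` and `width` in the pipeline call.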
|
|
||
| ### Image to Image Generation | ||
|
|
||
| ```python | ||
| import torch | ||
| from diffusers.pipelines.glm_image import GlmImagePipeline | ||
| from PIL import Image | ||
|
|
||
| pipe = GlmImagePipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16, device_map="cuda") | ||
| image_path = "cond.jpg" | ||
| prompt = "Replace the background of the snow forest with an underground station featuring an automatic escalator." | ||
| image = Image.open(image_path).convert("RGB") | ||
| image = pipe( | ||
| prompt=prompt, | ||
| image=[image], # can input multiple images for multi-image-to-image generation such as [image, image1] | ||
| height=33 * 32, | ||
| width=32 * 32, | ||
| num_inference_steps=30, | ||
| guidance_scale=1.5, | ||
| generator=torch.Generator(device="cuda").manual_seed(42), | ||
| ).images[0] | ||
|
|
||
| image.save("output_i2i.png") | ||
| ``` | ||
|
|
||
| + The AR model used in GLM-Image is configured with `do_sample=True` and a temperature of `0.95` by default, so the generated images can vary significantly across runs. We do not recommend setting `do_sample=False`, as this may lead to incorrect or degenerate outputs from the AR model. | ||
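To see why sampling makes outputs vary from run to run, here is a toy illustration of temperature sampling versus greedy decoding over a single set of logits. It uses only the Python standard library and made-up logits. It is not GLM-Image's decoding code, and `sample_token` is a hypothetical helper.

```python
import math
import random


def sample_token(logits, temperature=0.95, do_sample=True, rng=None):
    """Pick a token index from raw logits.

    do_sample=False -> greedy: always the highest-scoring index.
    do_sample=True  -> draw from softmax(logits / temperature).
    """
    if not do_sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.5, 0.1]  # made-up scores for four candidate tokens
rng = random.Random(0)
greedy = {sample_token(logits, do_sample=False) for _ in range(20)}
sampled = {sample_token(logits, rng=rng) for _ in range(200)}
print(greedy)        # {0}: greedy decoding always picks the same token
print(sorted(sampled))  # sampling reaches several different tokens
```

Each sampled token feeds into the next step of an autoregressive model, so small differences early in the sequence compound, which is one reason two runs with the same prompt can produce visibly different images.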
|
|
||
| ## GlmImagePipeline | ||
|
|
||
| [[autodoc]] GlmImagePipeline | ||
| - all | ||
| - __call__ | ||
|
|
||
| ## GlmImagePipelineOutput | ||
|
|
||
| [[autodoc]] pipelines.glm_image.pipeline_output.GlmImagePipelineOutput |