This repository contains SoTA algorithms, models, and interesting projects in the area of multimodal understanding and content generation.
ONE is short for "ONE for all".
- [2025.04.10] We release v0.3.0. More than 15 SoTA generative models are added, including Flux, CogView4, Open Sora 2.0, Movie Gen 30B, and CogVideoX 5B~30B. Have fun!
- [2025.02.21] We support DeepSeek Janus-Pro, a SoTA multimodal understanding and generation model. See here.
- [2024.11.06] v0.2.0 is released.
To install v0.3.0, first install MindSpore 2.5.0, then run:

```shell
pip install mindone
```

Alternatively, to install the latest version from the master branch, run:

```shell
git clone https://github.com/mindspore-lab/mindone.git
cd mindone
pip install -e .
```
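After installation, a quick way to confirm the MindSpore backend is set up correctly is MindSpore's own built-in self-test (this uses `mindspore.run_check()`, which ships with MindSpore itself and is independent of mindone):

```python
import mindspore

# MindSpore's built-in installation self-test: prints the installed
# version and runs a small computation on the configured device.
mindspore.run_check()
```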
We support state-of-the-art diffusion models for generating images, audio, and video. Let's get started using Stable Diffusion 3 as an example.
Hello MindSpore from Stable Diffusion 3!
```python
import mindspore
from mindone.diffusers import StableDiffusion3Pipeline

# Load the SD3 medium pipeline with float16 weights.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    mindspore_dtype=mindspore.float16,
)

prompt = "A cat holding a sign that says 'Hello MindSpore'"
image = pipe(prompt)[0][0]
image.save("sd3.png")
```
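The call signature follows HF diffusers, so the usual generation parameters carry over. A minimal sketch, assuming the standard diffusers arguments (`negative_prompt`, `num_inference_steps`, `guidance_scale`) are mirrored here as in the rest of the API:

```python
# Reuses `pipe` and `prompt` from the example above. The keyword
# arguments mirror HF diffusers' StableDiffusion3Pipeline call signature
# (an assumption based on the stated diffusers compatibility below).
image = pipe(
    prompt,
    negative_prompt="blurry, low quality",
    num_inference_steps=28,  # SD3's default sampling budget
    guidance_scale=7.0,      # classifier-free guidance strength
)[0][0]
image.save("sd3_tuned.png")
```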
- mindone diffusers is under active development; most tasks were tested with MindSpore 2.5.0 on Ascend Atlas 800T A2 machines.
- compatible with HF diffusers 0.32.2

component | features |
---|---|
pipeline | supports 160+ pipelines covering text-to-image, text-to-video, and text-to-audio tasks |
models | supports 50+ autoencoder & transformer base models, same as HF diffusers |
schedulers | supports 35+ diffusion schedulers (e.g., DDPM and DPM-Solver), same as HF diffusers |
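Since schedulers keep the HF diffusers interface, they can be swapped on a loaded pipeline in the familiar way. A minimal sketch, assuming `DPMSolverMultistepScheduler` is exported by `mindone.diffusers` as it is in HF diffusers:

```python
from mindone.diffusers import DPMSolverMultistepScheduler

# Swap the pipeline's default scheduler for DPM-Solver++, reusing the
# existing scheduler configuration (the standard HF diffusers idiom).
# Reuses `pipe` and `prompt` from the quickstart example above.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe(prompt, num_inference_steps=20)[0][0]  # solver needs fewer steps
```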
task | model | inference | finetune | pretrain | institute |
---|---|---|---|---|---|
Image-to-Video | hunyuanvideo-i2v 🔥🔥 | ✅ | ✖️ | ✖️ | Tencent |
Text/Image-to-Video | wan2.1 🔥🔥🔥 | ✅ | ✖️ | ✖️ | Alibaba |
Text-to-Image | cogview4 🔥🔥🔥 | ✅ | ✖️ | ✖️ | Zhipu AI |
Text-to-Video | step_video_t2v 🔥🔥 | ✅ | ✖️ | ✖️ | StepFun |
Image-Text-to-Text | qwen2_vl 🔥🔥🔥 | ✅ | ✖️ | ✖️ | Alibaba |
Any-to-Any | janus 🔥🔥🔥 | ✅ | ✅ | ✅ | DeepSeek |
Any-to-Any | emu3 🔥🔥 | ✅ | ✅ | ✅ | BAAI |
Class-to-Image | var 🔥🔥 | ✅ | ✅ | ✅ | ByteDance |
Text/Image-to-Video | hpcai open sora 1.2/2.0 🔥🔥 | ✅ | ✅ | ✅ | HPC-AI Tech |
Text/Image-to-Video | cogvideox 1.5 5B~30B 🔥🔥 | ✅ | ✅ | ✅ | Zhipu AI |
Text-to-Video | open sora plan 1.3 🔥🔥 | ✅ | ✅ | ✅ | PKU |
Text-to-Video | hunyuanvideo 🔥🔥 | ✅ | ✅ | ✅ | Tencent |
Text-to-Video | movie gen 30B 🔥🔥 | ✅ | ✅ | ✅ | Meta |
Video-Encode-Decode | magvit | ✅ | ✅ | ✅ | |
Text-to-Image | story_diffusion | ✅ | ✖️ | ✖️ | ByteDance |
Image-to-Video | dynamicrafter | ✅ | ✖️ | ✖️ | Tencent |
Video-to-Video | venhancer | ✅ | ✖️ | ✖️ | Shanghai AI Lab |
Text-to-Video | t2v_turbo | ✅ | ✅ | ✅ | |
Image-to-Video | svd | ✅ | ✅ | ✅ | Stability AI |
Text-to-Video | animate diff | ✅ | ✅ | ✅ | CUHK |
Text/Image-to-Video | video composer | ✅ | ✅ | ✅ | Alibaba |
Text-to-Image | flux 🔥 | ✅ | ✅ | ✖️ | Black Forest Labs |
Text-to-Image | stable diffusion 3 🔥 | ✅ | ✅ | ✖️ | Stability AI |
Text-to-Image | kohya_sd_scripts | ✅ | ✅ | ✖️ | kohya |
Text-to-Image | stable diffusion xl | ✅ | ✅ | ✅ | Stability AI |
Text-to-Image | stable diffusion | ✅ | ✅ | ✅ | Stability AI |
Text-to-Image | hunyuan_dit | ✅ | ✅ | ✅ | Tencent |
Text-to-Image | pixart_sigma | ✅ | ✅ | ✅ | Huawei |
Text-to-Image | fit | ✅ | ✅ | ✅ | Shanghai AI Lab |
Class-to-Video | latte | ✅ | ✅ | ✅ | Shanghai AI Lab |
Class-to-Image | dit | ✅ | ✅ | ✅ | Meta |
Text-to-Image | t2i-adapter | ✅ | ✅ | ✅ | Shanghai AI Lab |
Text-to-Image | ip adapter | ✅ | ✅ | ✅ | Tencent |
Text-to-3D | mvdream | ✅ | ✅ | ✅ | ByteDance |
Image-to-3D | instantmesh | ✅ | ✅ | ✅ | Tencent |
Image-to-3D | sv3d | ✅ | ✅ | ✅ | Stability AI |
Text/Image-to-3D | hunyuan3d-1.0 | ✅ | ✅ | ✅ | Tencent |

task | model | inference | finetune | pretrain | features |
---|---|---|---|---|---|
Image-Text-to-Text | pllava 🔥 | ✅ | ✖️ | ✖️ | supports video and image captioning |