This repository contains SoTA algorithms, models, and interesting projects in the area of multimodal understanding and content generation.
ONE is short for "ONE for all".
- [2025.04.10] We release MindONE v0.3.0. More than 15 SoTA generative models are added, including Flux, CogView4, Open-Sora 2.0, Movie Gen 30B, and CogVideoX 5B~30B. Have fun!
- [2025.02.21] We support DeepSeek Janus-Pro, a SoTA multimodal understanding and generation model. See here.
- [2024.11.06] MindONE v0.2.0 is released.
To install MindONE v0.3.0, first install MindSpore 2.5.0, then run `pip install mindone`.

Alternatively, to install the latest version from the master branch, run:
```shell
git clone https://github.com/mindspore-lab/mindone.git
cd mindone
pip install -e .
```
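To verify the installation, a minimal sanity check can help (a sketch: `mindspore.run_check()` is MindSpore's built-in environment check, while the `__version__` attribute on `mindone` is an assumption here, hence the guarded access):

```python
import mindspore
import mindone

# Built-in MindSpore self-check: verifies the installed package can reach its backend.
mindspore.run_check()

# Assumed attribute for illustration; should print the installed MindONE version (e.g., 0.3.0).
print(getattr(mindone, "__version__", "unknown"))
```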
We support state-of-the-art diffusion models for generating images, audio, and video. Let's get started using Stable Diffusion 3 as an example.
Hello MindSpore from Stable Diffusion 3!
```python
import mindspore
from mindone.diffusers import StableDiffusion3Pipeline

# Load the Stable Diffusion 3 pipeline in half precision.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    mindspore_dtype=mindspore.float16,
)
prompt = "A cat holding a sign that says 'Hello MindSpore'"
# The pipeline returns a tuple of outputs; the first element holds the generated images.
image = pipe(prompt)[0][0]
image.save("sd3.png")
```
- mindone diffusers is under active development; most tasks were tested with MindSpore 2.5.0 on Ascend Atlas 800T A2 machines.
- compatible with HF diffusers 0.32.2
| component | features |
| --- | --- |
| pipelines | 160+ pipelines supporting text-to-image, text-to-video, and text-to-audio tasks |
| models | 50+ autoencoder & transformer base models, same as HF diffusers |
| schedulers | 35+ diffusion schedulers (e.g., DDPM and DPM-Solver), same as HF diffusers |
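Since the schedulers follow the HF diffusers convention, swapping one in should work the same way. A minimal sketch, assuming the mindone API matches upstream (the SD 1.5 checkpoint name below is illustrative):

```python
from mindone.diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

# Load any diffusers-format checkpoint (illustrative choice below).
pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5")

# Replace the default scheduler with DPM-Solver++ while reusing its configuration,
# mirroring the HF diffusers scheduler-swap idiom.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe("a photo of an astronaut riding a horse", num_inference_steps=20)[0][0]
image.save("sd15_dpm.png")
```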
| task | model | inference | finetune | pretrain | institute |
| --- | --- | --- | --- | --- | --- |
| Image-to-Video | hunyuanvideo-i2v 🔥🔥 | ✅ | ✖️ | ✖️ | Tencent |
| Text/Image-to-Video | wan2.1 🔥🔥🔥 | ✅ | ✖️ | ✖️ | Alibaba |
| Text-to-Image | cogview4 🔥🔥🔥 | ✅ | ✖️ | ✖️ | Zhipu AI |
| Text-to-Video | step_video_t2v 🔥🔥 | ✅ | ✖️ | ✖️ | StepFun |
| Image-Text-to-Text | qwen2_vl 🔥🔥🔥 | ✅ | ✖️ | ✖️ | Alibaba |
| Any-to-Any | janus 🔥🔥🔥 | ✅ | ✅ | ✅ | DeepSeek |
| Any-to-Any | emu3 🔥🔥 | ✅ | ✅ | ✅ | BAAI |
| Class-to-Image | var 🔥🔥 | ✅ | ✅ | ✅ | ByteDance |
| Text/Image-to-Video | hpcai open sora 1.2/2.0 🔥🔥 | ✅ | ✅ | ✅ | HPC-AI Tech |
| Text/Image-to-Video | cogvideox 1.5 5B~30B 🔥🔥 | ✅ | ✅ | ✅ | Zhipu AI |
| Text-to-Video | open sora plan 1.3 🔥🔥 | ✅ | ✅ | ✅ | PKU |
| Text-to-Video | hunyuanvideo 🔥🔥 | ✅ | ✅ | ✅ | Tencent |
| Text-to-Video | movie gen 30B 🔥🔥 | ✅ | ✅ | ✅ | Meta |
| Video-Encode-Decode | magvit | ✅ | ✅ | ✅ | |
| Text-to-Image | story_diffusion | ✅ | ✖️ | ✖️ | ByteDance |
| Image-to-Video | dynamicrafter | ✅ | ✖️ | ✖️ | Tencent |
| Video-to-Video | venhancer | ✅ | ✖️ | ✖️ | Shanghai AI Lab |
| Text-to-Video | t2v_turbo | ✅ | ✅ | ✅ | |
| Image-to-Video | svd | ✅ | ✅ | ✅ | Stability AI |
| Text-to-Video | animate diff | ✅ | ✅ | ✅ | CUHK |
| Text/Image-to-Video | video composer | ✅ | ✅ | ✅ | Alibaba |
| Text-to-Image | flux 🔥 | ✅ | ✅ | ✖️ | Black Forest Labs |
| Text-to-Image | stable diffusion 3 🔥 | ✅ | ✅ | ✖️ | Stability AI |
| Text-to-Image | kohya_sd_scripts | ✅ | ✅ | ✖️ | kohya |
| Text-to-Image | stable diffusion xl | ✅ | ✅ | ✅ | Stability AI |
| Text-to-Image | stable diffusion | ✅ | ✅ | ✅ | Stability AI |
| Text-to-Image | hunyuan_dit | ✅ | ✅ | ✅ | Tencent |
| Text-to-Image | pixart_sigma | ✅ | ✅ | ✅ | Huawei |
| Text-to-Image | fit | ✅ | ✅ | ✅ | Shanghai AI Lab |
| Class-to-Video | latte | ✅ | ✅ | ✅ | Shanghai AI Lab |
| Class-to-Image | dit | ✅ | ✅ | ✅ | Meta |
| Text-to-Image | t2i-adapter | ✅ | ✅ | ✅ | Shanghai AI Lab |
| Text-to-Image | ip adapter | ✅ | ✅ | ✅ | Tencent |
| Text-to-3D | mvdream | ✅ | ✅ | ✅ | ByteDance |
| Image-to-3D | instantmesh | ✅ | ✅ | ✅ | Tencent |
| Image-to-3D | sv3d | ✅ | ✅ | ✅ | Stability AI |
| Text/Image-to-3D | hunyuan3d-1.0 | ✅ | ✅ | ✅ | Tencent |
| task | model | inference | finetune | pretrain | features |
| --- | --- | --- | --- | --- | --- |
| Image-Text-to-Text | pllava 🔥 | ✅ | ✖️ | ✖️ | supports video and image captioning |