Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, Chen Change Loy
Harmon is a novel unified framework for multimodal understanding and generation. Unlike existing state-of-the-art architectures that disentangle visual understanding and generation with separate encoder models, the proposed framework harmonizes the visual representations of understanding and generation via a shared MAR encoder. Harmon achieves advanced generation performance on mainstream text-to-image generation benchmarks and exhibits competitive results on multimodal understanding tasks. In this repo, we provide inference code to run Harmon for image understanding (image-to-text) and text-to-image generation, with two model variants, Harmon-0.5B and Harmon-1.5B.
Task | Status |
---|---|
🛠️ Inference Code & Model Checkpoints | ✅ Released |
🌐 Project Page | ✅ Finished |
🤗 Online Demo | ✅ Finished |
🔄 Finetuning Code | ✅ Released |
The codebase requires the following dependencies:

```text
mmengine
transformers==4.45.2
timm==0.9.12
flash_attn==2.3.4
```
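These can be installed with `pip`; a minimal sketch (the exact steps for `flash_attn` may vary with your CUDA toolchain):

```bash
# Install the pinned dependencies listed above.
pip install mmengine transformers==4.45.2 timm==0.9.12

# flash_attn compiles a CUDA extension; on some setups it needs
# a matching CUDA toolkit and --no-build-isolation.
pip install flash_attn==2.3.4
```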
Download the model checkpoints from 🤗 wusize/harmon and organize them as follows:
```text
Harmon/
├── checkpoints
│   ├── kl16.ckpt
│   ├── harmon_0.5b.pth
│   └── harmon_1.5b.pth
```
It is recommended to download the checkpoints with the following command:

```bash
# pip install -U "huggingface_hub[cli]"
huggingface-cli download wusize/harmon --local-dir checkpoints --repo-type model
```
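The same download can also be scripted from Python with the `huggingface_hub` library, if you prefer (a minimal sketch mirroring the CLI command above):

```python
from huggingface_hub import snapshot_download

# Download all files from the wusize/harmon model repo into the
# local `checkpoints` directory, mirroring the CLI command above.
snapshot_download(
    repo_id="wusize/harmon",
    repo_type="model",
    local_dir="checkpoints",
)
```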
To run image understanding (image-to-text), use the following command:

```bash
export PYTHONPATH=./:$PYTHONPATH
python scripts/image2text.py configs/models/qwen2_5_1_5b_kl16_mar_h.py \
    --checkpoint checkpoints/harmon_1.5b.pth --image_size 512 \
    --image data/view.jpg --prompt "Describe the image in detail."
```
You can generate images from text prompts using the following command:

```bash
export PYTHONPATH=./:$PYTHONPATH
python scripts/text2image.py configs/models/qwen2_5_1_5b_kl16_mar_h.py \
    --checkpoint checkpoints/harmon_1.5b.pth --image_size 512 \
    --prompt 'a dog on the left and a cat on the right.' --output output.jpg
```
To generate images in batch from prompts listed in a JSON file, run:

```bash
export PYTHONPATH=./:$PYTHONPATH
accelerate launch scripts/batch_text2image.py configs/models/qwen2_5_1_5b_kl16_mar_h.py \
    --checkpoint checkpoints/harmon_1.5b.pth --image_size 512 \
    --data path/to/xxx.json --output output --batch_size 4 --grid_size 2
```
The JSON file should look like:

```json
[
    {
        "prompt": "a dog on the left and a cat on the right."
    }
]
```
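Such a prompt file can also be generated programmatically; a minimal sketch (the example prompts and the `prompts.json` filename are illustrative):

```python
import json

# Illustrative prompts; each entry becomes one {"prompt": ...} record
# in the format expected by scripts/batch_text2image.py.
prompts = [
    "a dog on the left and a cat on the right.",
    "a red bicycle leaning against a brick wall.",
]

with open("prompts.json", "w") as f:
    json.dump([{"prompt": p} for p in prompts], f, indent=4)
```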
We have also converted our models to the Hugging Face format. You can load Harmon models directly using the `transformers` library:

```python
from transformers import AutoTokenizer, AutoModel

# trust_remote_code=True is required because Harmon ships custom model code.
harmon_tokenizer = AutoTokenizer.from_pretrained("wusize/Harmon-0_5B",
                                                 trust_remote_code=True)
harmon_model = AutoModel.from_pretrained("wusize/Harmon-0_5B",
                                         trust_remote_code=True).eval().cuda().bfloat16()
```
For more information on the usage of the HF-based models, refer to the model cards of the variants below:

Model Variant | LLM | MAR | Hugging Face Hub |
---|---|---|---|
Harmon-0.5B | Qwen2.5-0.5B-Instruct | MAR-Base | |
Harmon-1.5B | Qwen2.5-1.5B-Instruct | MAR-Huge | |
For instructions on how to finetune Harmon models on your custom datasets, please refer to our detailed guide in FINETUNE.md.
If you find Harmon useful for your research or applications, please cite our paper using the following BibTeX:
```bibtex
@misc{wu2025harmon,
      title={Harmonizing Visual Representations for Unified Multimodal Understanding and Generation},
      author={Size Wu and Wenwei Zhang and Lumin Xu and Sheng Jin and Zhonghua Wu and Qingyi Tao and Wentao Liu and Wei Li and Chen Change Loy},
      year={2025},
      eprint={2503.21979},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.21979},
}
```
This project is licensed under the NTU S-Lab License 1.0.
The project builds upon the following open-source efforts: