Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, Chen Change Loy
Harmon is a novel unified framework for multimodal understanding and generation. Unlike existing state-of-the-art architectures that disentangle visual understanding and generation with separate encoder models, the proposed framework harmonizes the visual representations of understanding and generation via a shared MAR encoder. Harmon achieves advanced generation performance on mainstream text-to-image generation benchmarks and exhibits competitive results on multimodal understanding tasks. In this repo, we provide inference code to run Harmon for image understanding (image-to-text) and text-to-image generation, with two model variants, Harmon-0.5B and Harmon-1.5B.
Task | Status |
---|---|
🛠️ Inference Code & Model Checkpoints | ✅ Released |
🌐 Project Page | ✅ Finished |
🤗 Online Demo | ✅ Finished |
🔄 Finetuning Code | ✅ Released |
The codebase requires the following dependencies:

```text
mmengine
transformers==4.45.2
timm==0.9.12
flash_attn==2.3.4
```
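These can be installed with `pip`; a minimal sketch (the exact steps for `flash_attn` may vary with your CUDA toolchain):

```bash
# Install the pinned dependencies listed above.
pip install mmengine transformers==4.45.2 timm==0.9.12

# flash_attn compiles a CUDA extension; on some setups it needs
# a matching CUDA toolkit and --no-build-isolation.
pip install flash_attn==2.3.4
```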
Download the model checkpoints from 🤗 wusize/harmon and organize them as follows:
```text
Harmon/
├── checkpoints
│   ├── kl16.ckpt
│   ├── harmon_0.5b.pth
│   └── harmon_1.5b.pth
```
It is recommended to download the checkpoints with the following command:

```bash
# pip install -U "huggingface_hub[cli]"
huggingface-cli download wusize/harmon --local-dir checkpoints --repo-type model
```
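The same download can also be scripted from Python with the `huggingface_hub` library, if you prefer (a minimal sketch mirroring the CLI command above):

```python
from huggingface_hub import snapshot_download

# Download all files from the wusize/harmon model repo into the
# local `checkpoints` directory, mirroring the CLI command above.
snapshot_download(
    repo_id="wusize/harmon",
    repo_type="model",
    local_dir="checkpoints",
)
```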
To run image understanding (image-to-text), use the following command:

```bash
export PYTHONPATH=./:$PYTHONPATH
python scripts/image2text.py configs/models/qwen2_5_1_5b_kl16_mar_h.py \
    --checkpoint checkpoints/harmon_1.5b.pth --image_size 512 \
    --image data/view.jpg --prompt "Describe the image in detail."
```
You can generate images from text prompts using the following command:

```bash
export PYTHONPATH=./:$PYTHONPATH
python scripts/text2image.py configs/models/qwen2_5_1_5b_kl16_mar_h.py \
    --checkpoint checkpoints/harmon_1.5b.pth --image_size 512 \
    --prompt 'a dog on the left and a cat on the right.' --output output.jpg
```
To generate images in batch from prompts listed in a JSON file, run:

```bash
export PYTHONPATH=./:$PYTHONPATH
accelerate launch scripts/batch_text2image.py configs/models/qwen2_5_1_5b_kl16_mar_h.py \
    --checkpoint checkpoints/harmon_1.5b.pth --image_size 512 \
    --data path/to/xxx.json --output output --batch_size 4 --grid_size 2
```
The JSON file should look like:

```json
[
    {
        "prompt": "a dog on the left and a cat on the right."
    }
]
```
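Such a prompt file can also be generated programmatically; a minimal sketch (the example prompts and the `prompts.json` filename are illustrative):

```python
import json

# Illustrative prompts; each entry becomes one {"prompt": ...} record
# in the format expected by scripts/batch_text2image.py.
prompts = [
    "a dog on the left and a cat on the right.",
    "a red bicycle leaning against a brick wall.",
]

with open("prompts.json", "w") as f:
    json.dump([{"prompt": p} for p in prompts], f, indent=4)
```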
We have also converted our models to the Hugging Face format. You can load Harmon models directly using the `transformers` library:

```python
from transformers import AutoTokenizer, AutoModel

# trust_remote_code=True is required because Harmon ships custom model code.
harmon_tokenizer = AutoTokenizer.from_pretrained("wusize/Harmon-0_5B",
                                                 trust_remote_code=True)
harmon_model = AutoModel.from_pretrained("wusize/Harmon-0_5B",
                                         trust_remote_code=True).eval().cuda().bfloat16()
```
For more information on the usage of the HF-based models, refer to the model cards of the variants below:

Model Variant | LLM | MAR | Hugging Face Hub |
---|---|---|---|
Harmon-0.5B | Qwen2.5-0.5B-Instruct | MAR-Base | |
Harmon-1.5B | Qwen2.5-1.5B-Instruct | MAR-Huge | |
For instructions on how to finetune Harmon models on your custom datasets, please refer to our detailed guide in FINETUNE.md.
If you find Harmon useful for your research or applications, please cite our paper using the following BibTeX:
```bibtex
@misc{wu2025harmon,
      title={Harmonizing Visual Representations for Unified Multimodal Understanding and Generation},
      author={Size Wu and Wenwei Zhang and Lumin Xu and Sheng Jin and Zhonghua Wu and Qingyi Tao and Wentao Liu and Wei Li and Chen Change Loy},
      year={2025},
      eprint={2503.21979},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.21979},
}
```
This project is licensed under the NTU S-Lab License 1.0.
The project builds upon the following open-source efforts: