Xiangyu Zhao*, Shengyuan Ding*, Zicheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, Hua Yang, Haodong Duan, Kai Chen
- [2025/02] Our Paper, OmniAlign-V-SFT, OmniAlign-V-DPO, and [Checkpoints](https://huggingface.co/collections/PhoenixZ/omnialign-v-67b591ac7aaae267db319971) are all released.
- [2025/02] Our MM-AlignBench is now supported in VLMEvalKit, enabling quick and efficient evaluation of MLLMs.
In this work, we introduce three key contributions: the OmniAlign-V SFT dataset, the OmniAlign-V-DPO dataset, and MM-AlignBench:
- OmniAlign-V SFT Dataset: An SFT dataset designed to improve the alignment of Multi-modal Large Language Models (MLLMs) with human preferences. It contains 205k high-quality Image-Question-Answer pairs, featuring open-ended, creative questions and long, knowledge-rich, comprehensive answers.
- OmniAlign-V-DPO Dataset: A specialized dataset for Direct Preference Optimization (DPO). It leverages the answers from the OmniAlign-V SFT dataset as positive samples and generates negative samples using LLaVANext-InternLM-7B with rejection sampling.
- MM-AlignBench: A benchmark for evaluating MLLMs' alignment with human preferences. It includes 252 high-quality, human-annotated samples with diverse image types and open-ended questions. Modeled after Arena-style benchmarks, it uses GPT-4o as the judge model and Claude-3-Sonnet as the reference model.
Our OmniAlign-V SFT dataset not only significantly improves the alignment of MLLMs with human preferences, but also boosts the performance of MLLMs on common downstream tasks, particularly on benchmarks like MMVet and MMMU.
By incorporating a DPO stage using our OmniAlign-V-DPO dataset, we achieve even better alignment with human preferences. Notably, our LLaVANext-OA-32B model, built on the Qwen2.5-32B-Instruct foundation, surpasses Qwen2VL-72B on the MM-AlignBench.
MM-AlignBench is now supported in VLMEvalKit, a powerful toolkit for evaluating over 200 MLLMs across various benchmarks. For more details, check out the VLMEvalKit repository.
- It is recommended to build a Python-3.10 virtual environment using conda:

  conda create --name omnialign-env python=3.10 -y
  conda activate omnialign-env

- Install XTuner from source:

  git clone https://github.com/PhoenixZ810/OmniAlign-V.git
  cd OmniAlign-V
  pip install -e '.[all]'
We conduct our experiments with `transformers==4.37.2`, `torch==2.1.3`, `cuda==12.1`, and `flash_attn==2.5.5`. It is recommended to use the same versions to avoid potential issues.
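As a quick sanity check (not part of the repo), you can print the installed versions and compare them with the ones listed above:

```bash
# Print the installed versions of the key dependencies for comparison
# with the versions recommended above.
python -c "import transformers; print('transformers:', transformers.__version__)"
python -c "import torch; print('torch:', torch.__version__, '| cuda:', torch.version.cuda)"
python -c "import flash_attn; print('flash_attn:', flash_attn.__version__)"
```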
- Pretrain Data: We use ALLaVA-pretrain and LLaVA-pretrain-558k as our pretrain data.
- SFT Data: We use the multi-modal data in LLaVA-Next-SFT-778K together with the OmniAlign-V-SFT dataset in the SFT stage.
- DPO Data: We only use OmniAlign-V-DPO in the DPO stage.
If you want to use OmniAlign-V in the SFT and DPO stages, please organize the data in the structure below (a download sketch follows the tree):
- playground
- data
- OmniAlign_V
- images
- knowledge
- knowledge_1.jpg
...
- OmniAlign_V_DPO
- images
...
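As a rough sketch of fetching the data into this layout with `huggingface-cli` (the dataset repo IDs below are assumptions; check the Hugging Face collection linked in the news section for the exact names):

```bash
# Sketch only: download the OmniAlign-V data into the expected layout.
# The dataset repo IDs are assumptions -- verify them on the Hugging Face collection page.
mkdir -p playground/data
huggingface-cli download PhoenixZ/OmniAlign-V --repo-type dataset \
    --local-dir playground/data/OmniAlign_V
huggingface-cli download PhoenixZ/OmniAlign-V-DPO --repo-type dataset \
    --local-dir playground/data/OmniAlign_V_DPO
```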
Our codebase utilizes `meta_path.json` to flexibly load different kinds of data. You can modify `meta_path.json` to load the data, for example:
{
"LLaVANext": {
"root": "PATH_TO_DATA",
"annotation": "PATH_TO_ANNOTATION",
"data_augment": false,
"repeat_time": 1,
"length": LENGTH,
"data_type": "llava-next"
},
"OmniAlign_V_knowledge": {
"root": "playground/data",
"annotation": "playground/data/OmniAlign_V/knowledge.jsonl",
"data_augment": false,
"repeat_time": 1,
"length": 40813,
"data_type": "knowledge"
},
"OmniAlign_V_inferential": {
"root": "playground/data",
"annotation": "playground/data/OmniAlign_V/inferential.jsonl",
"data_augment": false,
"repeat_time": 1,
"length": 37117,
"data_type": "inferential"
},
...
}
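Before launching training, it can help to verify that every entry in `meta_path.json` points to an existing annotation file and that the recorded `length` matches the number of lines in that file. A minimal sketch, assuming `jq` is installed (this helper is not part of the repo):

```bash
# Check that each annotation file listed in meta_path.json exists and that
# its line count matches the recorded "length" field.
jq -r 'to_entries[] | "\(.key) \(.value.annotation) \(.value.length)"' meta_path.json |
while read -r name ann len; do
    if [ ! -f "$ann" ]; then
        echo "[$name] missing annotation file: $ann"
    else
        actual=$(wc -l < "$ann" | tr -d ' ')
        [ "$actual" = "$len" ] || echo "[$name] length mismatch: recorded $len, found $actual"
    fi
done
```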
Our checkpoints are available at HuggingFace ModelZoo.
- LLaVANext-OmniAlign-7B is based on InternLM2.5-7B-chat.
- LLaVANext-OmniAlign-32B is based on Qwen2.5-32B-Instruct.

We employ CLIP-Large-336 as the visual encoder. You should download both the LLM and CLIP checkpoints before training.
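For reference, a minimal download sketch using `huggingface-cli` (the Hub repo IDs are the standard public ones; the local directories are placeholders, so adjust them to your setup):

```bash
# Download the base LLM (pick the one matching the model you plan to train)
# and the CLIP visual encoder; target directories are placeholders.
huggingface-cli download internlm/internlm2_5-7b-chat --local-dir checkpoints/internlm2_5-7b-chat
huggingface-cli download Qwen/Qwen2.5-32B-Instruct --local-dir checkpoints/Qwen2.5-32B-Instruct
huggingface-cli download openai/clip-vit-large-patch14-336 --local-dir checkpoints/clip-vit-large-patch14-336
```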
Before each training stage, you should modify the `meta-data`, `model_path`, and `name` settings accordingly.
Pretrain Stage
- Use this command to pretrain the model:
bash scripts/pretrain.sh
SFT Stage
Our code supports starting SFT training and evaluation with a single command by integrating VLMEvalKit into our repo.
- Specifically, the `run.py` file from VLMEvalKit has been modified and renamed to `eval_run.py` in this repository.
- To use this feature, users must clone and install VLMEvalKit:
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
- Once installed, the following command starts both training and evaluation:
bash scripts/llavanext_anyres/sft_AR4_llavanext.sh
- More details on using VLMEvalKit can be found in the VLMEvalKit repo.
DPO Stage
- Similar to the SFT stage, DPO training and evaluation can be started with a single command:
bash scripts/dpo/dpo_anyres.sh
Evaluation Only
- If you just want to evaluate the model, you can use the following command:
# YOUR_MODEL_NAME is only used to create the save directory.
torchrun --nproc_per_node 8 \
    eval_run.py \
    --data MMAlignBench \
    --model YOUR_MODEL_NAME \
    --path PATH_TO_CHECKPOINT \
    --reuse
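Since evaluation goes through VLMEvalKit, other supported benchmarks can be passed to `--data` in the same way. For example (the dataset identifiers follow VLMEvalKit's naming and should be verified in its documentation):

```bash
# Evaluate the same checkpoint on MMVet and the MMMU dev/val split
# (dataset names follow VLMEvalKit's conventions).
torchrun --nproc_per_node 8 \
    eval_run.py \
    --data MMVet MMMU_DEV_VAL \
    --model YOUR_MODEL_NAME \
    --path PATH_TO_CHECKPOINT \
    --reuse
```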
If you find OmniAlign-V useful, please cite using this BibTeX:
@article{zhao2025omnialignvenhancedalignmentmllms,
title={OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference},
author={Xiangyu Zhao and Shengyuan Ding and Zicheng Zhang and Haian Huang and Maosong Cao and Weiyun Wang and Jiaqi Wang and Xinyu Fang and Wenhai Wang and Guangtao Zhai and Haodong Duan and Hua Yang and Kai Chen},
journal={arXiv preprint arXiv:2502.18411},
year={2025}
}
- LLaVA: Base model structure.
- InternVL: InternVL structure.
- VLMEvalKit: Evaluation tool.