
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

[2025/02/27] 🔥 MM-RLHF is now fully supported by SWIFT. Simply convert your data with scripts/convert_to_swift.py and run scripts/swift.sh to get started.

[2025/02/10] 🔥 We are proud to open-source MM-RLHF, a comprehensive project for aligning Multimodal Large Language Models (MLLMs) with human preferences. This release includes:

  • A high-quality MLLM alignment dataset.
  • A strong Critique-Based MLLM reward model and its training algorithm.
  • A novel alignment algorithm MM-DPO.
  • Two new benchmarks.

Our dataset and algorithms enable consistent performance improvements across 10 dimensions and 27 benchmarks for open-source MLLMs.

Key Components

1. MM-RLHF Dataset (data.jsonl in MM-RLHF Data)

  • 20k instructions covering image understanding, video understanding, and safety-related tasks.
  • Each instruction includes 3-5 model-generated responses, along with human-annotated scores, rankings, and fine-grained textual feedback.
  • 80k comparison pairs derived from ranked samples for each instruction, suitable for RLHF training.
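For orientation, a single annotated instruction in data.jsonl might look roughly like the record below. The field names are illustrative guesses, not the released schema; inspect the downloaded data.jsonl for the exact keys.

```python
# Hypothetical layout of one data.jsonl record (field names are illustrative;
# check the released file for the real schema).
example_record = {
    "question": "What is the person in the image doing?",
    "image": "short/0001.jpg",
    "responses": [
        "The person is riding a bicycle.",
        "The person is jogging through a park.",
        "There is no person in the image.",
    ],
    "human_scores": [8, 4, 1],      # per-response quality scores
    "human_ranking": [0, 1, 2],     # response indices, best -> worst
    "feedback": [
        "Accurate and concise.",
        "Misreads the activity in the scene.",
        "Hallucinates the absence of the person.",
    ],
}
```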

2. Critique-Based MLLM Reward Model

  • We release the MM-RLHF-Reward-7B, a Critique-Based Reward Model that generates critiques of candidate texts before assigning scores, offering enhanced interpretability and more informative feedback.
  • Includes the training algorithm for the reward model, enabling researchers to reproduce and extend our work.
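At inference time, a critique-based reward model scores a response in two stages: it first writes a critique, then assigns a score conditioned on that critique. The sketch below illustrates the idea only; `generate` is a hypothetical placeholder for an MLLM generation call, not the released MM-RLHF-Reward-7B interface.

```python
# Conceptual two-stage scoring sketch (illustration only, not the released code).
# `generate(image, prompt) -> str` is a hypothetical MLLM generation callable.

def critique_based_reward(generate, image, question, response):
    base_prompt = (
        f"Question: {question}\n"
        f"Candidate answer: {response}\n"
    )
    # Stage 1: the reward model critiques the candidate response.
    critique = generate(
        image,
        base_prompt + "Critique the answer's correctness, helpfulness, and safety.",
    )
    # Stage 2: a scalar score is produced conditioned on the critique,
    # which is what makes the final reward interpretable.
    score_text = generate(
        image,
        base_prompt + f"Critique: {critique}\nOverall score (1-10):",
    )
    return critique, float(score_text.strip())
```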

3. MM-DPO Algorithm

  • Complete training code for MM-DPO, a novel alignment algorithm that achieves significant performance gains with simple adjustments to the DPO framework.

4. MM-RLHF Benchmarks

  • MM-RLHF-RewardBench: Evaluates the quality of reward models.
  • MM-RLHF-SafetyBench: Focuses on MLLM safety, including tasks like adversarial attacks, red teaming, jailbreaking, and harmful content detection.

Models & Scripts

Installation

1. Clone this repository and navigate to the MM-RLHF folder:

git clone https://github.com/yfzhang114/MM-RLHF
cd MM-RLHF

Install the inference package:

conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

2. Data Preparation

Download 📊 MM-RLHF Data, unzip the image and video datasets, and the final structure should look like this:

MM-RLHF
├── long
├── mcq
├── safety
├── short
├── data.jsonl
└── dpo_pairs.jsonl

Here, data.jsonl contains all labeled information, and dpo_pairs.jsonl contains the comparison pairs of differently ranked responses used for subsequent DPO and reward-model training. The other folders contain the images and video frames.
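The comparison pairs are derived from the per-instruction rankings; one way to form them is to pair each higher-ranked response with each lower-ranked one, as sketched below. The field names here are assumptions (the released dpo_pairs.jsonl is the authoritative source); only the file path follows the layout shown above.

```python
from itertools import combinations

def pairs_from_ranking(record):
    """Turn one ranked record into chosen/rejected pairs.

    Field names ("responses", "human_ranking", ...) are assumptions,
    not the released schema.
    """
    ranked = record["human_ranking"]      # response indices, best -> worst
    responses = record["responses"]
    return [
        {
            "question": record["question"],
            "chosen": responses[better],
            "rejected": responses[worse],
        }
        for better, worse in combinations(ranked, 2)
    ]

# Quick sanity check on the released pair file (path as laid out above):
with open("MM-RLHF/dpo_pairs.jsonl") as f:
    print(sum(1 for _ in f), "comparison pairs")
```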

3. Critique-Based MLLM Reward Model:

In the training script, specify the model used to learn the critic and set the reward loss weights, e.g., critic_rewards_weight and float_rewards_weight.

sh scripts/train/critic_reward_7b.sh
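The two weights roughly correspond to the critique (text) objective and the scalar-score objective of the reward model. A minimal sketch of how such a combined loss could look follows; the exact formulation used by scripts/train/critic_reward_7b.sh may differ.

```python
import torch.nn.functional as F

def reward_training_loss(critique_logits, critique_labels,
                         predicted_scores, target_scores,
                         critic_rewards_weight=1.0, float_rewards_weight=1.0):
    """Sketch of a weighted critique + scalar-score objective (an assumption,
    not the repository's exact implementation)."""
    # Language-modeling loss on the generated critique tokens.
    critique_loss = F.cross_entropy(
        critique_logits.view(-1, critique_logits.size(-1)),
        critique_labels.view(-1),
        ignore_index=-100,
    )
    # Regression loss on the predicted scalar reward scores.
    score_loss = F.mse_loss(predicted_scores, target_scores)
    return critic_rewards_weight * critique_loss + float_rewards_weight * score_loss
```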

4. MM-DPO Training:

Step 1: Precompute Logits with Reference Model

To save GPU memory during DPO training, precompute logits using the reference model. Specify DATA_PATH and OUTPUT_DATA_PATH in the script.

sh scripts/train/generate_ref_logits.sh

The script writes an output file in which each entry gains fields such as "reference_chosen_logp" and "reference_rejected_logp".
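Conceptually, the script scores each chosen and rejected response once under the frozen reference model and stores the summed log-probability of the response tokens. A simplified sketch of that computation, assuming a generic causal-LM interface rather than the repository's exact code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def response_logp(model, input_ids, labels):
    """Sum of log-probs the model assigns to the response tokens.

    `labels` mirrors `input_ids` with prompt/image positions set to -100, so
    only response tokens contribute. Simplified sketch, not the repo code.
    """
    logits = model(input_ids=input_ids).logits[:, :-1, :]   # predict tokens 1..T-1
    targets = labels[:, 1:]
    mask = targets != -100
    token_logps = torch.gather(
        F.log_softmax(logits, dim=-1), 2,
        targets.clamp(min=0).unsqueeze(-1),
    ).squeeze(-1)
    # Stored per pair as "reference_chosen_logp" / "reference_rejected_logp".
    return (token_logps * mask).sum(-1)
```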

Step 2: Run DPO Algorithm with Precomputed Logits

sh scripts/train/dpo_ov7b.sh
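For reference, the precomputed log-probabilities enter the standard DPO objective as the frozen reference terms. The sketch below is the vanilla DPO loss only; MM-DPO adds its own adjustments on top of this form, which live in the training code rather than in this snippet.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             reference_chosen_logp, reference_rejected_logp, beta=0.1):
    """Vanilla DPO loss using the precomputed reference log-probs.
    MM-DPO builds on this with further adjustments not sketched here."""
    policy_logratio = policy_chosen_logp - policy_rejected_logp
    reference_logratio = reference_chosen_logp - reference_rejected_logp
    return -F.logsigmoid(beta * (policy_logratio - reference_logratio)).mean()
```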

5. Evaluation

For alignment models, the evaluation code is available in the mmrlhf-eval repository. This code provides various evaluation tasks that test the alignment capabilities of your model across multiple benchmarks. These benchmarks include standard tests for model robustness, safety, and hallucination handling in multimodal contexts.

For reward models, we offer the MM-RLHF-RewardBench (available on Hugging Face at MM-RLHF-RewardBench) for detailed evaluation. To perform the evaluation, download the required images and the mm_reward_bench.jsonl file from the repository and place them in the appropriate directories. Then, follow these steps:

  1. Download the dataset and necessary files:

    • Ensure the image data and the mm_reward_bench.jsonl file are placed in the designated folder (path_to_data/).
  2. Run the reward model evaluation: evaluate your reward model on the benchmark with the following command:

    python llava/eval/eval_mm_reward_bench.py --model-path your_reward_model --question-file path_to_data/mm_reward_bench.jsonl --answers-file your_answer_file
  3. Calculate performance metrics: once the answer file has been generated, compute your reward model's performance with the following command:

    python llava/eval/cal_performance_mmreward_bench.py --input_file your_answer_file
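Reward-model benchmarks of this kind are typically reported as the fraction of comparisons where the model scores the human-preferred response higher. If you want to inspect results yourself, a hedged sketch under assumed answer-file fields (the actual schema is defined by llava/eval/eval_mm_reward_bench.py, and cal_performance_mmreward_bench.py is the authoritative metric code) might look like:

```python
import json

# "chosen_score" / "rejected_score" are assumed field names, not the real schema.
correct = total = 0
with open("your_answer_file") as f:
    for line in f:
        row = json.loads(line)
        correct += row["chosen_score"] > row["rejected_score"]
        total += 1
print(f"pairwise accuracy: {correct / total:.3f}")
```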

Citation

If you find MM-RLHF useful for your research and applications, please cite our paper using this BibTeX:

@article{zhang2025mm,
  title={MM-RLHF: The Next Step Forward in Multimodal LLM Alignment},
  author={Zhang, Yi-Fan and Yu, Tao and Tian, Haochen and Fu, Chaoyou and Li, Peiyan and Zeng, Jianshu and Xie, Wulin and Shi, Yang and Zhang, Huanyu and Wu, Junkang and others},
  journal={arXiv preprint arXiv:2502.10391},
  year={2025}
}
