Solve Visual Understanding with Reinforced VLMs (bringing DeepSeek's R1 method from the pure-text domain to the vision-language domain, and opening up new possibilities for multimodal research!)

VLM-R1: A stable and generalizable R1-style Large Vision-Language Model

Since the introduction of DeepSeek-R1, numerous works have emerged that focus on reproducing and improving upon it. In this project, we propose VLM-R1, a stable and generalizable R1-style Large Vision-Language Model.

Specifically, for the task of Referring Expression Comprehension (REC), we trained Qwen2.5-VL using both R1 and SFT approaches. The results reveal that, on the in-domain test data, the performance of the SFT model is slightly lower than that of the R1 model (as shown at the top of the figure above). However, on the out-of-domain test data, the SFT model’s performance deteriorates significantly as the number of steps increases, while the R1 model shows a steady improvement (as shown at the bottom of the figure above).

Setup

conda create -n vlm-r1 python=3.10
conda activate vlm-r1
bash setup.sh
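
After setup completes, a quick sanity check of the environment can save a failed launch later. The following is a minimal sketch, assuming setup.sh installs the packages the GRPO command below relies on (torch, transformers, flash_attn, deepspeed); the script name check_env.py is hypothetical and not part of the repository.

# check_env.py -- quick environment sanity check (hypothetical helper, not part of the repo)
import importlib

# Packages assumed to be installed by setup.sh; adjust the list if your setup differs.
for name in ("torch", "transformers", "flash_attn", "deepspeed"):
    try:
        module = importlib.import_module(name)
        print(f"{name}: {getattr(module, '__version__', 'unknown version')}")
    except ImportError:
        print(f"{name}: NOT installed")

try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    pass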

Training

Referring Expression Comprehension (REC)

GRPO

  1. Download the COCO Train2014 images and unzip them; we refer to the image directory as <your_image_root>.
  2. Download the RefCOCO/+/g and RefGTA annotation files and unzip them (RefGTA is used for out-of-domain evaluation).
  3. Write the paths of the annotation files into the src/open-r1-multimodal/data_config/rec.yaml file (a path sanity-check sketch follows the launch command below):
datasets:
    - json_path: /path/to/refcoco_train.json
    - json_path: /path/to/refcocop_train.json
    - json_path: /path/to/refcocog_train.json
  4. Run bash src/open-r1-multimodal/run_grpo_rec.sh, which executes the following:
cd src/open-r1-multimodal
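# RUN_NAME must be set in your shell before launching if you invoke torchrun directly;
# it is referenced by --output_dir and --run_name below.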

torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12346" \
    src/open_r1/grpo_rec.py \
    --deepspeed local_scripts/zero3.json \
    --output_dir output/$RUN_NAME \
    --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
    --dataset_name data_config/rec.yaml \
    --image_root <your_image_root> \
    --max_prompt_length 1024 \
    --num_generations 8 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --torch_dtype bfloat16 \
    --data_seed 42 \
    --report_to wandb \
    --gradient_checkpointing false \
    --attn_implementation flash_attention_2 \
    --num_train_epochs 2 \
    --run_name $RUN_NAME \
    --save_steps 100 \
    --save_only_model true
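
Before kicking off training, it can be worth confirming that every annotation path listed in rec.yaml actually exists. Below is a minimal sketch under the assumption that rec.yaml follows the datasets/json_path layout from step 3; the script name check_rec_yaml.py is hypothetical and PyYAML is required.

# check_rec_yaml.py -- verify that rec.yaml points at existing annotation files (hypothetical helper)
import os
import sys
import yaml  # requires PyYAML

config_path = sys.argv[1] if len(sys.argv) > 1 else "data_config/rec.yaml"
with open(config_path) as f:
    config = yaml.safe_load(f)

# Expecting the structure shown in step 3: a top-level "datasets" list of {json_path: ...} entries.
missing = [entry["json_path"] for entry in config.get("datasets", [])
           if not os.path.isfile(entry["json_path"])]

if missing:
    print("Missing annotation files:")
    for path in missing:
        print(" -", path)
    sys.exit(1)
print("All annotation paths exist.")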


SFT

We use LLaMA-Factory to train the SFT model.

  1. Clone the LLaMA-Factory repository and install the dependencies.
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
  2. Download the dataset_info.json, mllm_rec_json.json, and qwen2_5_vl_full_sft.yaml we provided here. Put the JSON files in the LLaMA-Factory/data directory and the YAML file in the LLaMA-Factory/examples/train_full directory (a quick inspection sketch follows the training command below).
  3. Run the following command to train the SFT model.
llamafactory-cli train examples/train_full/qwen2_5_vl_full_sft.yaml
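
To sanity-check the downloaded annotations, you can peek at mllm_rec_json.json before training. This is a minimal sketch that assumes the file is a standard JSON list of record dictionaries; the exact field names are not assumed here.

# inspect_sft_data.py -- peek at the SFT annotation file (hypothetical helper, not part of the repo)
import json

# Path after copying the file into LLaMA-Factory/data (assumed location).
with open("data/mllm_rec_json.json") as f:
    records = json.load(f)

print("number of records:", len(records))
if isinstance(records, list) and records and isinstance(records[0], dict):
    # Print the keys of the first record so you can confirm the fields match
    # what dataset_info.json declares for this dataset.
    print("first record keys:", list(records[0].keys()))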

Evaluation


  1. Download the provided RefGTA images.
  2. Run the evaluation scripts:
cd ./src/eval

# Remember to change the model path, image root, and annotation path in the script
python test_rec_r1.py # for GRPO
python test_rec_baseline.py # for SFT
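
For reference, REC accuracy is conventionally computed by counting a prediction as correct when the IoU between the predicted and ground-truth boxes reaches 0.5. The sketch below only illustrates that computation; it is not the repository's evaluation code, and it assumes boxes are given in [x1, y1, x2, y2] format.

# rec_accuracy.py -- IoU-based accuracy for referring expression comprehension (illustrative sketch)
from typing import List, Sequence

def iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(predictions: List[Sequence[float]],
                 ground_truths: List[Sequence[float]],
                 threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the ground-truth box reaches the threshold."""
    correct = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

# Example: one correct and one incorrect prediction -> accuracy 0.5
print(rec_accuracy([[10, 10, 50, 50], [0, 0, 5, 5]],
                   [[12, 8, 48, 52], [100, 100, 150, 150]]))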

Acknowledgements

We would like to express our sincere gratitude to DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal, R1-V, RefCOCO, and RefGTA for providing open-source resources that contributed to the development of this project.

Citation

If you find this project useful, please consider citing us:

@misc{shen2025vlmr1,
  author       = {Shen, Haozhan and Zhang, Zilun and Zhang, Qianqian and Xu, Ruochen and Zhao, Tiancheng},
  title        = {VLM-R1: A stable and generalizable R1-style Large Vision-Language Model},
  howpublished = {\url{https://github.com/om-ai-lab/VLM-R1}},
  note         = {Accessed: 2025-02-15},
  year         = {2025}
}
