
MonoDINO-DETR: Depth-Enhanced Monocular 3D Object Detection Using a Vision Foundation Model

Official implementation of 'MonoDINO-DETR: Depth-Enhanced Monocular 3D Object Detection Using a Vision Foundation Model'

News

  • [2025-01] The initial code of MonoDINO-DETR on KITTI is released! 💥

Introduction

MonoDINO-DETR is a monocular 3D object detection model and the first to employ a vision foundation model backbone within a DETR architecture. By leveraging the DINOv2 backbone's ability to extract generalized features across diverse vision tasks, it improves performance in both depth and visual feature extraction.

For visual features, the model integrates a Hierarchical Feature Fusion Block for multi-scale detection and 6D Dynamic Anchor Boxes for iterative bounding box refinement, achieving enhanced performance without relying on additional data sources such as LiDAR. For depth features, DINOv2 is paired with a DPT head based on the architecture of Depth Anything V2, a large-scale pre-trained model for relative depth estimation. By combining the pre-trained weights of Depth Anything V2 with an additional CNN layer, the model achieves better absolute depth estimation, which in turn improves 3D object detection performance.

  • Overall Structure of MonoDINO-DETR
  • Overall Structure of Feature Extraction Module
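
The following is a minimal sketch of the two-branch idea described above, not the official module: a shared DINOv2 backbone feeds a visual-feature branch for the DETR head and a depth branch in which a stand-in for the DPT head predicts relative depth that an extra CNN layer refines toward absolute depth. It assumes the DINOv2 ViT-B/14 backbone from torch.hub; all class and layer names are hypothetical, and the real repository's wiring differs:

    # Illustrative sketch only -- hypothetical names, heavily simplified wiring.
    import torch
    import torch.nn as nn

    class TwoBranchExtractorSketch(nn.Module):
        def __init__(self, embed_dim: int = 768):
            super().__init__()
            # DINOv2 ViT-B/14 via torch.hub; the repo may load its backbone differently.
            self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
            # Stand-in for the Depth Anything V2 DPT head (relative depth estimation).
            self.depth_head = nn.Sequential(
                nn.Conv2d(embed_dim, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, 1, kernel_size=1),
            )
            # The "additional CNN layer" mapping relative depth toward absolute depth.
            self.abs_depth_refiner = nn.Conv2d(1, 1, kernel_size=3, padding=1)

        def forward(self, images: torch.Tensor):
            # images: (B, 3, H, W) with H and W divisible by the ViT patch size (14).
            b, _, h, w = images.shape
            tokens = self.backbone.forward_features(images)["x_norm_patchtokens"]
            # Reshape patch tokens into a 2D feature map of size (H/14, W/14).
            feats = tokens.transpose(1, 2).reshape(b, -1, h // 14, w // 14)
            rel_depth = self.depth_head(feats)             # relative-depth branch
            abs_depth = self.abs_depth_refiner(rel_depth)  # refined absolute depth
            return feats, abs_depth                        # consumed by the detection head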

Main Results

For training, we used 4 NVIDIA TITAN RTX GPUs, training for 195 epochs with a batch size of 8.

Comparison of our model with state-of-the-art models on the KITTI val set for the car class.

Table description:

  • Mod. indicates the moderate difficulty level.
  • Bold numbers highlight the best results and underlined numbers the second-best; in the paper's table, blue numbers represent the improvement over them.
  • Since CaDDN uses a substantial amount of GPU memory, its batch size is set to 2 per GPU across 4 GPUs; other models use a batch size of 8.

| Method | Extra Data | AP3D (Easy) | AP3D (Mod.) | AP3D (Hard) | Time (ms) |
| --- | --- | --- | --- | --- | --- |
| CaDDN (CVPR 2021) | LiDAR | 21.91 | 15.28 | 13.66 | - |
| MonoDTR (CVPR 2022) | LiDAR | 23.92 | 18.76 | 15.81 | - |
| MonoGround (CVPR 2022) | None | 19.78 | 14.46 | 12.42 | 42 |
| MonoDETR (ICCV 2023) | None | 24.29 | 17.52 | 15.28 | 23 |
| MonoCD (CVPR 2024) | Planes | 21.39 | 15.86 | 13.09 | 35 |
| MonoDINO-DETR | None | <ins>26.72</ins> | <ins>19.19</ins> | <ins>15.92</ins> | 66 |
| MonoDINO-DETR + DAB | None | **27.93** | **19.39** | **15.97** | 74 |

Installation

  1. Clone this project and create a conda environment:

    git clone git@github.com:JihyeokKim/MonoDINO-DETR.git
    cd MonoDINO-DETR
    
    conda create -n monodinodetr python=3.8
    conda activate monodinodetr
    
  2. Install PyTorch and torchvision matching your CUDA version:

    conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
  3. Install requirements and compile the deformable attention ops (a quick build check is sketched after this list):

    pip install -r requirements.txt
    
    cd lib/models/monodinodetr/ops/
    bash make.sh
    
    cd ../../../..
    
  4. Create a directory for saving the training logs:

    mkdir logs
    
  5. Download the KITTI dataset and arrange the directory structure as follows:

    │MonoDINO-DETR/
    ├──...
    ├──data/KITTIDataset/
    │   ├──ImageSets/
    │   ├──training/
    │   ├──testing/
    ├──...
    

    You can also change the data path at "dataset/root_dir" in configs/monodinodetr.yaml (see the config sketch after this list).

  6. Download the pre-trained Depth Anything V2 model and place it in the checkpoints/ directory.
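
After step 3, you can optionally sanity-check the compiled ops. The module name below follows the upstream Deformable-DETR ops package that MonoDETR-style repositories inherit, so treat it as an assumption for this repo:

    # Assumed extension name from the upstream Deformable-DETR ops package.
    python -c "import MultiScaleDeformableAttention; print('deformable attention ops OK')"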
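
For reference, the two configuration keys mentioned in this README live in configs/monodinodetr.yaml. The sketch below shows only those keys; the surrounding structure and the example values are assumptions, so match them against the actual file:

    # configs/monodinodetr.yaml (sketch; only these two keys are confirmed by this README)
    dataset:
      root_dir: 'data/KITTIDataset'     # data path from step 5 above (assumed value)
    tester:
      checkpoint: <path/to/checkpoint>  # checkpoint evaluated by test.sh (see Test below)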

Get Started

Train

You can modify the model and training settings in configs/monodinodetr.yaml and specify the GPU in train.sh:

bash train.sh configs/monodinodetr.yaml > logs/monodinodetr.log

You can train the model with multiple GPUs:

bash train.sh configs/monodinodetr.yaml --batch_size 32 --num_gpus 4 > logs/monodinodetr_multi_gpu32.log
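
Since the commands above redirect stdout to a log file, you can follow training progress from another shell with standard tools:

    tail -f logs/monodinodetr.log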

Test

The best checkpoint is evaluated by default. You can change it at "tester/checkpoint" in configs/monodinodetr.yaml (see the config sketch in the Installation section):

bash test.sh configs/monodinodetr.yaml

Acknowledgment

This repo benefits from the excellent MonoDETR and Depth-Anything-V2.

Citation

If you find this project useful, please consider citing:

@misc{kim2025monodinodetrdepthenhancedmonocular3d,
      title={MonoDINO-DETR: Depth-Enhanced Monocular 3D Object Detection Using a Vision Foundation Model}, 
      author={Jihyeok Kim and Seongwoo Moon and Sungwon Nah and David Hyunchul Shim},
      year={2025},
      eprint={2502.00315},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.00315}, 
}

Contact

If you have any questions about this project, please feel free to contact hanaone110@gmail.com.
