Official implementation of 'MonoDINO-DETR: Depth-Enhanced Monocular 3D Object Detection Using a Vision Foundation Model'
- [2025-01] The initial code of MonoDINO-DETR on KITTI is released! 💥
MonoDINO-DETR is a monocular 3D object detection model and the first to employ a vision foundation model backbone within a DETR architecture. By leveraging the DINOv2 backbone's ability to extract generalized features across diverse vision tasks, it improves performance in both depth and visual feature extraction. For visual feature extraction, the model integrates a Hierarchical Feature Fusion Block for multi-scale detection and 6D Dynamic Anchor Boxes for iterative bounding box refinement, achieving enhanced performance without relying on additional data sources such as LiDAR. For depth features, DINOv2 is paired with a DPT head based on the architecture of Depth Anything V2, a large-scale pre-trained model for relative depth estimation. By utilizing the pre-trained weights of Depth Anything V2 along with an additional CNN layer, the model achieves better absolute depth estimation, which in turn enhances 3D object detection performance.
- Overall Structure of MonoDINO-DETR
- Overall Structure of Feature Extraction Module
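To make the description above concrete, here is a minimal PyTorch sketch of the two-branch feature extraction idea: a visual branch that fuses backbone features into a multi-scale pyramid, and a depth branch with a DPT-style head plus an extra CNN that maps relative depth toward absolute depth. The module names, layer choices, and shapes below are illustrative assumptions, not the repository's actual implementation (the real model uses a pre-trained DINOv2 ViT and Depth Anything V2 weights).

```python
# Illustrative sketch only: placeholder modules standing in for DINOv2,
# the Hierarchical Feature Fusion Block, and the DPT-based depth head.
import torch
import torch.nn as nn


class HierarchicalFeatureFusionBlock(nn.Module):
    """Projects ViT patch features into a small multi-scale pyramid (strides assumed)."""

    def __init__(self, embed_dim=768, out_dim=256):
        super().__init__()
        self.up2 = nn.ConvTranspose2d(embed_dim, out_dim, kernel_size=2, stride=2)
        self.keep = nn.Conv2d(embed_dim, out_dim, kernel_size=1)
        self.down2 = nn.Conv2d(embed_dim, out_dim, kernel_size=3, stride=2, padding=1)

    def forward(self, feat):                    # feat: (B, C, H/14, W/14) patch grid
        return [self.up2(feat), self.keep(feat), self.down2(feat)]


class DepthHead(nn.Module):
    """DPT-style relative-depth decoder followed by an extra CNN for absolute depth."""

    def __init__(self, embed_dim=768):
        super().__init__()
        self.relative = nn.Sequential(          # stand-in for the DPT decoder
            nn.Conv2d(embed_dim, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 1, 1),
        )
        self.to_metric = nn.Sequential(         # additional CNN mapping relative -> absolute depth
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, feat):
        rel = self.relative(feat)
        return self.to_metric(rel), rel


class FeatureExtractionSketch(nn.Module):
    def __init__(self, embed_dim=768, patch=14):
        super().__init__()
        # Placeholder patch embedding standing in for the pre-trained DINOv2 ViT backbone.
        self.backbone = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        self.fusion = HierarchicalFeatureFusionBlock(embed_dim)
        self.depth_head = DepthHead(embed_dim)

    def forward(self, image):                   # image: (B, 3, H, W), H and W divisible by 14
        tokens = self.backbone(image)           # (B, C, H/14, W/14) patch features
        visual_feats = self.fusion(tokens)      # multi-scale features for the DETR decoder
        abs_depth, rel_depth = self.depth_head(tokens)
        return visual_feats, abs_depth, rel_depth


if __name__ == "__main__":
    model = FeatureExtractionSketch()
    feats, abs_d, rel_d = model(torch.randn(1, 3, 378, 1246))
    print([f.shape for f in feats], abs_d.shape, rel_d.shape)
```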
For training, we used 4 NVIDIA TITAN RTX GPUs for 195 epochs with a batch size of 8.
Table Description:
- Mod. indicates the moderate difficulty level.
- Bold numbers highlight the best results, underlined numbers indicate the second-best results, and blue numbers represent the improvement over the second-best results.
- Since CaDDN uses a substantial amount of GPU memory, the batch size is set to 2 per GPU across 4 GPUs for CaDDN, and 8 for other models.
| Method | Extra Data | Val AP3D (Easy) | Val AP3D (Mod.) | Val AP3D (Hard) | Time (ms) |
| --- | --- | --- | --- | --- | --- |
| CaDDN (CVPR 2021) | LiDAR | 21.91 | 15.28 | 13.66 | - |
| MonoDTR (CVPR 2022) | LiDAR | 23.92 | 18.76 | 15.81 | - |
| MonoGround (CVPR 2022) | None | 19.78 | 14.46 | 12.42 | 42 |
| MonoDETR (ICCV 2023) | None | 24.29 | 17.52 | 15.28 | 23 |
| MonoCD (CVPR 2024) | Planes | 21.39 | 15.86 | 13.09 | 35 |
| MonoDINO-DETR | None | <ins>26.72</ins> | <ins>19.19</ins> | <ins>15.92</ins> | 66 |
| MonoDINO-DETR + DAB | None | **27.93** | **19.39** | **15.97** | 74 |
- Clone this project and create a conda environment:

  ```bash
  git clone git@github.com:JihyeokKim/MonoDINO-DETR.git
  cd MonoDINO-DETR
  conda create -n monodinodetr python=3.8
  conda activate monodinodetr
  ```
- Install PyTorch and torchvision matching your CUDA version:

  ```bash
  conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
  ```
- Install the requirements and compile the deformable attention ops:

  ```bash
  pip install -r requirements.txt
  cd lib/models/monodinodetr/ops/
  bash make.sh
  cd ../../../..
  ```
- Make a directory for saving training losses:

  ```bash
  mkdir logs
  ```
- Download the KITTI dataset and prepare the directory structure as:

  ```
  │MonoDINO-DETR/
  ├──...
  ├──data/KITTIDataset/
  │   ├──ImageSets/
  │   ├──training/
  │   ├──testing/
  ├──...
  ```

  You can also change the data path at "dataset/root_dir" in `configs/monodinodetr.yaml`.
- Download the pre-trained Depth Anything V2 model into `checkpoints`.
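Before training, it can help to confirm that the folders above are in place. The short Python check below is not part of this repository; the paths simply mirror the directory tree and the `checkpoints` and `logs` folders mentioned in these steps, so adjust them if you changed "dataset/root_dir".

```python
# Optional sanity check (illustrative, not part of the repo): verify the expected
# dataset, checkpoint, and log directories before launching training.
from pathlib import Path

root = Path(".")  # run this from the MonoDINO-DETR repository root
expected = [
    root / "data/KITTIDataset/ImageSets",
    root / "data/KITTIDataset/training",
    root / "data/KITTIDataset/testing",
    root / "checkpoints",   # should contain the Depth Anything V2 weights
    root / "logs",
]

for path in expected:
    status = "ok" if path.is_dir() else "MISSING"
    print(f"{status:7s} {path}")

# Warn if the checkpoint folder exists but no weights were downloaded yet.
if (root / "checkpoints").is_dir() and not any((root / "checkpoints").iterdir()):
    print("checkpoints/ exists but is empty; download the Depth Anything V2 weights first.")
```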
You can modify the model and training settings in `configs/monodinodetr.yaml` and indicate the GPU(s) in `train.sh`:

```bash
bash train.sh configs/monodinodetr.yaml > logs/monodinodetr.log
```
You can also train the model with multiple GPUs:

```bash
bash train.sh configs/monodinodetr.yaml --batch_size 32 --num_gpus 4 > logs/monodinodetr_multi_gpu32.log
```
The best checkpoint is evaluated by default. You can change it at "tester/checkpoint" in `configs/monodinodetr.yaml`:

```bash
bash test.sh configs/monodinodetr.yaml
```
This repo benefits from the excellent MonoDETR and Depth-Anything-V2.
If you find this project useful, please consider citing:
```bibtex
@misc{kim2025monodinodetrdepthenhancedmonocular3d,
      title={MonoDINO-DETR: Depth-Enhanced Monocular 3D Object Detection Using a Vision Foundation Model},
      author={Jihyeok Kim and Seongwoo Moon and Sungwon Nah and David Hyunchul Shim},
      year={2025},
      eprint={2502.00315},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.00315},
}
```
If you have any questions about this project, please feel free to contact hanaone110@gmail.com.