This repository provides a script and recipe to train the SSD300 model to state-of-the-art accuracy. The model architecture code comes from NVIDIA (NVIDIA SSD PyTorch), and the method is described in the paper SSD: Single Shot MultiBox Detector.
Copied from NVIDIA SSD PyTorch:
The following features are supported by this model.
| Feature | SSD300 PyTorch |
|---|---|
| AMP | Yes |
| APEX DDP | Yes |
AMP is an abbreviation used for automatic mixed precision training.
DDP stands for DistributedDataParallel and is used for multi-GPU training.
NVIDIA DALI - DALI is a library that accelerates the data preparation pipeline. To accelerate your input pipeline, you only need to define your data loader with the DALI library. For details, see the example sources in this repo or the DALI documentation.
- Clone the repository.
git clone
git clone
Download and preprocess the dataset.
The SSD model was trained on the COCO 2017 dataset. You can download the dataset from COCO Download.
NOTE: Organize the dataset root as follows:
└── COCO
    ├── images
    │   ├── train2017: All training images (118287 images)
    │   └── val2017: All validation images (5000 images)
    ├── annotations
    │   ├── instances_train2017.json
    │   ├── instances_val2017.json
    │   ├── captions_train2017.json
    │   ├── captions_val2017.json
    │   ├── person_keypoints_train2017.json
    │   └── person_keypoints_val2017.json
    └── coco_labels.txt
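Before training, it can help to confirm that the dataset root matches the layout above. The following is a minimal sketch, not part of this repository: the helper name `missing_entries` is hypothetical, only the detection annotations are checked, and it assumes `coco_labels.txt` sits directly under the `COCO` root.

```python
from pathlib import Path

# Entries expected relative to the dataset root, taken from the tree above.
# Assumption: coco_labels.txt lives directly under the root directory.
EXPECTED = [
    "images/train2017",
    "images/val2017",
    "annotations/instances_train2017.json",
    "annotations/instances_val2017.json",
    "coco_labels.txt",
]


def missing_entries(root):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the dataset root as the first argument.
    dataset_root = sys.argv[1] if len(sys.argv) > 1 else "COCO"
    missing = missing_entries(dataset_root)
    if missing:
        print(f"Missing under {dataset_root}:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("Dataset layout looks complete.")
```

Running the script with the dataset root as its argument lists any missing entries.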
Config Setting
Set the config in the
Train the model. (Optional: you can download the pretrained checkpoint instead.)
- Single GPU
sh scripts/
- Multi GPU
sh scripts/
Evaluate the model on the COCO dataset.
Just run the
You can download the Checkpoint in the
I have also uploaded them to Google Drive; you can download them from
(Apologies, Baidu Yun is too slow.)
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.250
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.424
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.255
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.074
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.268
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.400
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.237
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.344
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.359
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.116
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.392
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.550
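The IoU values in the metrics above are intersection-over-union thresholds: a detection counts as a match at threshold t only if its IoU with a ground-truth box is at least t, and `IoU=0.50:0.95` averages over thresholds from 0.50 to 0.95 in steps of 0.05. As a rough illustration of the quantity being thresholded, here is a minimal sketch. It assumes boxes in `(left, top, right, bottom)` corner format, which differs from the `[x, y, width, height]` format used in COCO annotation files, and the function name `iou` is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    # Corners of the overlapping rectangle, if any.
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[2], box_b[2])
    bottom = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes have zero intersection.
    inter = max(0.0, right - left) * max(0.0, bottom - top)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes overlapping by half: intersection 50, union 150.
prediction = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
score = iou(prediction, ground_truth)
is_match_at_050 = score >= 0.50  # False here, since the IoU is 1/3
```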
Model FLOPs: 20.213G Params: 22.895M
Model summary: 173 layers, 22894902 parameters, 22894902 gradients
Copied from NVIDIA SSD PyTorch:
Before we feed data to the model, both during training and inference, we perform:
- JPEG decoding
- normalization with mean = [0.485, 0.456, 0.406] and std dev = [0.229, 0.224, 0.225]
- encoding bounding boxes
- resizing to 300x300
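The normalization step can be sketched per pixel as below. This is an illustration rather than the repository's code: it assumes the mean and standard deviation apply to RGB values already scaled to [0, 1] (the usual convention for these ImageNet statistics), and the name `normalize_pixel` is hypothetical.

```python
# Per-channel statistics from the list above (RGB order assumed).
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))


# A black pixel maps to -mean/std in every channel.
black = normalize_pixel((0, 0, 0))
```

In a real pipeline the same arithmetic is applied to the whole image tensor at once, not pixel by pixel.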
Additionally, during training:
- data is randomly shuffled
- samples without annotations are skipped
During training we perform the following augmentation techniques:
- Random crop using the algorithm described in the SSD: Single Shot MultiBox Detector paper
- Random horizontal flip
- Color jitter
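Augmentations that change image geometry must transform the bounding boxes consistently. As an example, a random horizontal flip can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: boxes are assumed to be `(left, top, right, bottom)` tuples normalized to [0, 1], the flip probability of 0.5 is a guess, and the function names are hypothetical.

```python
import random


def hflip_boxes(boxes):
    """Mirror normalized (left, top, right, bottom) boxes left to right."""
    # The old right edge becomes the new left edge, and vice versa.
    return [(1.0 - r, t, 1.0 - l, b) for (l, t, r, b) in boxes]


def random_hflip(boxes, p=0.5, rng=random):
    """Flip boxes with probability p; return (was_flipped, boxes).

    The caller must mirror the image pixels whenever was_flipped is True.
    """
    if rng.random() < p:
        return True, hflip_boxes(boxes)
    return False, list(boxes)


# A box near the left edge ends up near the right edge after the flip.
flipped = hflip_boxes([(0.1, 0.2, 0.4, 0.6)])
```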