object_detection_and_training_wiki
Object detection using darknet_ros_py package
This package performs the YOLOv3 object detection and identification.
It allows toggling the YOLO detector on/off and idles while waiting for input images.
The steps to train the neural network, in order to perform the detections later are the following:
- Collect samples of images containing the objects to be detected (the training dataset).
- Label the images using a labelling tool. Typically you draw a bounding box around the object(s) of interest, and the tool generates a text file containing the coordinates of the bounding boxes and their label ids.
- Augment the dataset by generating more images and labels from the labelled ones, and using adversarial examples to improve the neural network.
- Prepare the configuration files that YOLOv3 needs to configure the training and to index and read the images and labels.
- Train the neural network: once the dataset is ready with images and labels, it can be fed to a training algorithm that performs the training. Typically the training is a computationally intensive operation that, even when accelerated by a GPU or TPU, takes several hours to converge and reach acceptable results.
Previous guide for Training: http://dante.isr.tecnico.ulisboa.pt/socrob_at_home/isr_monarch_robot/blob/kinetic/mbot_tools/data_augmentation/README.md
Darknet ROS README: http://dante.isr.tecnico.ulisboa.pt/socrob_at_home/isr_tiago/-/blob/ros-melodic/perception_packages/darknet_ros_py/ros/binaries/darknet_gpu/README.md
To compile darknet_ros_py, go to the binaries folder:
isr_tiago/perception_packages/darknet_ros_py/ros/binaries/darknet_gpu
then do:
make clean
Open the Makefile and change the ARCH flag to match your CUDA version and GPU architecture. You can find the supported architectures at: https://en.wikipedia.org/wiki/CUDA#Supported_GPUs
Check the other flags and set them according to your setup requirements (a concrete example follows after the list):
- `GPU=1` to build with CUDA to accelerate by using the GPU (CUDA should be in `/usr/local/cuda`)
- `CUDNN=1` to build with cuDNN v5-v7 to accelerate training by using the GPU (cuDNN should be in `/usr/local/cudnn`)
- `CUDNN_HALF=1` to build for Tensor Cores (on Titan V / Tesla V100 / DGX-2 and later): speeds up detection 3x and training 2x
- `OPENCV=1` to build with OpenCV 4.x/3.x/2.4.x - allows detecting on video files and video streams from network cameras or webcams
- `DEBUG=1` to build a debug version of Yolo
- `OPENMP=1` to build with OpenMP support to accelerate Yolo by using a multi-core CPU
- `LIBSO=1` to build the library `darknet.so` and a binary runnable file `uselib` that uses this library. You can try to run it with `LD_LIBRARY_PATH=./:$LD_LIBRARY_PATH ./uselib test.mp4`. To use this SO-library from your own code, look at the C++ example https://github.com/AlexeyAB/darknet/blob/master/src/yolo_console_dll.cpp or use it in such a way: `LD_LIBRARY_PATH=./:$LD_LIBRARY_PATH ./uselib data/coco.names cfg/yolov4.cfg yolov4.weights test.mp4`
- `ZED_CAMERA=1` to build a library with ZED-3D-camera support (the ZED SDK should be installed), then run `LD_LIBRARY_PATH=./:$LD_LIBRARY_PATH ./uselib data/coco.names cfg/yolov4.cfg yolov4.weights zed_camera`
then do:
make
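As a concrete example, the sequence below builds the GPU version with a typical set of flags. The flag values and the ARCH line are only assumptions; match them to your own CUDA install and GPU (see the CUDA link above):

```bash
cd isr_tiago/perception_packages/darknet_ros_py/ros/binaries/darknet_gpu
make clean

# Example only: enable the GPU, cuDNN and OpenCV builds, assuming these flags
# are still at their default value of 0 in the Makefile; adjust to your setup.
sed -i 's/^GPU=0/GPU=1/'       Makefile
sed -i 's/^CUDNN=0/CUDNN=1/'   Makefile
sed -i 's/^OPENCV=0/OPENCV=1/' Makefile

# Set ARCH for your GPU, e.g. compute capability 6.1 (GTX 10xx cards):
# ARCH= -gencode arch=compute_61,code=[sm_61,compute_61]

make -j"$(nproc)"
```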
Using the robot's camera (real or simulated), acquire as many images per object as possible, taking the following aspects into consideration:
- Single objects:
  - Object pose relative to the camera: translations and rotations in 3D, e.g. objects on the top/bottom/left/right/center of the frame, different camera angles and object angles.
  - Diversity of backgrounds: vary camera angles and object positions in the environment. Basically, the pixel information surrounding the objects should change.
  - Occlusions of different parts of the object: deliberately capture partial views of the object. It can simply be occluded by other elements of the environment (including the robotic arm and gripper), or be cut off at the edges of the camera frame.
  - Illumination
- Multiple objects:
  - All of the considerations above, but now with multiple objects per frame.
  - Capture images with the maximum number of objects per frame; ideally, if the number of objects and their dimensions allow it, try to capture them all in the same frame. They can even be repeated instances of the same object in different positions.
  - Mix and match the occlusions between objects.
  - This is a tradeoff between labelling time and precision of detections.
- If the capture/detection will happen in a simulated environment, there are scripts for Gazebo to spawn the objects in different locations and to vary the robot camera poses.
After the images are acquired, they should be well organized, named and divided for the labelling task ahead.
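One possible (purely illustrative) way to organize them is one folder per object with sequentially numbered file names, e.g. with a small shell loop like this (paths and naming scheme are assumptions):

```bash
# Rename the images in each per-object folder to <object>_<index>.jpg,
# e.g. dataset/raw/coke_can/coke_can_0001.jpg
for dir in dataset/raw/*/; do
    obj=$(basename "$dir")
    i=1
    for img in "$dir"*.jpg; do
        [ -e "$img" ] || continue       # folder may contain no .jpg files
        printf -v name '%s_%04d.jpg' "$obj" "$i"
        mv -n "$img" "$dir$name"
        i=$((i + 1))
    done
done
```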
NOTES
- Depending on where the detections will be performed, the dataset should contain images similar to the environment where the detections will occur (AGI is not a thing yet). Aspects to take into consideration are:
  - simulation vs real world
  - the camera and its properties (same camera, lenses, FOV, exposure, etc.)
  - illumination of the environment
  - relative angles to the objects (3D)
  - object visual persistence (shape, colors, text, etc.)
  - occlusions
  - multiple objects per image
- So for now, if the competition is in simulation, the dataset should contain only images acquired in simulation, and for real-world competitions, images of the real objects acquired via the robot's camera.
- A word about overfitting: overfitting means the neural network performs well only under a narrow range of conditions, failing very easily outside the conditions of the training dataset. Counterintuitive as it is, some of this is actually desirable here. In research the goal is generally to advance the state of the art; in robotics competitions the goal is to perform optimally on the tasks and win, and one of the biggest constraints is time. So basically, it is desirable to make the detections as close to perfect as possible, sacrificing generality, since downstream the perception, planning and manipulation stacks rely heavily on good object detections.
- Some of the above aspects are being mitigated by the state of the art in generalized perception research. For instance, recent results with photorealistic simulators translate well to detections in the real world. For the moment this is not a reality in RoboCup/ERL (non-photorealistic simulator), but it's a hint for the future.
For yolo training, the labelling is "simply" the generation of a txt file that complements the image. Typically they have the same name, just different extensions (jpg/png, txt). The txt file contains, on each line, a class id and the normalized coordinates that define one bounding box (X_CENTER_NORM Y_CENTER_NORM WIDTH_NORM HEIGHT_NORM). So there are 3 possible cases per image:
- No object in the image -> empty txt file
- 1 object in the image -> txt with 1 line, containing the object id and the coordinates specified above
- n objects in the image -> txt with n lines
Keep in mind that, depending on the labelling tool, it may or may not generate a txt file when you skip an image that does not contain any object. It is important to keep the empty txt files and their images, as they serve as adversarial examples (explained below).
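For illustration, a label file image_0001.txt accompanying image_0001.jpg that contains two objects could look like this (the class ids and values are made up; all coordinates are normalized by the image width/height):

```
0 0.512 0.433 0.210 0.180
2 0.118 0.760 0.095 0.140
```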
Recommended tools for labelling:
- YOLO-Label (Windows only)
- Yolo_mark
- LabelImg
Before you start the labelling effort, you need a dictionary file that maps each id to its label: Example. The labelling tool usually asks for this file.
TIPS
- Since each tool is different, read the instructions of the tool you will use and practice the shortcuts.
- The faster you go, the more samples you get, and the better the neural network converges. Remember the time constraints: labelling is a repetitive and mechanical task that should be done with the maximum precision possible.
- IN THE TRADEOFF BETWEEN QUALITY AND QUANTITY OF LABELLED IMAGES, QUALITY ALWAYS WINS. Meaning: it is better to have a few good labels than many bad ones. Augmentation takes care of one aspect of the quantity anyway. At the limit, aim for 0 pixel tolerance between the boundary of the bounding box and the object limits.
- Keep strong, divide the work between colleagues, and save your eyesight by taking frequent breaks and not staring at the same images for too long; it might give you side effects like vertigo or labelling nightmares :)
- But hey, someone needs to train our future robot overlords ;)
Once you have the images and labels organized in folders, it is recommended to put ALL the images and labels into a single folder.
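A minimal sketch of that merge step, assuming (hypothetically) that the labelled data lives in per-object subfolders under dataset/labelled:

```bash
mkdir -p dataset/all
# Copy every image and its companion label file into one flat folder.
find dataset/labelled \( -name '*.jpg' -o -name '*.png' -o -name '*.txt' \) \
    -exec cp -n {} dataset/all/ \;
```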
As an alternative to manual labelling, automated labelling techniques can be used, such as capturing a video and using a simple generic tracker that automatically moves the label across the frames of the video. In this case consecutive frames with only minor changes between them are not that useful, so the video should be filtered, e.g. to one frame per second. Also keep in mind the required labelling precision, i.e. when the camera is moving the object may change scale while the label does not.
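If you take the video route, extracting roughly one frame per second can be done with ffmpeg, for example (file names are placeholders):

```bash
mkdir -p frames
# -vf fps=1 keeps approximately one frame per second of video.
ffmpeg -i capture.mp4 -vf fps=1 frames/frame_%04d.png
```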
Augmenting the dataset basically means generating more training samples without the hassle of manual labelling.
Typical augmentations take copies of the original images+labels and apply rotations, scaling, cropping (onto different backgrounds), distortions, etc. It's a bit hard to find open source augmentation scripts, as they are considered "easy" OpenCV scripts, but here are some examples: 1, 2, 3.
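As a small illustration of what such a script does, the sketch below horizontally flips each image with ImageMagick and mirrors the x coordinate in the companion label file (a horizontal flip maps x_center to 1 - x_center; width, height and class id stay the same). The paths and the _flip suffix are assumptions:

```bash
for img in dataset/all/*.jpg; do
    base="${img%.jpg}"
    # Flip the image left-right.
    convert "$img" -flop "${base}_flip.jpg"
    # Mirror the x_center of every bounding box; empty label files stay empty.
    awk '{ printf "%s %.6f %s %s %s\n", $1, 1 - $2, $3, $4, $5 }' \
        "${base}.txt" > "${base}_flip.txt"
done
```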
Adversarial examples: random images that do not contain the objects to be detected, or that contain objects similar to, but not quite the same as, the ones we want to detect. These make the neural network converge better on the objects we actually want it to detect. Such files should be included in the dataset with empty label txt files.
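A quick way to add such negatives, assuming (hypothetically) that they sit in a negatives/ folder:

```bash
# Copy the negative images into the dataset and give each one an empty label
# file, so the network learns there is nothing to detect in them.
for img in negatives/*.jpg; do
    cp -n "$img" dataset/all/
    touch "dataset/all/$(basename "${img%.jpg}").txt"
done
```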
And that's it: with much less effort and time compared to the manual labelling, you now have a much richer and bigger dataset that will enhance the neural network training.
Now, about the yolo configuration and parameter files:
- obj.names - This is the same file used above for the labelling, one label per line. Example
- obj.data - A file that contains the number of object classes and the relative paths to the dataset file lists. Example. The relative paths should be set for the paths in the cloud (e.g. google drive), or local paths if the training is done locally (GPU "necessary"). In this example there are 4 classes (as can be confirmed by the .names file), the "valid" list is the same as the "train" one (check the yolo description for more details), "names" is the relative path to the .names file and "backup" is where the intermediate trained weights will be backed up.
- train.txt - Another file necessary for yolo training is "train.txt", which contains all the image filepaths that will be used in the training. A simple way to generate this file is to creatively modify the output of `ls`, e.g. `ls *.{jpg,JPG,png} > train.txt` in the folder containing all the labelled images. Remember to manipulate/filter this file so that each line contains only the relative path to the image, e.g. content/image.png (an illustrative sketch follows below).
- yolov3.cfg - This is the main configuration file that will be used by the training. It doesn't need to be changed, but keep in mind its location.
For more details, check the official YOLOv3 documentation.
These files will be addressed in the training component.
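To make the relationship between these files concrete, here is a hedged sketch; the class names, the class count and all paths are purely illustrative and must be replaced by the ones of your own dataset and Drive/local layout:

```bash
# obj.names - one label per line, e.g.:
#   coke_can
#   mug
#   cereal_box
#   sponge
#
# obj.data - class count plus relative paths, e.g.:
#   classes = 4
#   train   = content/train.txt
#   valid   = content/train.txt
#   names   = content/obj.names
#   backup  = content/backup/

# Generate train.txt with one relative image path per line:
cd /path/to/dataset_root                         # hypothetical location
find content \( -name '*.jpg' -o -name '*.png' \) > content/train.txt
```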
Now that you have the training dataset labelled and augmented and the configuration files ready, it's time to finally start the training. Keep in mind that this can be (and normally is) an iterative process, i.e. you collect some images, label them, train, check how the detections look, analyse, decide what can be improved, capture more images and repeat. The weights backup system was also designed with this in mind, so it is important to make manual backups during the training (as it takes several hours), analyse the detection quality, and capture and label more images in parallel with the training, etc.
A straightforward way to train is to:
- Have a Google account with enough space in Google Drive (for the dataset images)
- Have access to Google Colab and use the following notebook
This has been the approach used in the past, and an already modified notebook can be accessed at this link.
Follow the steps in the notebook and adapt, where necessary, the paths to your own Google Drive "darknet" folder, as well as the paths to the configuration files mentioned in step 4 (don't forget to adapt the paths inside the files themselves).
Make sure to follow the step where you get access to a free GPU in Colab, as this is the most critical point of this training approach.
There are limitations on time and on the availability required from the user. Google has implemented safeguards against abuse, so, for example, you should keep the browser window in focus and move your mouse once in a while. Once every couple of hours it also asks for a captcha. Be warned that if you fail to react to these prompts, the GPU access for that account is cut off.
If you do get kicked out and lose the GPU and the running training, fear not: nothing is lost, because the backups made every couple of epochs are saved to Google Drive.
To continue with a GPU in Colab you unfortunately have to use another user, so it's recommended to have multiple Google accounts to repeat the steps and continue the training, sharing the resources (dataset, configuration files, darknet folder) on Google Drive between these accounts.
When continuing the training, load the weights file that was backed up before you were kicked out.
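In practice this usually means pointing the darknet training command (in the notebook or locally) at the backed-up weights instead of the pre-trained convolutional weights; a hedged example, where xxx stands for the latest iteration number found in your backup folder:

```bash
# Resume training from the most recent backed-up weights.
./darknet detector train data/obj.data cfg/yolo-obj.cfg backup/yolo-obj_xxx.weights
```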
YOLO MARK
To label images, Yolo_mark is a good tool: YOLO MARK
To use Yolo_mark on Linux, just run:
cmake .
make
./linux_mark.sh
To use it for labelling your own custom images:
- Delete all files from the directory x64/Release/data/img
- Put your .jpg images into the directory x64/Release/data/img
- Change the number of classes (objects for detection) in the file x64/Release/data/obj.data: https://github.com/AlexeyAB/Yolo_mark/blob/master/x64/Release/data/obj.data#L1
- Put the names of the objects, one per line, in the file x64/Release/data/obj.names: https://github.com/AlexeyAB/Yolo_mark/blob/master/x64/Release/data/obj.names
TRAINING
- Create a file yolo-obj.cfg with the same content as 'yolov3-mbot'. Parameters to change:
  - Comment out the testing settings and uncomment the training ones.
  - When using images obtained from bags it is important to use: width=640 height=480
  - There will be 3 [yolo] layers. The number of classes is defined in each of them, so it needs to be changed for training, e.g.:
[yolo]
mask = 0,1,2
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=13
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
Before each [yolo] layer there is a [convolutional] layer, and in the three [convolutional] layers immediately before the [yolo] layers it is necessary to change the number of filters to filters=(classes + 5)x3. For example, for classes=13 this gives filters=(13+5)x3=54, as in the snippet below:
[convolutional]
size=1
stride=1
pad=1
filters=54
activation=linear
- This cfg file should be stored in the cfg folder
- Inside the data folder, you need:
  - a data/obj folder with the images you got from yolo mark and the .txt label files
  - the .names file
  - the .data file
  - the train.txt file (done automatically by yolo mark)
- Download the pre-trained weights for the convolutional layers (154 MB): https://pjreddie.com/media/files/darknet53.conv.74 and put them into the darknet folder
- Start training by using the command line:
./darknet detector train data/obj.data cfg/yolo-obj.cfg darknet53.conv.74
- A file yolo-obj_xxx.weights will be saved to darknet/backup/ every 100 iterations
Note: if during training you see nan values in the avg (loss) field, then the training is going wrong; but if nan appears in some other lines, then the training is going well.
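Once some weights have been saved, a quick way to eyeball the detection quality on a single image is darknet's test mode, which shows/saves an annotated predictions image in the darknet folder (the weights and image names below are placeholders):

```bash
# Run a single-image detection with the trained weights.
./darknet detector test data/obj.data cfg/yolo-obj.cfg backup/yolo-obj_xxx.weights data/obj/some_image.jpg
```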
In Google Colab: https://colab.research.google.com/drive/1NufUk1ZnMdKcLN6J16QMaPEEkCtIpK4S?usp=sharing
SocRob Drive: https://drive.google.com/drive/u/0/folders/1YOv4FWU71OCGvSLpDZcqsfs_frHgfL64 (approximately 1200 images)
The weights from past competitions are (as of 21/12/2021) shared in the following google drive.
These weights can be used directly in the detector, but keep in mind the changes to the yolo configuration files (.names, .cfg, etc.).
This pkg is a workaround: at the time of implementation, darknet_ros was not as well established. This package can be replaced by the "official" one from leggedrobotics here.
For this, the messages need to be adapted: msg type in the parallel_object_detector pkg