
A reinforcement learning object detector leveraging saliency ranking, offering a self-explainable system with a fully observable action log. | B.Sc. IT (Hons) Artificial Intelligence Dissertation | UOM Dean's List Awards 2024


OVD-Labs/SaRLVision


SaRLVision

"A reinforcement learning object detector which leverages saliency ranking."


"A self-explainable detector that provides a fully observable action log."


Abstract

In an era where sustainability and transparency are paramount, the importance of effective object detection algorithms, pivotal for enhancing efficiency, safety, and automation across various domains, cannot be overstated. While algorithms such as YOLO and Faster R-CNN are notably fast, they lack transparency in their decision-making process. This study explores a series of experiments on object detection that combine reinforcement learning-based visual attention methods with saliency ranking techniques, in an effort to investigate transparent and sustainable solutions. By employing saliency ranking techniques that emulate human visual perception, the reinforcement learning agent is provided with an initial bounding box prediction. The agent then iteratively refines this bounding box prediction by selecting from a finite set of actions over multiple time steps, ultimately achieving accurate object detection. This research also investigates the use of various image feature extraction methods and explores diverse Deep Q-Network (DQN) architectural variations for training the deep reinforcement learning-based localisation agent. Additionally, it focuses on optimising the pipeline at every juncture by prioritising lightweight and faster models. Another feature of the proposed system is the classification of detected objects, a capability absent in previous reinforcement learning approaches. After evaluating the performance of these agents on the Pascal VOC 2007 dataset, faster and more optimised models were developed. Notably, the best mean Average Precision (mAP) achieved in this study was 51.4, surpassing benchmarks from RL-based single object detectors present in the literature. The designed system provides a distinct edge over previous methods by allowing multiple configurable real-time visualisations. These visualisations offer users a clear view of the current bounding boxes' coordinates and the types of actions being performed, both of which enable a more intuitive understanding of algorithmic decisions. Ultimately, this fosters trust and transparency in object detection systems, aiding the deployment of artificial intelligence techniques in high-risk areas while continuing to advance research in the field of AI.

System Overview

The system first generates a saliency ranking heatmap from the input image, emphasising regions of interest. It then takes the highest-ranked regions to create an initial bounding box prediction, a key stage in object localisation. This prediction is fed to the RL environment, where an agent takes actions over a series of time steps to refine the bounding box and precisely locate the object within the image, while also predicting the object's class label.

Architecture

Saliency Ranking

The first stage of the system uses saliency ranking to derive an initial bounding box estimate. Alternatively, users may choose not to employ this technique, in which case the initial bounding box covers the entire input image, a practice commonly observed in the existing literature. Once the saliency ranking heatmap has been obtained from SaRa, a bounding box that delineates the pertinent image segments is extracted. This technique considers a proportion of the highest-ranked areas, with a fixed threshold of 30% and the number of iterations set to 1. Generating these initial bounding boxes is critical because it separates and delineates the prominent regions of the image for further refinement using RL techniques.
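The thresholding step can be illustrated with a minimal sketch. It assumes the heatmap is available as a grid of per-segment ranks (rank 1 being the most salient) and that the initial box is simply the rectangle enclosing the top 30% of segments. The function name `initial_box_from_ranks` and the grid representation are hypothetical and not taken from the repository, and the iteration parameter mentioned above is not modelled.

```python
from typing import List, Tuple


def initial_box_from_ranks(
    ranks: List[List[int]],
    image_size: Tuple[int, int],
    threshold: float = 0.30,
) -> Tuple[int, int, int, int]:
    """Return (x1, y1, x2, y2) enclosing the highest-ranked grid segments.

    ranks[r][c] is the saliency rank of grid cell (r, c), where ranks are
    assumed to be a permutation of 1..N and rank 1 is the most salient.
    """
    width, height = image_size
    rows, cols = len(ranks), len(ranks[0])
    # Number of segments kept by the threshold (at least one).
    n_keep = max(1, round(threshold * rows * cols))
    kept = [(r, c) for r in range(rows) for c in range(cols) if ranks[r][c] <= n_keep]
    cell_w, cell_h = width / cols, height / rows
    # Enclose all kept segments in a single axis-aligned rectangle.
    x1 = min(c for _, c in kept) * cell_w
    y1 = min(r for r, _ in kept) * cell_h
    x2 = (max(c for _, c in kept) + 1) * cell_w
    y2 = (max(r for r, _ in kept) + 1) * cell_h
    return int(x1), int(y1), int(x2), int(y2)


example_ranks = [
    [9, 8, 7],
    [6, 1, 2],
    [5, 3, 4],
]
print(initial_box_from_ranks(example_ranks, (300, 300)))
```

With a 3x3 grid, the 30% threshold keeps the three best-ranked segments, so the box covers the lower-right two-by-two block of cells.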

SaRa

Reinforcement Learning

In the next phase of the pipeline, reinforcement learning is used to localise objects within the images. To this end, the system was built on the Gymnasium API, which facilitated the formulation of the problem as a Markov Decision Process (MDP), inspired by the existing literature. Deep Reinforcement Learning (DRL) techniques were then applied to tackle the object detection problem.
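To make the MDP framing concrete, the following is a toy sketch rather than the project's environment. It mirrors the Gymnasium convention that `reset` returns `(observation, info)` and `step` returns `(observation, reward, terminated, truncated, info)`, but it does not import Gymnasium. The class name, the three-action set, the reward values (+1/-1 per step, +3/-3 on trigger) and the 0.5 IoU threshold are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Box = List[float]  # [x1, y1, x2, y2]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class ToyLocalisationEnv:
    """Hypothetical MDP: state = current box, actions = shrink / grow / trigger."""

    SHRINK, GROW, TRIGGER = 0, 1, 2

    def __init__(self, start_box: Box, target_box: Box,
                 max_steps: int = 20, iou_threshold: float = 0.5) -> None:
        self.start_box, self.target_box = list(start_box), list(target_box)
        self.max_steps, self.iou_threshold = max_steps, iou_threshold
        self.box, self.steps = list(start_box), 0

    def reset(self) -> Tuple[Box, Dict[str, float]]:
        self.box, self.steps = list(self.start_box), 0
        return list(self.box), {"iou": iou(self.box, self.target_box)}

    def step(self, action: int) -> Tuple[Box, float, bool, bool, Dict[str, float]]:
        old_iou = iou(self.box, self.target_box)
        self.steps += 1
        truncated = self.steps >= self.max_steps
        if action == self.TRIGGER:
            # Assumed terminal reward: bonus or penalty depending on the final IoU.
            reward = 3.0 if old_iou >= self.iou_threshold else -3.0
            return list(self.box), reward, True, False, {"iou": old_iou}
        # Shrink moves every side inward by 10% of the box size; grow moves it outward.
        sign = 1.0 if action == self.SHRINK else -1.0
        dx = sign * 0.1 * (self.box[2] - self.box[0])
        dy = sign * 0.1 * (self.box[3] - self.box[1])
        self.box = [self.box[0] + dx, self.box[1] + dy, self.box[2] - dx, self.box[3] - dy]
        new_iou = iou(self.box, self.target_box)
        # Assumed step reward: +1 if the move improved IoU, otherwise -1.
        reward = 1.0 if new_iou > old_iou else -1.0
        return list(self.box), reward, False, truncated, {"iou": new_iou}
```

An episode then consists of calling `reset()` once and `step(action)` repeatedly until either `terminated` (the agent triggered) or `truncated` (the step budget ran out) is true.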

Action Space

Similar to methodologies commonly employed in object localisation tasks, the action set $A$ consists of eight transformations that can be applied to the bounding box, along with one action designated to terminate the search process. These transformations are grouped into four subsets: horizontal box movement, vertical box movement, scale adjustment, and aspect ratio modification. Consequently, the agent has four degrees of freedom to adjust the bounding box $[x_1, y_1, x_2, y_2]$ during interactions with the environment. Additionally, a trigger action is incorporated to indicate successful object localisation by the current box, thereby concluding the ongoing search sequence and drawing an Inhibition of Return (IoR) marker on the detected object.
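As a rough sketch of how such an action set can act on $[x_1, y_1, x_2, y_2]$, the snippet below encodes each action as a set of per-coordinate offsets proportional to the current box size. The action names, the step factor `ALPHA = 0.2` and the exact offsets are assumptions for illustration, and clipping to the image bounds is deliberately left out.

```python
from typing import List

ALPHA = 0.2  # assumed step size, as a fraction of the current box width/height


def apply_action(box: List[float], action: str, alpha: float = ALPHA) -> List[float]:
    """Apply one of eight transformations (or the trigger) to [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = box
    dw, dh = alpha * (x2 - x1), alpha * (y2 - y1)
    # Offsets added to (x1, y1, x2, y2), grouped by the four degrees of freedom.
    offsets = {
        "right":   (dw, 0, dw, 0),                   # horizontal movement
        "left":    (-dw, 0, -dw, 0),
        "down":    (0, dh, 0, dh),                   # vertical movement
        "up":      (0, -dh, 0, -dh),
        "bigger":  (-dw / 2, -dh / 2, dw / 2, dh / 2),   # scale adjustment
        "smaller": (dw / 2, dh / 2, -dw / 2, -dh / 2),
        "fatter":  (0, dh / 2, 0, -dh / 2),          # aspect ratio: reduce height
        "taller":  (dw / 2, 0, -dw / 2, 0),          # aspect ratio: reduce width
        "trigger": (0, 0, 0, 0),                     # terminate, box unchanged
    }
    d = offsets[action]
    return [x1 + d[0], y1 + d[1], x2 + d[2], y2 + d[3]]


print(apply_action([0, 0, 100, 50], "right"))
print(apply_action([0, 0, 100, 50], "smaller"))
```

Making the offsets proportional to the current width and height means that the steps shrink as the box tightens around the object, which is one common design choice for this kind of agent.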

Actions

Deep Q-Network Architecture

The DQN architecture introduced in the presented system is responsible for decision-making in object localisation. To this end, the designed architecture draws inspiration from methodologies in the prevalent literature. The proposed approach introduces four DQN variants:

  1. Vanilla DQN (DQN)
  2. Double DQN (DDQN)
  3. Dueling DQN (Dueling DQN)
  4. Double Dueling DQN (D3QN)

Our approach advocates a deeper DQN network to bolster decision-making capabilities and increase learning capacity. To mitigate overfitting, dropout layers are integrated into the network architecture. Additionally, this work develops a Dueling DQN agent to improve learning efficiency by decoupling the state-value and advantage functions. The Dueling DQN design divides the $Q$-value function into two streams, allowing the agent to better estimate the value of taking specific actions in different situations. The proposed approach also evaluates the DDQN and D3QN techniques, which had not previously been examined in this setting, in pursuit of better results.
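The two ideas behind these variants can be sketched on plain lists of numbers, without any network code. The first function combines a state value and per-action advantages in the commonly used mean-subtracted form; the second computes a Double DQN target, where the online network selects the next action and the target network evaluates it. Whether the repository uses exactly these forms is an assumption, and the function names are hypothetical.

```python
from typing import List


def dueling_q_values(state_value: float, advantages: List[float]) -> List[float]:
    """Combine the two streams: Q(s, a) = V(s) + A(s, a) - mean(A(s, .))."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


def double_dqn_target(
    reward: float,
    done: bool,
    gamma: float,
    next_q_online: List[float],
    next_q_target: List[float],
) -> float:
    """Bootstrapped target: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


print(dueling_q_values(1.0, [1.0, 2.0, 3.0]))
print(double_dqn_target(1.0, False, 0.9, [0.1, 0.5], [2.0, 1.0]))
```

In the second call, a vanilla DQN target would bootstrap from the largest target-network value (2.0), whereas the Double DQN target uses the value of the action preferred by the online network (1.0). A D3QN agent combines both functions.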

DQN Architecture

Self-Explainability

The study proposes a system that creates a log and displays the current environment in several rendering modes to illustrate explainability, as demonstrated below:

Visualisations

These visualisations provide users with insights into the current action being performed, the current IoU, the current Recall, the environment step counter, the current reward, and a clear view of the current bounding box and ground truth bounding box locations in the original image. Furthermore, unlike the object detectors and methodologies previously discussed, this methodology permits observation of decision-making during the training phase, although creating the visualisations adds a slight time overhead. The system also provides a clear log outlining the framework's decision-making process for the object currently being detected, giving insight into the object detector's training and evaluation, as observed below:

Self-Explainability

SaRLVision Window

The SaRLVision Window provides a real-time view of the object detection process, displaying the current state of the environment, the actions being taken, and the corresponding results. This interactive window is designed to be user-friendly, providing a clear and intuitive interface for users to understand the workings of the system.

Window1 Window2 Window3

This window is implemented using the Pygame library, a popular framework for building interactive applications and games in Python. Pygame provides a set of functions and classes for creating graphical user interfaces, handling user input, and rendering graphics on the screen. By utilising the Pygame API, the SaRLVision Window is able to provide a visually appealing interface to users.

Evaluation

The SaRLVision system was extensively tested across a set of experiments covering a number of environment configurations with the aforementioned agents. The system was evaluated using the mean Average Precision (mAP) metric, which is widely used for evaluating object detection models. mAP is calculated by computing the Average Precision (AP) for each class, which summarises that class's precision-recall curve, and then averaging the AP values across all classes. This metric provides a comprehensive evaluation of the model's performance, taking into account both precision and recall. Furthermore, the developed agents were trained on the Pascal VOC 2007+2012 training set and evaluated on the Pascal VOC 2007 test set, a standard benchmark for object detection that is prevalent in the literature. The results for the different agents and environment configurations are presented below:
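For readers unfamiliar with the metric, here is a small sketch of one way to compute it. It uses the 11-point interpolated AP associated with the Pascal VOC 2007 protocol; the source does not state which AP variant was used, so that choice is an assumption, as are the function names. Each detection is assumed to have already been matched against the ground truth and marked as a true or false positive.

```python
from typing import Dict, List, Tuple


def average_precision(detections: List[Tuple[float, bool]], n_ground_truth: int) -> float:
    """11-point interpolated AP for one class.

    detections: (confidence, is_true_positive) pairs for that class.
    n_ground_truth: number of ground-truth objects of that class.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    points = []  # (recall, precision) after each ranked detection
    for i, (_, hit) in enumerate(ranked, start=1):
        true_positives += int(hit)
        points.append((true_positives / n_ground_truth, true_positives / i))
    total = 0.0
    for k in range(11):
        recall_level = k / 10
        # Interpolated precision: best precision at any recall >= this level.
        total += max((p for r, p in points if r >= recall_level), default=0.0)
    return total / 11


def mean_average_precision(ap_per_class: Dict[str, float]) -> float:
    """Average the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


ap_cat = average_precision([(0.9, True), (0.8, False), (0.7, True)], n_ground_truth=2)
print(ap_cat, mean_average_precision({"cat": ap_cat, "dog": 1.0}))
```

Results such as the 51.4 reported above are conventionally this mean expressed as a percentage.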

Evaluation Diagram

A Demonstration

Need a visual demonstration of the system in action? Check out the following video:

https://www.youtube.com/watch?v=S8gUYhOAAkE

Getting Started

The following Jupyter notebooks are provided to demonstrate the functionality of the system:

Installation

To get started, clone the repository and navigate to it:

git clone https://github.com/mbar0075/SaRLVision.git
cd SaRLVision

You can also recreate the environment used for this project from the environment.yml file provided in the Requirements directory. To do so, you will need Anaconda installed on your machine. If you don't have Anaconda installed, you can download it from here. Once Anaconda is installed, run the following commands to create the environment and activate it:

cd Requirements
conda env create -f environment.yml
conda activate SaRLVision

Alternatively, you can create the environment manually by running the following commands, which install the packages listed in the requirements.txt file in the Requirements directory:

cd Requirements
conda create --name SaRLVision python=3.9.16
conda activate SaRLVision
pip install -r requirements.txt

In case you want to install the packages manually, you can do so by running the following commands:

pip install . . .
pip install notebook
conda install swig
conda install nomkl
pip install opencv-python
pip install "gymnasium[all]"
pip install ufal.pybox2d
pip install pygame
pip install renderlab
pip install numpy
pip install matplotlib
pip install pandas
pip install seaborn
pip install scikit-learn
pip install pycotools

# Installing PyTorch with CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Installing tensorflow with CUDA 11.2
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
# Anything above 2.10 is not supported on the GPU on Windows Native
python -m pip install "tensorflow<2.11"
# Verify the installation:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

In case of any further issues, you can install CUDA from the following links: NVIDIA CUDA Toolkit, Windows 11.8, and install the corresponding PyTorch and TensorFlow versions from the following links: PyTorch, TensorFlow, respectively.
