diff --git a/404.html b/404.html index 5e9354d..3b0253d 100644 --- a/404.html +++ b/404.html @@ -9,7 +9,7 @@ - + @@ -17,10 +17,10 @@ - + - + @@ -54,7 +54,7 @@ -
+ @@ -303,10 +303,10 @@Alessandro Cerioni, Etat de Geneve - Adrian F. Meyer, FHNW
-Published on November 22, 2021
-Abstract: The STDL develops a framework allowing users to train and use Deep Learning models to detect objects in aerial images. While relying on a generic-purpose third-party Open Source library, the STDL framework implements a somewhat opinionated workflow, targeting georeferenced aerial images and labels. After a brief introduction to object detection, this article provides detailed information on the STDL Object Detection framework and the opinions it implements. References to successful applications are provided along with concluding remarks.
+Alessandro Cerioni, Etat de Geneve - Clémence Herny, Exolabs - Adrian F. Meyer, FHNW - Gwenaëlle Salamin, Exolabs
+Published on November 22, 2021
+Updated on December 12, 2023
+Abstract: The STDL develops a framework allowing users to train and use deep learning models to detect objects in aerial images. While relying on a general-purpose third-party open-source library, the STDL's framework implements a somewhat opinionated workflow, targeting georeferenced aerial images and labels. After a brief introduction to object detection, this article provides detailed information on the STDL's object detection framework and the opinions it implements. References to successful applications are provided along with concluding remarks.
Object detection is a computer vision task which aims at detecting instances of objects of some target classes (e.g. buildings, swimming pools, solar panels, ...) in digital images and videos.
According to the commonly adopted terminology, a distinction is made between the following tasks:
@@ -355,9 +356,9 @@Significant progress has been made over the past decades in the domain of object detection and instance segmentation (cf. e.g. this review paper). Applications of object detection methods are today popular also in real-world products: for instance, some cars are already capable of detecting and reading speed limit signs; social media applications integrate photo and video effects based on face and pose detection. All these applications usually rely on Deep Learning methods, which are the subset of Machine Learning methods leveraging Deep Neural Networks. While referring the reader to other sources for further information on Machine and Deep Learning methods (cf. e.g. these lecture notes), we wish to highlight a point which is key in all these approaches based on learning: no rigid, static, human-engineered rule is given to the machine to accomplish the task. Instead, the machine is provided with a collection of input-output pairs, where the output represents the outcome of a properly solved task. As far as object detection is concerned, we provide Deep Learning algorithms with a set of images accompanied by reference annotations ("ground truth labels"), which the machine is expected to reproduce. Clearly, things become interesting when the machine learns how to generate acceptable detections/segmentations on previously unseen images; such a crucial ability is referred to as "generalization". Strategies exist to measure and improve generalization (more on this here-below).
-A generic framework was developed within the STDL, allowing the usage of state-of-the-art Machine Learning methods to detect objects in aerial images. Such framework allows one to leverage aerial images e.g. to provide valuable hints towards the update of cadastral information. At least as far as Switzerland is concerned, high-resolution (< 30 cm Ground Sample Distance) are acquired at the cantonal and federal scales on a regular basis. An inventory of the STDL applications will be provided at the end of this article.
-In its current version, the STDL object detection framework is powered at its very core by Detectron2, a Python library developed by the Facebook Artificial Intelligence Research group and released under the Apache 2.0 Open Source license. Detectron2 includes methods to train models performing various tasks, object detection and instance segmentation to name a few. Specific, slightly opinionated pieces of code were written by the STDL in order to pre-process data to be input to Detectron2, as well as to post-process outputs and turn them into meaningful information. More precisely, our developments enable the usage of Detectron2 with aerial images and georeferenced labels.
+Significant progress has been made over the past decades in the domain of object detection and instance segmentation (cf. e.g. this review paper). Applications of object detection methods are now also common in real-world products: for instance, some cars are already capable of detecting and reading speed limit signs; social media applications integrate photo and video effects based on face and pose detection. All these applications usually rely on deep learning methods, which are the subset of machine learning methods leveraging deep neural networks. While referring the reader to other sources for further information on machine and deep learning methods (cf. e.g. these lecture notes), we wish to highlight a key point in all these approaches based on learning: no rigid, static, human-engineered rule is given to the machine to accomplish the task. Instead, the machine is provided with a collection of input-output pairs, where the output represents the outcome of a properly solved task. As far as object detection is concerned, we provide deep learning algorithms with a set of images accompanied by reference annotations ("ground truth labels"), which the machine is expected to reproduce. Clearly, things become interesting when the machine learns how to generate acceptable detections or segmentations on previously unseen images; such a crucial ability is referred to as "generalization".
+A generic framework was developed within the STDL, allowing the usage of state-of-the-art machine learning methods to detect objects in aerial images. Such a framework allows one to leverage aerial images, e.g. to provide valuable hints towards the update of cadastral information. At least as far as Switzerland is concerned, high-resolution images (< 30 cm ground sample distance) are acquired at the cantonal and federal scales on a regular basis. An inventory of the STDL's applications will be provided at the end of this article.
+The STDL's object detection framework is powered at its core by Detectron2, a Python library developed by the Facebook Artificial Intelligence Research group and released under the Apache 2.0 open-source license. Detectron2 includes methods to train models performing various tasks, object detection and instance segmentation to name a few. Specific pieces of code were written by the STDL to pre-process data to be input to Detectron2, as well as to post-process outputs and turn them into meaningful information for the projects being developed. More precisely, our developments enable the usage of Detectron2 with georeferenced images and labels.
Our workflow goes through the steps described here-below.
Through this 1st step of our workflow, several requests are issued against a Web Service in order to generate a consistent set of tiled images ("tileset") covering the so-called "Area of Interest" (AoI), namely the area over which the user intends to train a predictive model and/or to perform the actual object detection. Connectors for the following two Web Services have been developed so far:
+Through this 1st step of our workflow, several requests are issued against a Web Service in order to generate a consistent set of tiled images ("tileset") covering the so-called "area of interest" (AoI), namely the area over which the user intends to train a detection model and/or to perform the actual object detection. Connectors for the following two Web Services have been developed so far:
-Our framework is agnostic with respect to the tiling scheme, which the user has to provide as a GeoJSON input file, compliant with some requirements. We refer the user to the code documentation for detailed information about these requirements.
+Except when using the XYZ connector, which requires EPSG:3857, our framework is agnostic with respect to the tiling scheme. The user has to provide the tiling scheme as an input file compliant with some requirements. We refer the user to the code documentation for detailed information about these requirements.
Concerning the AoI and its extension, the following scenarios are supported:
Since Detectron2 natively supports the latter but not the former, we made the obvious choice to opt for the COCO format.
-As mentioned here-above, Machine Learning models are valuable as far as they do not "overfit" to the training data; in other words, as far as they generalize well to new, unseen data. One of the techniques which are commonly used in order to prevent Machine Learning algorithms from overfitting is the "train, validation, test split". While referring the interested reader to this Wikipedia page for further details, let us note that a 70%-15%-15% split is currently hard-coded in our framework.
+As mentioned above, machine learning models are valuable as far as they do not "overfit" to the training data; in other words, as far as they generalize well to new, unseen data. One of the techniques which are commonly used in order to prevent machine learning algorithms from overfitting is the "train, validation, test split". While referring the interested reader to this Wikipedia page for further details, let us note that a 70%-15%-15% split is currently hard-coded in our framework.
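As a rough illustration of such a split, the following Python sketch shuffles a list of tile identifiers with a fixed seed and cuts it into three parts. Only the 70%-15%-15% proportions come from the text; the function name `split_tiles`, the seed and the tile identifiers are hypothetical and are not taken from the STDL code base.

```python
import random


def split_tiles(tile_ids, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split tile identifiers into trn/val/tst subsets (illustrative only)."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(tile_ids)
    # A dedicated, seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_trn = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    return {
        "trn": shuffled[:n_trn],
        "val": shuffled[n_trn:n_trn + n_val],
        "tst": shuffled[n_trn + n_val:],  # whatever remains goes to the test set
    }


splits = split_tiles([f"tile_{i}" for i in range(100)])
print({name: len(ids) for name, ids in splits.items()})
```

Splitting at the level of tiles, rather than individual labels, keeps every label of a given tile in the same subset.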
Various independent COCO tilesets are generated, depending on the scenario:
In inference-only scenarios, a single COCO tileset labeled as "other" is generated (oth is the abbreviation we use).
In training + inference scenarios, the full collection of tilesets is generated: trn, val, tst, oth
At the end of step no. 1 we are left with a collection of consistent (i.e. same size and resolution) tiled images and corresponding COCO files (trn + val + tst and/or oth depending on the scenario).
This 2nd step consists in training a prediction model by iterating over the training dataset, as customarily done in Machine/Deep Learning. As already mentioned, we delegate this crucial part of the process to the superb Detectron2 library; support for other libraries may be implemented in the future, if suitable.
+The 1st step provides a collection of consistent (i.e. same size and resolution) tiled images and corresponding COCO files (trn + val + tst and/or oth depending on the scenario).
The 2nd step consists in training a prediction model by iterating over the training dataset, as customarily done in machine and deep learning. As already mentioned, we delegate this crucial part of the process to the Detectron2 library; support for other libraries may be implemented in the future, if suitable.
Detectron2 comes with a large collection of pre-trained models tailored for various tasks. In particular, as far as instance segmentation is concerned, pre-trained models can be selected from this list.
In our workflow, we set up Detectron2 in such a way that inference is made on the validation dataset every N training iterations, N being a user-defined parameter. By doing this, we can monitor both the training and validation losses throughout the iterative learning and decide when to stop. Typically, learning is stopped when the validation loss reaches a minimum (cf. e.g. this article for further information on early stopping). As training and validation loss curves are somewhat noisy, on-the-fly smoothing is typically applied in order to reveal steady trends. Other metrics may be tracked and used to decide when to stop. For now, within our framework, (early) stopping is done manually and is left to the user; it will be made automatic in the future, following some suitable criterion.
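The idea of smoothing a noisy validation loss and looking for its minimum can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the loss values are made up, the trailing moving average is one possible smoothing among many, and the helper names `moving_average` and `best_iteration` are hypothetical (they belong neither to Detectron2 nor to the STDL code).

```python
def moving_average(values, window=3):
    """Trailing moving average; the first points use a shorter window."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def best_iteration(iterations, val_losses, window=3):
    """Return the iteration at which the smoothed validation loss is lowest."""
    smoothed = moving_average(val_losses, window)
    best_index = min(range(len(smoothed)), key=smoothed.__getitem__)
    return iterations[best_index]


# Made-up validation losses, evaluated every N = 200 training iterations.
iterations = [200, 400, 600, 800, 1000, 1200]
val_losses = [0.90, 0.70, 0.55, 0.50, 0.58, 0.66]
print(best_iteration(iterations, val_losses))
```

Note that a trailing window lags behind the raw curve, so the smoothed minimum may fall slightly later than the raw one; this is worth keeping in mind when choosing which model checkpoint to keep.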
@@ -421,38 +423,51 @@
Let us note that the learning process is regulated by several parameters, which are usually called "hyper-parameters" in order to distinguish them from the learned "parameters", the latter being - in our deep learning context - the coefficients of the many neurons populating the various layers of the deep neural network. In successful scenarios, the iterative learning process does actually lower the validation loss until a minimum value is reached. Yet, such a minimum is likely to be a "local" one (i.e. relative to a given set of hyper-parameters); indeed, the global minimum may be found along a different trajectory, corresponding to a different set of hyper-parameters. Actually, even finding the global minimum of the validation loss may not be as relevant as checking how different models resulting from different choices of the hyper-parameters compare with each other on the common ground of more meaningful "business metrics". Even though our code currently does not implement any automatic hyper-parameter tuning, either in terms of expert metrics or in terms of business ones, we have already set up everything that is needed to produce business metrics, as explained below.
The model trained at the preceding step can be used to perform the actual object detection / instance segmentation over the various tilesets concerned by a given study:
+The model trained at the preceding step can be used to perform the actual object detection or instance segmentation over the various tilesets concerned by a given study:
Depending on the configuration, Detectron2 is capable of performing either object detection and instance segmentation at once, or object detection only. In both cases, every detection is accompanied by the following information:
In the case of object detection only, a bounding box is output as a list of vertices relative to the image coordinate system. In case instance segmentation is demanded, detections are also output under the form of binary ("monochromatic") masks, one per input tile/image, in which pixels belonging to target objects are encoded with ones whereas background pixels are encoded with zeros.
+In the case of object detection only, a bounding box is output as a list of vertices relative to the image coordinate system. When instance segmentation is requested, detections are also output in the form of binary ("monochromatic") masks, one per input tile/image, in which pixels belonging to target objects are encoded with ones whereas background pixels are encoded with zeros.
+Detectron2 output is then converted into a georeferenced vector. Polygon geometry can be simplified using the Ramer-Douglas-Peucker algorithm (RDP) by tuning the epsilon parameter.
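To give an idea of the role played by epsilon, here is a minimal pure-Python sketch of the Ramer-Douglas-Peucker algorithm applied to an open polyline. It is not the STDL's implementation: the function names are hypothetical, and the distance is measured to the chord segment, a common variant of the perpendicular distance to the chord line.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment [a, b]."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Position of the orthogonal projection along the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def rdp(points, epsilon):
    """Simplify an open polyline with the Ramer-Douglas-Peucker algorithm."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord joining the end points.
    distances = [point_segment_distance(p, points[0], points[-1]) for p in points[1:-1]]
    max_distance = max(distances)
    index = distances.index(max_distance) + 1
    if max_distance <= epsilon:
        # Every interior point is close enough to the chord: drop them all.
        return [points[0], points[-1]]
    # Otherwise keep the farthest point and simplify both halves recursively.
    left = rdp(points[: index + 1], epsilon)
    right = rdp(points[index:], epsilon)
    return left[:-1] + right


outline = [(0, 0), (1, 0.05), (2, 0), (2, 1), (2.05, 2), (2, 3)]
print(rdp(outline, epsilon=0.2))
```

The larger the epsilon, the fewer vertices survive: small wiggles along the mask boundary are removed while corners farther than epsilon from the simplified outline are kept.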
As already mentioned, several expert metrics are output by Detectron2 all along the learning process, concerning the training dataset and, optionally, the validation dataset too. As these metrics can be confusing for business experts (cf. this article for further details on such metrics), the STDL decided to try making things as simple as possible by turning these detections into georeferenced polygons. Not only such polygons can then be visualized by GIS tools like QGIS or ArcGIS Pro, but also spatial intersections can be computed with ground truth labels in order to tag detections according to the following classes:
+The results are evaluated by matching the detections and the ground truth labels, i.e. by finding detections overlapping with ground truth labels. To be considered a match, the intersection over union (IoU) between the detection polygon and the label polygon must be greater than a threshold set by the user, with a default value of 0.25. In addition, if several detections and ground truth labels intersect, only the pair with the largest IoU is considered to be a valid match.
+Intersection over union between a label and a detection is defined as:
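Writing \(L\) for the label polygon and \(D\) for the detection polygon (symbols introduced here for illustration only), the standard definition is the ratio between the area of their intersection and the area of their union:

```latex
\[
\mathrm{IoU}(L, D) = \frac{\mathrm{area}(L \cap D)}{\mathrm{area}(L \cup D)}
\]
```

The IoU equals 1 when the two polygons coincide and 0 when they do not overlap at all.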
The reader may wonder why there are no True Negatives (TN) in the list. Actually, all the image pixels which are rightly associated to none of the target classes can be considered as "True Negatives". Yet, as far as object detection and instance segmentation are concerned, we do not need to group unclassified pixels into some sort of "dummy objects". Should the user need to model such a scenario, one idea might consist in introducing a dummy class (e.g. "background"), to which all the (ir)relevant pixels would be associated.
-Counting the number of TPs, FPs and FNs allows one to compute some rather common metrics:
+The spatial intersection between the vectorized detections and the ground truth labels is computed to tag detections according to the following classification:
The reader may wonder why there are no true negatives (TN) in the list. Actually, all the image pixels which are rightly not associated with the target class can be considered as "true negatives". Yet, as far as object detection and instance segmentation are concerned, we do not need to group leftover pixels into some sort of "dummy objects". Should the user need to model such a scenario, one idea might consist in introducing a dummy class (e.g. "background" or "other"), to which all the (ir)relevant pixels would be associated.
+The metrics are calculated per class to take into account possible imbalances between classes. A detection in the wrong class is counted either as a false negative (FN), i.e. a missed object, or as a false positive (FP), i.e. a detection not matching any object, depending on the class for which the metrics are being calculated.
+Precision and recall by class are used here:
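In terms of the numbers of true positives (TP), false positives (FP) and false negatives (FN) counted for a given class \(c\) (the subscript is a notation introduced here for illustration), the standard definitions read:

```latex
\[
P_c = \frac{TP_c}{TP_c + FP_c},
\qquad
R_c = \frac{TP_c}{TP_c + FN_c}
\]
```

Precision is thus the fraction of detections that match a label, whereas recall is the fraction of labels that are detected.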
+While referring the reader to this page for further information on these metrics, let us note that:
Each metric can be aggregated to keep only one value per dataset, rather than one per class.
+As already mentioned, each detection is assigned a confidence score, ranging from 0 to 1. By filtering out all the detections exhibiting a score smaller than some cut-off/threshold value, one would end up having more or less detections to compare against ground truth data; the higher the threshold, the smaller the number of detections, the better their quality in terms of the confidence score. Sampling the threshold from a minimal (e.g. 0.05) to a maximum value (e.g. 0.95) and counting TPs, FPs, FNs at each sampling step, meaningful curves can be obtained representing counts and/or metrics like precision and recall as a function of the threshold. Typically, precision (recall) is monotonically increasing (decreasing) as a function of the threshold. As such, neither the precision nor the recall can be used to determine the optimal value of the threshold, which is why precision and recall are customarily aggregated in order to form a third metric which can be convex if computed as a function of the threshold or, at least, it can exhibit local minima. This metric is named "\(F_1\) score" and is defined as follows:
+As already mentioned, each detection is assigned a confidence score, ranging from 0 to 1. By filtering out all the detections exhibiting a score smaller than some cut-off/threshold value, one would end up having more or fewer detections to compare against ground truth data; the higher the threshold, the smaller the number of detections and the better their quality in terms of the confidence score. By sampling the threshold from a minimum user-defined value to a maximum value (e.g. 0.95) and counting TPs, FPs and FNs at each sampling step, meaningful curves can be obtained representing counts and/or metrics like precision and recall as a function of the threshold. Typically, precision (recall) is monotonically increasing (decreasing) as a function of the threshold. As such, neither the precision nor the recall can be used to determine the optimal value of the threshold, which is why precision and recall are customarily aggregated in order to form a third metric which, computed as a function of the threshold, can be concave or, at least, exhibit local maxima. This metric is named "\(F_1\) score" and is defined as follows:
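The standard definition is the harmonic mean of precision \(P\) and recall \(R\), i.e. \(F_1 = 2 P R / (P + R)\). The threshold sweep described above can be sketched in Python as follows. This is a simplified illustration, not the STDL's code: the detections are made up, each one is reduced to a (score, matched) pair, every label is assumed to be matched by at most one detection, and the 0.05 starting value and the function names are assumptions.

```python
def count_matches(detections, n_labels, threshold):
    """Count TP, FP, FN among detections whose score reaches the threshold.

    Each detection is a (score, matched) pair, where `matched` tells whether
    the detection was paired with a ground truth label.
    """
    kept = [matched for score, matched in detections if score >= threshold]
    tp = sum(kept)
    fp = len(kept) - tp
    fn = n_labels - tp  # labels left without any kept detection
    return tp, fp, fn


def metrics(tp, fp, fn):
    """Return (precision, recall, F1), using 0 when a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up detections: (confidence score, matched to a label?)
detections = [
    (0.97, True), (0.90, True), (0.88, True), (0.70, False), (0.60, True),
    (0.50, False), (0.40, False), (0.30, True), (0.20, False), (0.10, False),
]
n_labels = 5
thresholds = [round(0.05 + 0.10 * i, 2) for i in range(10)]  # 0.05, 0.15, ..., 0.95

curve = [(t, *metrics(*count_matches(detections, n_labels, t))) for t in thresholds]
best = max(curve, key=lambda row: row[3])  # row = (threshold, precision, recall, F1)
print(best)
```

The threshold retained is the one maximizing the \(F_1\) score, which balances the loss of recall against the gain in precision as the threshold increases.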
Several training sessions can be executed, using different values of the various hyper-parameters involved in the process. As a matter of fact, reviewing and improving ground truth data is also part of the hyper-parameter tuning (cf. "From Model-centric to Data-centric Artificial Intelligence"). Keeping track of the above-mentioned metrics across multiple realizations, eventually an optimal model should be found (at least, a local optimum).
-The exploration of the hyper-parameter space is a tedious task, which consumes time as well as human and computing resources. It can be performed in a more or less systematic/heuristic way, depending on the experience of the operator as well as on the features offered by the code. Typically, a partial exploration is enough to obtain acceptable results. Within the STDL team, it is customary to first perform some iterations until "decent scores" are obtained, then to involve beneficiaries and domain experts in the continuous evaluation and improvement of results, until satisfactory results are obtained. These exchanges between Data Scientists and domain experts are also key to raise both communities' awareness of the virtues and flaws of Machine Learning approaches.
+The exploration of the hyper-parameter space is a tedious task, which consumes time as well as human and computing resources. It can be performed in a more or less systematic/heuristic way, depending on the experience of the operator as well as on the features offered by the code. Typically, a partial exploration is enough to obtain acceptable results. Within the STDL team, it is customary to first perform some iterations until "decent scores" are obtained, then to involve beneficiaries and domain experts in the continuous evaluation and improvement of results, until satisfactory results are obtained. These exchanges between data scientists and domain experts are also key to raising both communities' awareness of the virtues and flaws of machine learning approaches.
Here's a list of the successful applications of the Object Detection Framework described in this article:
+Here is a list of the successful applications of the object detection framework described in this article:
The STDL Object Detection Framework is still under development and receives updates as new use cases are tackled. Its source code will be soon released under Open Source terms: stay tuned!
+The STDL's object detection framework is still under development and receives updates as new use cases are tackled.
@@ -535,10 +552,10 @@