<!DOCTYPE html>
<html>
<head>
<title>Ch. 6 - Object Detection and
Segmentation</title>
<meta name="Ch. 6 - Object Detection and
Segmentation" content="text/html; charset=utf-8;" />
<link rel="canonical" href="http://manipulation.csail.mit.edu/segmentation.html" />
<script src="https://hypothes.is/embed.js" async></script>
<script type="text/javascript" src="chapters.js"></script>
<script type="text/javascript" src="htmlbook/book.js"></script>
<script src="htmlbook/mathjax-config.js" defer></script>
<script type="text/javascript" id="MathJax-script" defer
src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js">
</script>
<script>window.MathJax || document.write('<script type="text/javascript" src="htmlbook/MathJax/es5/tex-chtml.js" defer><\/script>')</script>
<link rel="stylesheet" href="htmlbook/highlight/styles/default.css">
<script src="htmlbook/highlight/highlight.pack.js"></script> <!-- http://highlightjs.readthedocs.io/en/latest/css-classes-reference.html#language-names-and-aliases -->
<script>hljs.initHighlightingOnLoad();</script>
<link rel="stylesheet" type="text/css" href="htmlbook/book.css" />
</head>
<body onload="loadChapter('manipulation');">
<div data-type="titlepage" pdf="no">
<header>
<h1><a href="index.html" style="text-decoration:none;">Robotic Manipulation</a></h1>
<p data-type="subtitle">Perception, Planning, and Control</p>
<p style="font-size: 18px;"><a href="http://people.csail.mit.edu/russt/">Russ Tedrake</a></p>
<p style="font-size: 14px; text-align: right;">
© Russ Tedrake, 2020-2022<br/>
Last modified <span id="last_modified"></span>.<br/>
<script>
var d = new Date(document.lastModified);
document.getElementById("last_modified").innerHTML = d.getFullYear() + "-" + (d.getMonth()+1) + "-" + d.getDate();</script>
<a href="misc.html">How to cite these notes, use annotations, and give feedback.</a><br/>
</p>
</header>
</div>
<p pdf="no"><b>Note:</b> These are working notes used for <a
href="http://manipulation.csail.mit.edu/Fall2022/">a course being taught
at MIT</a>. They will be updated throughout the Fall 2022 semester. <!-- <a
href="https://www.youtube.com/channel/UChfUOAhz7ynELF-s_1LPpWg">Lecture videos are available on YouTube</a>.--></p>
<table style="width:100%;" pdf="no"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=clutter.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=deep_perception.html>Next Chapter</a></td>
</tr></table>
<script type="text/javascript">document.write(notebook_header('segmentation'))
</script>
<!-- EVERYTHING ABOVE THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<chapter style="counter-reset: chapter 5"><h1>Object Detection and
Segmentation</h1>
<p>Our study of geometric perception gave us good tools for estimating the
pose of a known object. These algorithms can produce highly accurate
estimates, but are still subject to local minima. When the scenes get more
cluttered/complicated, or if we are dealing with many different object types,
they really don't offer an adequate solution by themselves.</p>
<p>Deep learning has given us data-driven solutions that complement our
geometric approaches beautifully. Finding correlations in massive datasets
has proven to be a fantastic way to provide practical solutions to these more
"global" problems like detecting whether the mustard bottle is even in the
scene, segmenting out the portion of the image / point cloud that is relevant
to the object, and even providing a rough estimate of the pose that can
be refined with a geometric method.</p>
<p>There are many sources of information about deep learning on the internet,
and I have no ambition of replicating or replacing them here. But this
chapter does begin our exploration of deep perception in manipulation, and I
feel that I need to give just a little context.</p>
<section><h1>Getting to big data</h1>
<subsection><h1>Crowd-sourced annotation datasets</h1>
<p>The modern revolution in computer vision was unquestionably fueled by
the availability of massive annotated datasets. The most famous of all is
ImageNet, which eclipsed previous datasets with the number of images and
the accuracy and usefulness of the labels<elib>Russakovsky15</elib>.
Fei-Fei Li, who led the creation of ImageNet, has been giving talks that
provide some nice historical perspective on how ImageNet came to be.
<a href="https://youtu.be/0bb9EcaQupI">Here is one</a> (slightly) tailored
to robotics and even manipulation; you might start <a
href="https://youtu.be/0bb9EcaQupI?t=857">here</a>. </p>
<p><elib>Russakovsky15</elib> describes the annotations available in
ImageNet: <blockquote>... annotations fall into one of two categories: (1)
image-level annotation of a binary label for the presence or absence of an
object class in the image, e.g., "there are cars in this image" but "there
are no tigers," and (2) object-level annotation of a tight bounding box
and class label around an object instance in the image, e.g., "there is a
screwdriver centered at position (20,25) with width of 50 pixels and
height of 30 pixels".</blockquote></p>
<figure><img width="80%"
src="data/coco_instance_segmentation.jpeg"/><figcaption>A sample annotated
image from the COCO dataset, illustrating the difference between
image-level annotations, object-level annotations, and segmentations at
the class/semantic or instance level.</figcaption></figure>
<p>The COCO dataset similarly enabled pixel-wise <i>instance-level</i>
segmentation <elib>Lin14a</elib>, where distinct instances of a class are
given a unique label (and also associated with the class label). COCO has
fewer object categories than ImageNet, but more instances per category.
It's still shocking to me that they were able to get some 2.5 million object
instances labeled at the pixel level. I remember some of the early projects at MIT
when crowd-sourced image labeling was just beginning (projects like
LabelMe <elib>Russell08</elib>); Antonio Torralba used to joke about how
surprised he was about the accuracy of the (nearly) pixel-wise annotations
that he was able to crowd-source (and that his mother was a particularly
prolific and accurate labeler)!</p>
<p>Instance segmentation turns out to be an very good match for the
perception needs we have in manipulation. In the last chapter we found
ourselves with a bin full of YCB objects. If we want to pick out only the
mustard bottles, and pick them out one at a time, then we can use a deep
network to perform an initial instance-level segmentation of the scene,
and then use our grasping strategies on only the segmented point cloud.
Or if we do need to estimate the pose of an object (e.g. in order to place
it in a desired pose), then segmenting the point cloud can also
dramatically improve the chances of success with our geometric pose
estimation algorithms.</p>
</subsection>
<subsection><h1>Segmenting new classes via fine tuning</h1>
<p>The ImageNet and COCO datasets contain labels for <a
href="https://tech.amikelive.com/node-718/what-object-categories-labels-are-in-coco-dataset/">a
variety of interesting classes</a>, including cows, elephants, bears,
zebras and giraffes. They have a few classes that are more relevant to
manipulation (e.g., plates, forks, knives, and spoons), but they don't
have a mustard bottle nor a can of potted meat like we have in the YCB
dataset. So what are we to do? Must we build similar image annotation
tools and pay people to label thousands of images for us?</p>
<p>One of the most amazing and magical properties of the deep
architectures that have been working so well for instance-level
segmentation is their ability to transfer to new tasks ("transfer
learning"). A network that was pre-trained on a large dataset like
ImageNet or COCO can be fine-tuned with a much smaller amount
of labeled data for a new set of classes that are relevant to our
particular application. In fact, the architectures are often described
as having a "backbone" and a "head" -- to handle a new set of
classes, it is often possible to just pop off the existing head and
replace it with a new head for the new labels. A relatively small amount
of training on a relatively small dataset can still achieve
surprisingly robust performance. Moreover, it seems that training
initially on a diverse dataset (ImageNet or COCO) is actually important
for learning the robust perceptual representations that work for a broad
class of perception tasks. Incredible!</p>
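<p>To make the backbone/head idea concrete, here is a minimal sketch (not
taken from the notebooks associated with these notes) of what "popping off
the head" looks like for a simple <code>torchvision</code> classification
network; the same pattern carries over to the detection and segmentation
architectures we'll use below.</p>
<pre><code class="language-python">
import torch
import torchvision

# A minimal sketch of "pop off the head": keep the pre-trained backbone,
# and replace only the final layer with one sized for our new label set.
num_new_classes = 10  # e.g. the number of object classes we care about

model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False   # (optionally) freeze the backbone

model.fc = torch.nn.Linear(model.fc.in_features, num_new_classes)

# Only the new head's parameters need to be optimized during fine-tuning.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
</code></pre>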
<p>This is great news! But we still need some amount of labeled data for
our objects of interest. The last few years have seen a number of
start-ups based purely on the business model of helping you get your
dataset labeled. But thankfully, this isn't our only option.
</p>
</subsection>
<subsection><h1>Annotation tools for manipulation</h1>
<p>Just as projects like LabelMe helped to streamline the process of
providing pixel-wise annotations for images downloaded from the web,
there are a number of tools that have been developed to streamline the
annotation process for robotics. One of the earliest examples was
<a href="http://labelfusion.csail.mit.edu/">LabelFusion</a>, which combines geometric perception of point clouds
with a simple user interface to very rapidly label a large number of
images<elib>Marion17</elib>.</p>
<figure><img id="labelfusion" style="width:75%" src="data/labelfusion_multi_object_scenes.png" onmouseenter="this.src = 'https://github.com/RobotLocomotion/LabelFusion/raw/master/docs/fusion_short.gif'" onmouseleave="this.src = 'data/labelfusion_multi_object_scenes.png'"/>
<figcaption>A multi-object scene from <a
href="http://labelfusion.csail.mit.edu/">LabelFusion</a>
<elib>Marion17</elib>. (Mouse over for animation)</figcaption>
</figure>
<p>In LabelFusion, the user provides multiple RGB-D images of a static
scene containing some objects of interest, and the CAD models for those
objects. LabelFusion uses a dense reconstruction algorithm,
ElasticFusion<elib>Whelan16</elib>, to merge the point clouds from the
individual images into a single dense reconstruction; this is just
another instance of the point cloud registration problem. The dense
reconstruction algorithm also localizes the camera relative to the point
cloud. To localize a particular object, like the drill in the image
above, LabelFusion provides a simple GUI that asks the user to click on
three points on the model and three points in the scene to establish the
"global" correspondence, and then runs ICP to refine the pose estimate.
This single registration provides labeled poses in <i>all</i> of the
original images, and the pixels from the CAD model can be "rendered" on
top of each image in the established pose, giving beautiful pixel-wise
labels.</p>
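<p>To give a flavor of that workflow (this is only an illustrative numpy
sketch, not LabelFusion's actual implementation), the three clicked
correspondences are enough to compute an initial rigid transform in closed
form, which then seeds the ICP refinement:</p>
<pre><code class="language-python">
import numpy as np

def initial_pose_from_clicks(p_model, p_scene):
    """Closed-form rigid transform (R, t) mapping three clicked model points
    onto their clicked scene counterparts; both inputs are (3, 3) arrays with
    one point per row.  The result is used to initialize ICP."""
    p_bar = p_model.mean(axis=0)
    q_bar = p_scene.mean(axis=0)
    H = (p_model - p_bar).T @ (p_scene - q_bar)
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections so that R is a proper rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = q_bar - R @ p_bar
    return R, t
</code></pre>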
<p>Tools like LabelFusion can be used to label large numbers of images very
quickly (three clicks from a user produce ground-truth labels in many
images).</p>
</subsection>
<subsection><h1>Synthetic datasets</h1>
<p>All of this real world data is incredibly valuable. But we have
another super powerful tool at our disposal: simulation! Computer vision
researchers have traditionally been very skeptical of training perception
systems on synthetic images, but as game-engine quality physics-based
rendering has become a commodity technology, roboticists have been using
it aggressively to supplement or even replace their real-world datasets.
The annual robotics conferences now feature regular <a
href="https://sim2real.github.io/">workshops and/or debates</a> on the
topic of "sim2real". For any specific scene or narrow class of objects,
we can typically generate accurate enough art assets (with material
properties that are often still tuned by an artist) and environment maps
/ lighting conditions that rendered images can be highly effective in a
training dataset. The bigger question is whether we can generate a
<i>diverse</i> enough set of data with distributions representative of
the real world to train robust feature detectors in the way that we've
managed to train with ImageNet. But for many serious robotics groups,
synthetic data generation pipelines have significantly augmented or even
replaced real-world labeled data.</p>
<p>There is a subtle reason for this. Human annotations on real data,
although they can be quite good, are never perfect. Labeling errors can
put a ceiling on the total performance achievable by the learning
system<elib>Northcutt21</elib>. Even if we admit the gap between rendered
images and natural images, at some point the ability to generate
arbitrarily large datasets with perfect pixel-wise labels allows training
on synthetic data to surpass training on real data, even when evaluated on
real-world test sets.
</p>
<p>For the purposes of this chapter, I aim to train an instance-level
segmentation system that will work well on our simulated images. For
this use case, there is (almost) no debate! Leveraging the pre-trained
backbone from COCO, I will use only synthetic data for fine tuning.</p>
<p>You may have noticed it already, but the <code>RgbdSensor</code> that we've been using in Drake actually has a "label image" output port that we haven't used yet.</p>
<div>
<script type="text/javascript">
var rgdbsensor = { "name": "RgbdSensor",
"input_ports" : ["geometry_query"],
"output_ports" : ["color_image", "depth_image_32f", "depth_image_16u", "<b>label_image</b>", "X_WB"]
};
document.write(system_html(rgdbsensor, "https://drake.mit.edu/doxygen_cxx/classdrake_1_1systems_1_1sensors_1_1_rgbd_sensor.html"));
</script>
</div>
<p>This output port exists precisely to support the perception training
use case we have here. It outputs an image that is identical to the RGB
image, except that every pixel is "colored" with a unique instance-level
identifier.</p>
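<p>Pulling the labels out in code is straightforward. Here is a small sketch
(the variable names are placeholders; the sensor and context come from
whatever diagram you've built):</p>
<pre><code class="language-python">
import numpy as np

# Assumes `sensor` is an RgbdSensor already wired into a diagram, and
# `context` is the corresponding root context; both are placeholders here.
sensor_context = sensor.GetMyContextFromRoot(context)
label_image = sensor.GetOutputPort("label_image").Eval(sensor_context)

labels = np.squeeze(np.array(label_image.data))   # (height, width) integer ids
print("instance ids in view:", np.unique(labels))
</code></pre>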
<figure><img style="width:100%" src="data/clutter_maskrcnn_labels.png"/>
<figcaption>Pixelwise instance segmentation labels provided by the "label
image" output port from <code>RgbdSensor</code>. I've remapped the colors
to be more visually distinct.</figcaption> </figure>
<example><h1>Generating training data for instance segmentation</h1>
<p>I've provided a simple script that runs our "clutter generator" from
our bin picking example that drops random YCB objects into the bin.
After a short simulation, I render the RGB image and the label image,
and save them (along with some metadata with the instance and class
identifiers) to disk.</p>
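<p>The save step is nothing fancy; it looks something like this (a sketch
only; the actual <code>segmentation_data.py</code> script differs in its
details and filenames):</p>
<pre><code class="language-python">
import json
import numpy as np
from PIL import Image

def save_example(index, rgb, labels, instance_id_to_class, path="clutter_data"):
    """Write one training example to disk: the RGB image, the raw integer
    label image, and a metadata file mapping instance ids to class names."""
    Image.fromarray(rgb).save(f"{path}/{index:05d}.png")
    np.save(f"{path}/{index:05d}_mask.npy", labels)   # keep integer ids exact
    with open(f"{path}/{index:05d}.json", "w") as f:
        json.dump(instance_id_to_class, f)
</code></pre>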
<p>I've verified that this code <i>can</i> run on Colab, but generating a
dataset of 10k images with this unoptimized process takes about an hour on
my big development desktop, and curating the files is just easier if you
run it locally. So I've provided this one as a Python script instead.</p>
<pysrc no_colab=True>segmentation_data.py</pysrc>
<todo>Provide a colab version?</todo>
<p>You can also feel free to skip this step! I've uploaded the 10k
images that I generated <a
href="https://groups.csail.mit.edu/locomotion/clutter_maskrcnn_data.zip">here</a>.
We'll download them directly in our training notebook.</p>
</example>
</subsection>
</section>
<section><h1>Object detection and segmentation</h1>
<p>There is a lot to know about modern object detection and segmentation
pipelines. I'll stick to the very basics.</p>
<!-- nice overview slides at
http://cs231n.stanford.edu/slides/2020/lecture_12.pdf . Videos at
https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv&index=1
-->
<p>For image recognition (see Figure 1), one can imagine training a standard
convolutional network that takes the entire image as an input, and outputs a
probability of the image containing a sheep, a dog, etc. In fact, these
architectures can even work well for semantic segmentation, where the input
is an image and the output is another image; a famous architecture for this
is the Fully Convolutional Network (FCN) <elib>Long15</elib>. But for
object detection and instance segmentation, even the number of outputs of
the network can change. How do we train a network to output a variable
number of detections?</p>
<p>The mainstream approach to this is to first break the input image up into
many (let's say on the order of 1000) overlapping regions that might
represent interesting sub-images. Then we can run our favorite image
recognition and/or segmentation network on each sub-image individually, and
output a detection for each region that is scored as having a high
probability. In order to output a tight bounding box, the detection
networks are also trained to output a "bounding box refinement" that selects
a subset of the final region for the bounding box. Originally, these region
proposals were generated with more traditional image preprocessing algorithms,
as in R-CNN (Regions with CNN Features)<elib>Girshick14</elib>. But the "Fast"
and "Faster" versions of R-CNN replaced even this preprocessing with
learned "region proposal networks"<elib>Girshick15+Ren15</elib>.</p>
<p>For instance segmentation, we will use the very popular Mask R-CNN
network, which puts all of these ideas together, using region proposal
networks and fully convolutional networks for both the object detection and
the masks <elib>He17</elib>. In Mask R-CNN, the masks are evaluated in parallel from
the object detections, and only the masks corresponding to the most likely
detections are actually returned. At the time of this writing, the latest
and most performant implementation of Mask R-CNN is available in the <a
href="https://github.com/facebookresearch/detectron2">Detectron2</a>
project from Facebook AI Research. But that version is not quite as
user-friendly and clean as the original version that was released in the
PyTorch <a
href="https://pytorch.org/vision/stable/index.html"><code>torchvision</code></a>
package; we'll stick to the <code>torchvision</code> version for our
experiments here.</p>
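<p>Even before any fine-tuning, it's worth seeing what the
<code>torchvision</code> interface looks like. Here is a minimal inference
sketch; the input is just random noise, purely to show the input/output
format (newer <code>torchvision</code> versions prefer a <code>weights=</code>
argument over <code>pretrained=True</code>):</p>
<pre><code class="language-python">
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)            # an RGB tensor in [0, 1], CHW
with torch.no_grad():
    (prediction,) = model([image])         # list of images in, list of dicts out

print(prediction["boxes"].shape)           # (N, 4) bounding boxes
print(prediction["labels"].shape)          # (N,) COCO class ids
print(prediction["scores"].shape)          # (N,) confidences
print(prediction["masks"].shape)           # (N, 1, H, W) soft instance masks
</code></pre>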
<example><h1>Fine-tuning Mask R-CNN for bin picking</h1>
<p>The following notebook loads our 10k image dataset and a Mask R-CNN
network pre-trained on the COCO dataset. It then replaces the head of the
pre-trained network with a new head with the right number of outputs for
our YCB recognition task, and then runs just 10 epochs of training with
my new dataset.</p>
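<p>The head-replacement step follows the standard <code>torchvision</code>
fine-tuning recipe. Roughly (this is a sketch, and the number of classes
here is just an example, not necessarily what the notebook uses):</p>
<pre><code class="language-python">
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 1 + 10   # background + (for example) ten YCB classes

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# Replace the box-classification head.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask-prediction head.
in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
</code></pre>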
<p><a target="segmentation_train"
href="https://colab.research.google.com/github/RussTedrake/manipulation/blob/master/segmentation_train.ipynb"><img
src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open
In Colab"/></a> (Training Notebook)
</p>
<p>Training a network this big (it will take about 150MB on disk) is not
fast. I strongly recommend hitting play on the cell immediately after the
training cell while you are watching it train so that the weights are
saved and downloaded even if your Colab session closes. But when you're
done, you should have a shiny new network for instance segmentation of the
YCB objects in the bin!</p>
<p>I've provided a second notebook that you can use to load and evaluate
the trained model. If you don't want to wait for your own to train, you
can examine the one that I've trained!</p>
<p><a target="segmentation_inference"
href="https://colab.research.google.com/github/RussTedrake/manipulation/blob/master/segmentation_inference.ipynb"><img
src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open
In Colab"/></a> (Inference Notebook)
</p>
<figure><img style="width:50%" src="data/segmentation_detections.png"><img
style="width:50%" src="data/segmentation_mask.png"><figcaption>Outputs
from the Mask R-CNN inference. (Left) Object detections. (Right) One of
the instance masks.</figcaption></figure>
</example>
</section>
<section><h1>Putting it all together</h1>
<p>We can use our Mask R-CNN inference in a manipulation pipeline to do
selective picking from the bin...</p>
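<p>The first step, turning a predicted instance mask into a segmented point
cloud, is just masked back-projection through the camera intrinsics. A
sketch (in numpy, with the depth image, mask, and intrinsics assumed to come
from the rest of the pipeline):</p>
<pre><code class="language-python">
import numpy as np

def masked_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project the depth pixels selected by a boolean instance mask
    into an (N, 3) point cloud in the camera frame.  `depth` is an (H, W)
    depth image in meters; fx, fy, cx, cy are the pinhole intrinsics."""
    valid = np.logical_and(mask, np.isfinite(depth))
    valid = np.logical_and(valid, depth > 0)
    v, u = np.nonzero(valid)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
</code></pre>
<p>The resulting points can then be fed to the antipodal grasp selection or
ICP routines from the previous chapters.</p>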
<!-- Segmented RGB => segmented point clouds -->
<!-- Segmentation + ICP; Segmentation => antipodal grasps -->
</section>
<section><h1>Variations and Extensions</h1>
<subsection><h1>Pretraining with self-supervised learning</h1>
<!-- autoencoders -->
<!-- simclr https://arxiv.org/abs/2002.05709 -->
<!-- byob -->
<!-- simsiam -->
<!-- tcc https://arxiv.org/abs/1904.07846 -->
<!-- mention dense descriptors? -->
</subsection>
<subsection><h1>Leveraging large-scale models</h1>
<p>One of the goals for these notes is to consider "open-world"
manipulation -- making a manipulation pipeline that can perform useful
tasks in previously unseen environments and with unseen models. How can
we possibly provide labeled instances of every object the robot will ever
have to manipulate?</p>
<p>The most dramatic examples of open-world reasoning have been coming
from the so-called "foundation models"<elib>Bommasani21</elib>. The
foundation model that has been adopted most quickly into robotics
research is the large vision + text model,
<a href="https://openai.com/blog/clip/">CLIP</a>
<elib>Radford21</elib>.</p>
<p>More coming soon...</p>
<!-- GPT-3 and DALL-E https://openai.com/blog/dall-e/ -->
<!-- CLIP; https://arxiv.org/pdf/2104.13921.pdf -->
<!-- kevin zakka's prototype of CLIP zero-shot object detector:
https://drakedevelopers.slack.com/archives/C013AP7F5M5/p1627446717032400
-->
</subsection>
</section>
<section><h1>Exercises</h1>
<exercise><h1>Label Generation</h1>
<p> For this exercise, you will look into a simple trick to automatically generate training data for Mask R-CNN. You will work exclusively in <script>document.write(notebook_link('label_generation', deepnote['exercises/segmentation'], link_text='this notebook'))</script>. You will be asked to complete the following steps: </p>
<ol type="a">
<li> Automatically generate mask labels from pre-processed point clouds.
</li>
<li> Analyze the applicability of the method for more complex scenes.</li>
<li> Apply data augmentation techniques to generate more training data.</li>
</ol>
</exercise>
<exercise><h1>Segmentation + Antipodal Grasping</h1>
<p> For this exercise, you will use Mask R-CNN and our previously
developed antipodal grasp strategy to select a grasp given a point cloud.
You will work exclusively in <script>document.write(notebook_link('segmentation_and_grasp', deepnote['exercises/segmentation'], link_text='this notebook'))</script>. You will be asked to complete the
following steps: </p>
<ol type="a">
<li> Automatically filter the point cloud for points that correspond to
our intended grasped object.</li>
<li> Analyze the impact of a multi-camera setup.</li>
<li> Consider why filtering the point clouds is a useful step in this
grasping pipeline.</li>
<li> Discuss how we could improve this grasping pipeline. </li>
</ol>
</exercise>
</section>
</chapter>
<!-- EVERYTHING BELOW THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<div id="references"><section><h1>References</h1>
<ol>
<li id=Russakovsky15>
<span class="author">Olga Russakovsky and Jia Deng and Hao Su and Jonathan Krause and Sanjeev Satheesh and Sean Ma and Zhiheng Huang and Andrej Karpathy and Aditya Khosla and Michael Bernstein and others</span>,
<span class="title">"Imagenet large scale visual recognition challenge"</span>,
<span class="publisher">International journal of computer vision</span>, vol. 115, no. 3, pp. 211--252, <span class="year">2015</span>.
</li><br>
<li id=Lin14a>
<span class="author">Tsung-Yi Lin and Michael Maire and Serge Belongie and James Hays and Pietro Perona and Deva Ramanan and Piotr Doll{\'a}r and C Lawrence Zitnick</span>,
<span class="title">"Microsoft coco: Common objects in context"</span>,
<span class="publisher">European conference on computer vision</span> , pp. 740--755, <span class="year">2014</span>.
</li><br>
<li id=Russell08>
<span class="author">Bryan C Russell and Antonio Torralba and Kevin P Murphy and William T Freeman</span>,
<span class="title">"LabelMe: a database and web-based tool for image annotation"</span>,
<span class="publisher">International journal of computer vision</span>, vol. 77, no. 1-3, pp. 157--173, <span class="year">2008</span>.
</li><br>
<li id=Marion17>
<span class="author">Pat Marion and Peter R. Florence and Lucas Manuelli and Russ Tedrake</span>,
<span class="title">"A Pipeline for Generating Ground Truth Labels for Real {RGBD} Data of Cluttered Scenes"</span>,
<span class="publisher">International Conference on Robotics and Automation (ICRA), Brisbane, Australia</span>, May, <span class="year">2018</span>.
[ <a href="https://arxiv.org/abs/1707.04796">link</a> ]
</li><br>
<li id=Whelan16>
<span class="author">Thomas Whelan and Renato F Salas-Moreno and Ben Glocker and Andrew J Davison and Stefan Leutenegger</span>,
<span class="title">"{ElasticFusion}: Real-time dense {SLAM} and light source estimation"</span>,
<span class="publisher">The International Journal of Robotics Research</span>, vol. 35, no. 14, pp. 1697--1716, <span class="year">2016</span>.
</li><br>
<li id=Northcutt21>
<span class="author">Curtis G Northcutt and Anish Athalye and Jonas Mueller</span>,
<span class="title">"Pervasive label errors in test sets destabilize machine learning benchmarks"</span>,
<span class="publisher">arXiv preprint arXiv:2103.14749</span>, <span class="year">2021</span>.
</li><br>
<li id=Long15>
<span class="author">Jonathan Long and Evan Shelhamer and Trevor Darrell</span>,
<span class="title">"Fully convolutional networks for semantic segmentation"</span>,
<span class="publisher">Proceedings of the IEEE conference on computer vision and pattern recognition</span> , pp. 3431--3440, <span class="year">2015</span>.
</li><br>
<li id=Girshick14>
<span class="author">Ross Girshick and Jeff Donahue and Trevor Darrell and Jitendra Malik</span>,
<span class="title">"Rich feature hierarchies for accurate object detection and semantic segmentation"</span>,
<span class="publisher">Proceedings of the IEEE conference on computer vision and pattern recognition</span> , pp. 580--587, <span class="year">2014</span>.
</li><br>
<li id=Girshick15>
<span class="author">Ross Girshick</span>,
<span class="title">"Fast r-cnn"</span>,
<span class="publisher">Proceedings of the IEEE international conference on computer vision</span> , pp. 1440--1448, <span class="year">2015</span>.
</li><br>
<li id=Ren15>
<span class="author">Shaoqing Ren and Kaiming He and Ross Girshick and Jian Sun</span>,
<span class="title">"Faster r-cnn: Towards real-time object detection with region proposal networks"</span>,
<span class="publisher">Advances in neural information processing systems</span> , pp. 91--99, <span class="year">2015</span>.
</li><br>
<li id=He17>
<span class="author">Kaiming He and Georgia Gkioxari and Piotr Doll{\'a}r and Ross Girshick</span>,
<span class="title">"Mask {R-CNN}"</span>,
<span class="publisher">Proceedings of the IEEE international conference on computer vision</span> , pp. 2961--2969, <span class="year">2017</span>.
</li><br>
<li id=Bommasani21>
<span class="author">Rishi Bommasani and Drew A Hudson and Ehsan Adeli and Russ Altman and Simran Arora and Sydney von Arx and Michael S Bernstein and Jeannette Bohg and Antoine Bosselut and Emma Brunskill and others</span>,
<span class="title">"On the opportunities and risks of foundation models"</span>,
<span class="publisher">arXiv preprint arXiv:2108.07258</span>, <span class="year">2021</span>.
</li><br>
<li id=Radford21>
<span class="author">Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and others</span>,
<span class="title">"Learning transferable visual models from natural language supervision"</span>,
<span class="publisher">International Conference on Machine Learning</span> , pp. 8748--8763, <span class="year">2021</span>.
</li><br>
</ol>
</section><p/>
</div>
<table style="width:100%;" pdf="no"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=clutter.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=deep_perception.html>Next Chapter</a></td>
</tr></table>
<div id="footer" pdf="no">
<hr>
<table style="width:100%;">
<tr><td><a href="https://accessibility.mit.edu/">Accessibility</a></td><td style="text-align:right">© Russ
Tedrake, 2022</td></tr>
</table>
</div>
</body>
</html>