\chapter{Approach} \label{approach}
This chapter describes the work towards reconstructing a moving human in 3D. At the outset there was much to be figured out and tried; many questions were not only unanswered but still unasked. The goal, however, was already defined: reconstruct a rigged 3D mesh of the user. The tools available were a Microsoft Kinect for Xbox 360 and a desktop computer.
\section{Obtaining point cloud data}
The first step towards achieving anything with a Kinect is to read its sensor data into the computer. This is not trivial, considering that the Kinect was originally meant only as an Xbox 360 accessory. Each of the three drivers mentioned in section \ref{literature.drivers} was installed and tested by writing simple software using it.
Skeleton tracking using the Microsoft SDK was tested by compiling a sample application and making modifications to it. The API proved good and usable, and the skeleton tracking works quite well. Fast movement and occlusion caused problems, such as the skeleton briefly jumping into implausible poses.
OpenNI was tested using the SimpleOpenNI \citep{simpleopenni} wrapper for Processing \citep{processing}. The included examples were plentiful and diverse, and allowed us to quickly try out our own ideas (probably in part due to our familiarity with the Processing environment).
Of the three drivers, libfreenect was the easiest to start working with: after installing the OpenKinect plug-in \citep{shiffman2010} for Processing \citep{processing}, the Kinect was up and running in minutes. Development was easy given the bundled examples, which include creating point clouds as shown by \citet{fisher2010}. However, the functionality is quite low-level, and thus not very suitable for capturing human body details.
Given the limited resources allocated to this research project, skeleton tracking was in practice a requirement for the sensor software. The choice was thus between the completely proprietary, single-platform Microsoft SDK with its highly restrictive license and the partially proprietary, multi-platform OpenNI/NITE combination with unclear licensing. In preliminary testing, the differences in skeleton tracking accuracy were minor.
OpenNI was finally chosen to be used for data acquisition in this work. The Microsoft SDK would probably have been similarly useful, but was not chosen because OpenNI allows for multiple platforms and is more open.
\section{Experiments with point clouds} \label{pcexperiments}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{pcd-plain.png}
\caption{A point cloud constructed from a single frame of Kinect data. The cloud is captured and saved in the PCD (Point Cloud Data) format that PCL uses \citep{pcdspec}. The viewer used is part of PCL.}
\label{fig:pcd-plain}
\end{figure}
As we receive data from the sensor, we need a representation for it. Physically, the Kinect has two cameras: one RGB and one IR. The data received from the sensor is an 8-bit Bayer-filtered image and an 11-bit depth image computed onboard from the IR data; both have a resolution of 640x480 and a framerate of 30 fps. It is possible to request different resolutions or the raw IR image, but these are the most sensible choices. The drivers interpolate the Bayer image to get a 640x480 RGB image.
Converting the data to a point cloud allows more sophisticated processing than using the plain images. A point cloud represents the actual real-world geometry of the observation, allowing the point locations to be interpreted as 3D vectors. Computation using vectors tends to be intuitive, making it relatively easy to devise an algorithm for a given task. Some tasks can be done directly on the image data, but such algorithms are mostly unintuitive and difficult to discover. On the other hand, image-based algorithms tend to be computationally very efficient compared to algorithms operating on 3D point sets.
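The conversion itself follows the pinhole camera model: each depth pixel is back-projected into camera coordinates using the intrinsic parameters of the depth camera. The sketch below illustrates this; the intrinsic values are nominal assumptions, and in practice the drivers perform this conversion (including the mapping from raw depth values to metric depth) internally.
\begin{verbatim}
// Nominal Kinect depth-camera intrinsics (assumed values; accurate
// ones would come from calibration).
const float fx = 585.0f, fy = 585.0f;  // focal lengths in pixels
const float cx = 319.5f, cy = 239.5f;  // principal point

struct Point3 { float x, y, z; };

// Back-project depth pixel (u, v) with metric depth z (in meters)
// to a 3D point in the camera coordinate frame.
Point3 depthToPoint(int u, int v, float z)
{
    Point3 p;
    p.z = z;
    p.x = (u - cx) * z / fx;
    p.y = (v - cy) * z / fy;
    return p;
}
\end{verbatim}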
The point cloud representation allows geometrical operations, one practical application of which is to render the data from another viewpoint, as seen in figure~\ref{fig:pcd-plain}. This kind of 3D rendering with a freely moving camera was used to get acquainted with the Kinect data. Looking at the point cloud data from the Kinect in real time helps one notice the kinds of noise inherent to it. The most evident types of errors in the data are:
%
\begin{description}
\item[Jitter in depth measurements.] The depth of a single pixel in a static scene may seemingly randomly change between two or three values. Possible explanations for this are that the actual distance is between two possible discrete values, or that the point is almost at an edge and could be interpolated to be on either side. Very small differences in the observation then tip the balance.
\item[Unaligned edges between the RGB and depth image.] Points in the foreground have the color of the background and vice versa. This happens near the edges, especially when an object is close to the sensor.
\item[Missing points.] Because the depth is measured from disparities in a known laser pattern, the measurement cannot be made if a laser point cannot be reliably identified. This can be caused by occlusion, intense light that drowns out the laser pattern, materials that absorb the IR wavelength used by the laser, or areas so white that the IR sensor is saturated. As a result, the depth is set to a special value meaning it is unknown, and the point is not included in the point cloud.
\item[Systematic errors.] Planes tend to be slightly curved---this means the conversion from the depth image to world coordinates is inaccurate. This might be possible to overcome by careful calibration, but the error is small enough not to matter very much.
\item[Biased normals.] Round objects seem flatter around the edges than they actually are. No measurements can be made where the surface is tangential to the camera, and if there is a laser point on a nearly tangential surface, its intensity is too low to be measured. Because depth values for most pixels are in fact interpolated from nearby measurements, and measurements can only be made on the relatively flat, camera-facing parts, objects appear flatter than they are. In other words, continuous depth variations in a strip of neighboring pixels are moderate at most.
\end{description}
Further experimentation with point clouds was done by compiling PCL and trying its examples. We wrote simple C++ code of our own utilizing PCL and OpenNI to grab a frame from the Kinect and save it as a point cloud in the PCL file format. Point clouds captured this way could then be used as input for applications included in the PCL distribution. We examined the convex and concave hulls of point clouds to see if they might be useful. We tested fast mesh generation, which allows viewing an approximate surface of the Kinect view in real time, albeit with a lot of noise. We also compiled Kinfu and experimented with it, even making some changes of our own to it---more on that in sections \ref{approach.autorig} and \ref{approach.parallel}. One bug in the PCL IO module was also found and fixed while writing our own code\footnote{See \url{http://dev.pointclouds.org/issues/812}}.
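A minimal sketch of such a grab-and-save program is shown below. It follows the structure of the PCL OpenNI grabber examples of that time; the exact callback signature, point type and file name depend on the PCL version, so it should be read as an illustration rather than as our exact code.
\begin{verbatim}
#include <pcl/io/openni_grabber.h>
#include <pcl/io/pcd_io.h>
#include <pcl/point_types.h>
#include <boost/function.hpp>
#include <boost/bind.hpp>
#include <unistd.h>

static bool saved = false;

// Called by the grabber for every new frame; save the first one.
void cloudCallback(const pcl::PointCloud<pcl::PointXYZRGBA>::ConstPtr& cloud)
{
    if (!saved) {
        pcl::io::savePCDFileASCII("frame.pcd", *cloud);
        saved = true;
    }
}

int main()
{
    pcl::OpenNIGrabber grabber;
    boost::function<void (const pcl::PointCloud<pcl::PointXYZRGBA>::ConstPtr&)> f =
        boost::bind(&cloudCallback, _1);
    grabber.registerCallback(f);
    grabber.start();
    while (!saved)
        usleep(100 * 1000);   // wait until one frame has been written
    grabber.stop();
    return 0;
}
\end{verbatim}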
% TODO: more \bs{important} things done with PCL
\section{Body part segmentation} \label{approach.segmentation}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{pcd-segmented.png}
\caption{The same point cloud as in \ref{fig:pcd-plain}, with the user detached from the background and further segmented into individual body parts.}
\label{fig:pcd-segmented}
\end{figure}
Before any attempts at human body reconstruction can be made, the body needs to be detected and segmented from the RGB-D image. The NITE middleware for OpenNI \citep{NITE} includes functions for person detection, so we did not put effort into looking for alternative approaches. Moreover, NITE has the capability to generate a skeleton representation of human users.
We considered different possible approaches to body reconstruction, and our ideas mostly required treating different body parts, such as the limbs, as individual (though connected) entities. As NITE provides a skeleton of the user, utilizing it was the easiest way to recognize the different body parts.
The body parts used were chosen according to what could be constructed from the joints provided in the NITE skeleton. Not all the joints are really joints in the human body, but they are treated as such in the simplified skeleton used by NITE. The joints that can be used are head, neck, torso, shoulders, elbows, hands, hips, knees and feet. NITE also has joints that exist in the API (Application Programming Interface) but whose positions are never updated: waist, collars, wrists, fingertips and ankles.
Using these joints, we connected pairs of them into bones and constructed the following body parts, into which we segment the point clouds:
%
\begin{itemize}
\item head and neck
\item shoulders (left and right)
\item lower chest
\item abdomen (left and right)
\item upper arm (left and right)
\item forearm and hand (left and right)
\item thigh (left and right)
\item leg\footnote{Specifically, we use \term{leg} to mean the part of the lower limb between the knee and the ankle. This is the meaning commonly used in anatomical context.} and foot (left and right)
\end{itemize}
%
Figure~\ref{fig:pcd-segmented} shows a body segmented according to these parts.
Body part segmentation as described was implemented in Processing using SimpleOpenNI. The method used is simple: for each point, compute the distance to all bones and choose the nearest one. This is not optimal, but could be improved later on. A possible optimization would be to only compute the distance to all bones for every $n$th point, and only consider the two closest bones for neighboring points. Another one would be to project the bones to the image coordinates, and only compute 3D distances where the segmentation is not obvious in 2D.
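The core of this nearest-bone assignment can be sketched as follows (standalone C++ for illustration only; the actual implementation was written in Processing).
\begin{verbatim}
#include <cfloat>
#include <vector>

struct Vec3 { float x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// A bone is the line segment between two joints.
struct Bone { Vec3 a, b; };

// Squared distance from point p to the bone segment.
float distSqToBone(Vec3 p, const Bone& bone)
{
    Vec3 ab = sub(bone.b, bone.a);
    Vec3 ap = sub(p, bone.a);
    float t = dot(ap, ab) / dot(ab, ab);
    if (t < 0.0f) t = 0.0f;   // clamp to the segment
    if (t > 1.0f) t = 1.0f;
    Vec3 closest = { bone.a.x + t*ab.x, bone.a.y + t*ab.y, bone.a.z + t*ab.z };
    Vec3 d = sub(p, closest);
    return dot(d, d);
}

// Return the index of the bone nearest to point p.
int nearestBone(Vec3 p, const std::vector<Bone>& bones)
{
    int best = -1;
    float bestDist = FLT_MAX;
    for (size_t i = 0; i < bones.size(); ++i) {
        float d = distSqToBone(p, bones[i]);
        if (d < bestDist) { bestDist = d; best = (int)i; }
    }
    return best;
}
\end{verbatim}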
After seeing the segmentation in action, we decided further study of the point clouds and skeleton positions frame by frame would be useful. To this end, we created a software module for writing the PCD file format used by PCL in Processing based on its specification \citep{pcdspec}. This allowed making recordings of the point clouds that also contain body part index and RGB color as additional fields, something the file format has built-in support for. Furthermore, the PCD viewer included in the PCL distribution allows visualizing this---the colored body parts in figure~\ref{fig:pcd-segmented} were achieved with no modifications to the viewer application.
We also wanted the option to experiment with the original data: the skeleton joint positions. As the PCD file format supports comments (beginning with a \#), we found it simplest to invent a header format that is interpreted as a comment by existing applications but can be read by software of our own. The skeleton data is required to be at the beginning of the file, before the PCD header and data. Saving more than one skeleton is supported, and each skeleton definition begins by stating the user:
%
\code{
\# user 1
}
%
Each joint is then listed along with its position, rotation and the confidence values NITE has assigned to them. The rotation is saved as a quaternion for compactness. If no rotation is assigned, as is the case for the head, hand and foot joints, the identity rotation with confidence 0.0 is saved. Each joint is thus defined by one line that looks like
%
\code{
\# right\_arm $p_x$ $p_y$ $p_z$ $c(p)$ $q_x$ $q_y$ $q_z$ $q_w$ $c(q)$
}
%
where $p$ is position, $c$ is the confidence of a value and $q$ is the rotation quaternion. After the skeleton definitions comes the standard PCD header.
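As an illustration, a joint line in this format could be read as follows (a C++ sketch with hypothetical type names; the importers we actually used are described below).
\begin{verbatim}
#include <sstream>
#include <string>

struct Joint {
    std::string name;
    float px, py, pz, pconf;      // position and its confidence
    float qx, qy, qz, qw, qconf;  // rotation quaternion and its confidence
};

// Parse a line of the form
//   # right_arm px py pz c(p) qx qy qz qw c(q)
// Returns false for lines that are not joint definitions.
bool parseJointLine(const std::string& line, Joint& out)
{
    std::istringstream in(line);
    std::string hash;
    if (!(in >> hash) || hash != "#")
        return false;
    if (!(in >> out.name
             >> out.px >> out.py >> out.pz >> out.pconf
             >> out.qx >> out.qy >> out.qz >> out.qw >> out.qconf))
        return false;
    return true;
}
\end{verbatim}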
For experimentation, we created functions in MATLAB and Python (NumPy) for importing the plain-text PCD file format\footnote{There are actually two different data formats supported by the PCD specification: the plain-text ASCII format where each point is on its own line, and a binary format where the data is a direct memory dump of the point cloud data structure. PCL includes a utility to convert between these formats, so we found it unnecessary to support the binary format in our own tools.} with our own skeleton extension. Some experimentation and plotting of the data was done in both, but in general PCL seemed to be a better platform for working with point clouds than either MATLAB or NumPy.
\section{Point cloud alignment} \label{approach.alignment}
The naïve approach to modeling uses data only from a single frame. However, this is unsatisfactory, as in practice less than half of a human can be seen at once---when the front side is visible, the back is not. Another consideration is that the data tends to be noisy and inaccurate. Accumulating data over time makes it possible to remove some of the noise and improve accuracy.
To allow combining data from multiple frames, the observations (point clouds) need to be aligned in one way or another. Methods introduced in \autoref{literature.alignment} were available, each with its own trade-offs.
We considered the different methods and tested the implementations available. Kinfu already uses ICP, so its source code was studied and alterations were made to see their effects. The different registration algorithms included in PCL were studied at the level of the API documentation. It was concluded that registration is not a simple task, and that the ICP implementations used by KinectFusion and Kinfu are highly optimized GPGPU code that is very difficult to match performance-wise. This discouraged us from further investigating algorithms that only have CPU implementations.
We could, of course, have looked at making our own GPGPU implementations of registration algorithms, but such a task would have required a major effort with no guarantee of real benefit. This would have deviated too far from the focus of our research, so we had to be content with existing registration algorithms that run on the GPU.
The CUDA implementations of EM-ICP and Softassign published by \citet{tamaki2010softassign} were downloaded. Their compilation posed slight problems as they used OpenGL Utility Library (GLU) functions that were deprecated and then removed years ago. After some modifications to the code and build system, the implementations were successfully compiled. The algorithms seem to be more robust than the standard ICP, but they are significantly slower. Quick experimentation showed that these registration methods would not be suitable for real-time operation.
\subsection{Point cloud accumulation} \label{approach.accumulation}
\begin{figure}
\centering
\begin{minipage}{0.49\textwidth}
\includegraphics[width=\textwidth]{klaus1}
\end{minipage}
\begin{minipage}{0.49\textwidth}
\includegraphics[width=\textwidth]{klaus2}
\end{minipage}
\caption{The accumulated point cloud of a user, shown in two different poses.}
\label{fig:klaus}
\end{figure}
Another approach was suggested by Klaus Förger from our research group, led by Professor Tapio Takala. The idea is to gather point sets for different versions of each body part, depending on the joint angles and the orientation relative to the camera. Instead of trying to align point clouds, we would gather a lot of data and then use statistical methods such as averaging to suppress outliers.
This approach refines the assumption that each body part is a rigid object by capturing separate data for different poses. This might allow for more accurate reconstruction, given sophisticated methods for modeling the body surface from the data. On the other hand, the amount of data is larger and finding a suitable method for creating the rigged mesh is more difficult. Ultimately, this approach would require further experimentation on how different surface reconstruction approaches work with the data.
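One way to organize such an accumulation is to key the stored point sets by a quantized pose, for example the joint angle rounded to a fixed bin size. The sketch below illustrates the bookkeeping involved; the bin size and data layout are assumptions for illustration, not those of the actual prototype described next.
\begin{verbatim}
#include <map>
#include <vector>

struct Vec3 { float x, y, z; };

// Accumulated points for one body part, indexed by a quantized pose key.
class PartAccumulator {
public:
    // Quantize the joint angle (in degrees) into bins of binSize degrees.
    explicit PartAccumulator(float binSize = 15.0f) : binSize_(binSize) {}

    void add(float jointAngleDeg, const std::vector<Vec3>& points)
    {
        int key = static_cast<int>(jointAngleDeg / binSize_);
        std::vector<Vec3>& bucket = buckets_[key];
        bucket.insert(bucket.end(), points.begin(), points.end());
    }

    // Points gathered so far for the bin matching the given angle.
    const std::vector<Vec3>& pointsFor(float jointAngleDeg)
    {
        return buckets_[static_cast<int>(jointAngleDeg / binSize_)];
    }

private:
    float binSize_;
    std::map<int, std::vector<Vec3> > buckets_;
};
\end{verbatim}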
A prototype of the accumulation code was made by Klaus Förger---figure~\ref{fig:klaus} contains a screenshot. The implementation includes rendering of the posed skeleton, along with the point clouds gathered with the current joint angles and orientations. Heuristics are used to omit unreliable observations, such as those captured during fast movement. The RGB and depth images from the Kinect tend to be slightly out of sync, causing errors proportional to the movement speed in the 3D coordinates of the points. This is most apparent at the edges between foreground and background.
\section{Creating a rigged mesh} \label{approach.mesh}
The major objective of our research is to generate a mesh of the user and rig it so that it can be animated. All the other work we have done has been directed towards this goal---transforming the data so that creating a rigged mesh is possible. Finding the surface mesh and rigging it can be considered two separate tasks, or they can be done simultaneously. The choice of approach affects both tasks.
We tried several approaches to this problem, and in this section we describe each approach taken.
\subsection{Geometric surfaces} \label{approach.cylinders}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{cylinder2}
\caption{A screenshot of our application showing the ``cylinder man'', with point cloud display off. The colors are average colors of the user, and the pose is the same as the user has at the moment.}
\label{fig:cylinders}
\end{figure}
% TODO: expand this section a bit
As the very first prototype, a simple cylinder approximation was used. A cylinder was fitted to the point cloud of each body part and colored according to its average color---see figure~\ref{fig:cylinders} for a screenshot. This was good for testing the segmentation and getting a general idea of how accurate the skeleton is. Some corner cases were also found using this prototype. For example, when there is a lot of occlusion, the NITE skeleton may begin to glitch. The body part segmentation also makes errors at times when bones are very near each other.
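A simple way to perform such a fit is to use the bone as the cylinder axis and take the mean distance of the segmented points from the bone as the radius, as sketched below (an illustration of the idea, not the exact code of the prototype).
\begin{verbatim}
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };
struct Bone { Vec3 a, b; };

// Distance from a point to the bone segment.
float distToBone(Vec3 p, const Bone& bone)
{
    float abx = bone.b.x - bone.a.x, aby = bone.b.y - bone.a.y, abz = bone.b.z - bone.a.z;
    float apx = p.x - bone.a.x, apy = p.y - bone.a.y, apz = p.z - bone.a.z;
    float t = (apx*abx + apy*aby + apz*abz) / (abx*abx + aby*aby + abz*abz);
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    float dx = apx - t*abx, dy = apy - t*aby, dz = apz - t*abz;
    return std::sqrt(dx*dx + dy*dy + dz*dz);
}

// Cylinder radius for a body part: mean distance of its points from the bone.
float fitCylinderRadius(const std::vector<Vec3>& partPoints, const Bone& bone)
{
    if (partPoints.empty())
        return 0.0f;
    float sum = 0.0f;
    for (size_t i = 0; i < partPoints.size(); ++i)
        sum += distToBone(partPoints[i], bone);
    return sum / partPoints.size();
}
\end{verbatim}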
Obviously, a more human-like model was needed. The ``cylinder man'' could be usable as an avatar, and is certainly recognizable as humanoid, but its shape is nowhere near that of a real human body. Different simple shapes such as ellipsoids were considered, but this would still have left the problem of connecting the body parts. A solution based on meta-blobs might have been feasible, but in the end this approach started to seem quite cumbersome, and the accuracy would still have left much to be desired.
\subsection{Kinfu and automatic rigging} \label{approach.autorig}
\begin{figure}
\centering
\begin{minipage}{0.49\textwidth}
\includegraphics[width=\textwidth]{pinocchio-walking1}
\end{minipage}
\begin{minipage}{0.49\textwidth}
\includegraphics[width=\textwidth]{pinocchio-walking6}
\end{minipage}
\caption{The mesh scanned using Kinfu and rigged using Pinocchio. Pinocchio comes with pre-recorded motions; here the mesh is animated using a circular walking motion. The images are from different moments of the animation.}
\label{fig:hannu-front}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{pinocchio-back}
\caption{The back of the mesh shows inconsistencies in the right hand and leg. The automatically rigged skeleton is shown in olive.}
\label{fig:hannu-back}
\end{figure}
\citet{charpentier2011accurate} shows how a rigged mesh model of the user can be made by manually aligning shots from different directions and then using an automatic rigging algorithm. \citeauthor{charpentier2011accurate} claims this approach would become very easy if the KinectFusion system were available.
As the Kinfu\footnote{Kinfu is an open source implementation of KinectFusion \citep{newcombe2011kinectfusion}. It is included in the Point Cloud Library \citep{PCL}.} implementation was available, this approach was tested in practice. The amount of manual work required turned out to still be surprisingly large.
The software needed for this approach is a system for scanning the mesh, a mesh editor for making the necessary changes to the mesh, and an auto-rigging tool. We used Kinfu, MeshLab and Pinocchio, respectively, for these purposes. \citet{charpentier2011accurate} had also successfully employed MeshLab as the mesh editor and Pinocchio as the auto-rigging software. At the time of writing, Kinfu and Pinocchio needed to be compiled manually. Both have command-line interfaces that take a little getting used to; similarly, MeshLab's GUI (Graphical User Interface) is slightly confusing for a beginner.
The actual scanning requires two persons: one to be scanned and one to scan. The scanned subject needs to remain still while the other person carefully goes around them, slowly moving the Kinect so that as much of the body as possible is seen. After this is done, the user presses a key to extract a mesh from the TSDF. This operation can only be completed if the GPU has at least 1.5 gigabytes of memory; otherwise, it is possible to lower the TSDF resolution in the Kinfu source code and recompile. The extracted mesh is then saved to disk by pressing another key.
The extracted mesh contains the subject and the surrounding area. The subject needs to be detached from the environment for the automatic rigging, which requires manual mesh editing. In our test run, the full mesh was 140 MB in size and comprised some 5 million vertices and 1.5 million faces. Editing this mesh made MeshLab a bit sluggish on a high-end desktop computer\footnote{Intel Core i7-3770, 16GB RAM, SSD and an Nvidia GeForce GTX670.}. For comparison, the same mesh was imported into Blender, which froze for about a minute and then showed the mesh while remaining nearly unusable.
After the mesh is cropped and only the human part remains, the actual editing can begin. Pinocchio needs a mesh that is closed and connected, and the one produced by Kinfu is usually neither. One possible approach to this problem is to uniformly sample a subset of the vertices and then use a mesh generation algorithm on those vertices. The downside is that some accuracy is lost, but on the other hand the regions with holes are nicely filled. Whatever approach is taken, it might take some trial and error to choose good algorithms and parameters. The computation is not instant, but not necessarily slow either, again depending on the exact methods used.
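In our case this editing was done interactively in MeshLab, so the following is only a sketch of how the same resample-and-remesh idea could be expressed in code, using PCL's voxel-grid downsampling and Poisson reconstruction; the class names, parameters and file names are assumptions about the PCL API of the time rather than the procedure we actually used.
\begin{verbatim}
#include <pcl/io/ply_io.h>
#include <pcl/point_types.h>
#include <pcl/common/io.h>
#include <pcl/filters/voxel_grid.h>
#include <pcl/features/normal_3d.h>
#include <pcl/search/kdtree.h>
#include <pcl/surface/poisson.h>

// Downsample the cropped scan and rebuild a closed, connected surface.
int main()
{
    pcl::PointCloud<pcl::PointXYZ>::Ptr vertices(new pcl::PointCloud<pcl::PointXYZ>);
    pcl::io::loadPLYFile("cropped_scan.ply", *vertices);

    // Uniform subsampling of the vertices with a voxel grid (1 cm cells).
    pcl::PointCloud<pcl::PointXYZ>::Ptr sampled(new pcl::PointCloud<pcl::PointXYZ>);
    pcl::VoxelGrid<pcl::PointXYZ> grid;
    grid.setInputCloud(vertices);
    grid.setLeafSize(0.01f, 0.01f, 0.01f);
    grid.filter(*sampled);

    // Estimate normals, which Poisson reconstruction requires.
    pcl::NormalEstimation<pcl::PointXYZ, pcl::Normal> ne;
    pcl::search::KdTree<pcl::PointXYZ>::Ptr tree(new pcl::search::KdTree<pcl::PointXYZ>);
    pcl::PointCloud<pcl::Normal>::Ptr normals(new pcl::PointCloud<pcl::Normal>);
    ne.setInputCloud(sampled);
    ne.setSearchMethod(tree);
    ne.setKSearch(20);
    ne.compute(*normals);

    pcl::PointCloud<pcl::PointNormal>::Ptr cloudWithNormals(new pcl::PointCloud<pcl::PointNormal>);
    pcl::concatenateFields(*sampled, *normals, *cloudWithNormals);

    // Poisson reconstruction produces a watertight mesh, filling holes.
    pcl::Poisson<pcl::PointNormal> poisson;
    poisson.setInputCloud(cloudWithNormals);
    poisson.setDepth(8);
    pcl::PolygonMesh mesh;
    poisson.reconstruct(mesh);

    pcl::io::savePLYFile("closed_mesh.ply", mesh);
    return 0;
}
\end{verbatim}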
With this manual editing of the mesh finished, the next step is to complete the automatic rigging using Pinocchio. In our tests, the first two tries failed because Pinocchio expects the mesh to be oriented in a certain way. This is not difficult to fix, but manual intervention is still needed at this stage.
Finally, Pinocchio successfully rigged the mesh and a walking animation of the user was seen on the screen---quite astonishing! Minor details such as slightly wrong placement of the shoulder joints made the animation a little unnatural, but the result as seen in figure~\ref{fig:hannu-front} still causes amazement. The mesh also has rough surface details, caused by drift in the camera position and lack of loop closure in Kinfu. This is shown in figure~\ref{fig:hannu-back}.
All in all, the amount of manual work required to create a rigged mesh from scratch was about an hour with some earlier experience of the tools. The process could possibly be automated, but implementing such an automatic system is far from trivial. The scanning phase would still necessarily need another person to move the Kinect around, and the time requirements would probably be around a minute for scanning and five minutes for processing. The method is thus feasible, but not as easy as \citet{charpentier2011accurate} suggests. Furthermore, it does not meet our requirement of being able to scan a moving person.
\subsection{Parallel Kinfu for each body part} \label{approach.parallel}
\begin{figure}
\centering
\begin{minipage}{0.49\textwidth}
\includegraphics[width=\textwidth]{limbwise-head-ray.png}
\end{minipage}
\begin{minipage}{0.49\textwidth}
\includegraphics[width=\textwidth]{limbwise-head-depth.png}
\end{minipage}
\caption{Kinfu running on a view with only the head. The left view shows the raytraced TSDF. On the right, the raw depth data of the most recent frame is shown.}
\label{fig:limbwise-head}
\end{figure}
\begin{figure}
\centering
\begin{minipage}{0.49\textwidth}
\includegraphics[width=\textwidth]{limbwise-arm-ray.png}
\end{minipage}
\begin{minipage}{0.49\textwidth}
\includegraphics[width=\textwidth]{limbwise-arm-depth.png}
\end{minipage}
\caption{Kinfu running on a view with only the left arm.}
\label{fig:limbwise-arm}
\end{figure}
One possible approach for mesh reconstruction is to use a voxel grid similar to the one KinectFusion \autocites{newcombe2011kinectfusion}{izadi2011kinectfusion} uses. This allows generating meshes representing arbitrary shapes.
To evaluate the approach, we decided to build a prototype on top of Kinfu. Since the point clouds were already segmented by body part, it was possible to create recordings that only include a single body part. These recordings could then be played back and used as input for Kinfu.
The working hypothesis was that each body part could be treated as a static object, and that Kinfu should do quite well at modeling it. Notably, there is little difference between the camera moving (as is the case in KinectFusion) and the object moving. This suggested that running the KinectFusion algorithm in parallel on each body part could give reasonably good results.
To begin the experiments on how Kinfu works with a view of a single body part, we needed to split one Kinect recording into multiple ones, each including just one specific body part. In the prototype, this was done with an AWK script that parses and splits a PCD file. At first, the created recordings did not work; by trial and error, we learned that Kinfu expects to receive a full frame of data. We then changed the AWK script to replace all excluded points with unmeasured, `missing' points. After this was confirmed to work, the process of splitting the recordings was automated with a Bourne shell script that calls the AWK script.
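The essence of that replacement step is sketched below in C++; the actual prototype did this with an AWK script operating on ASCII PCD files, and the point layout here is an illustrative assumption.
\begin{verbatim}
#include <limits>
#include <vector>

struct Point {
    float x, y, z;
    int part;   // body part index stored as an extra PCD field
};

// Keep the full organized frame structure that Kinfu expects, but mark
// every point outside the selected body part as a missing measurement.
void keepOnlyPart(std::vector<Point>& frame, int selectedPart)
{
    const float nan = std::numeric_limits<float>::quiet_NaN();
    for (size_t i = 0; i < frame.size(); ++i) {
        if (frame[i].part != selectedPart) {
            frame[i].x = nan;
            frame[i].y = nan;
            frame[i].z = nan;
        }
    }
}
\end{verbatim}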
Modifications to Kinfu itself were made as needed. Originally, Kinfu would play a recording only once and then stop, without quitting. This behavior was changed so that Kinfu quits after the recording---otherwise automating Kinfu runs would have been impossible. We then added a command-line option to repeat the recording, making it possible to watch a recording without Kinfu suddenly quitting. To actually get the results of a run, we added a command-line option to extract and save a mesh on exit. Finally, screen capturing was implemented and an option added to save each frame as PNG.
After the necessary changes to Kinfu were completed and working, the experiment was further automated. We created a shell script that takes a recording, splits it by body part and runs Kinfu on each of the resulting recordings. The console output is logged and the resulting mesh saved to disk. Automatically capturing frames of the Kinfu run is also possible. This level of automation made the tests bearable even when the processing was still very slow: we would make a recording, leave the computer to perform the automated runs for a couple of hours, and then see what the results looked like.
Reading the recordings from disk instead of directly from the Kinect turned out to be very slow. We were able to speed up the process significantly by converting the PCD files to the binary format, which allows using them as memory-mapped files. A further minor improvement was achieved by reading the files from a ramdisk instead of the physical drive. Still, the runtimes per frame were similar to those of normal Kinfu, although the amount of data for a single body part was a fraction of the full data. This approach would therefore require intensive optimization to be possible in real time---if even then.
However, the performance is not the greatest issue. In practice, Kinfu doesn't work well at all with isolated limb data. There are multiple reasons for the substandard outcome.
\begin{description}
\item[Few points.] For full-body scanning the whole user must obviously fit in the picture. At the VGA resolution that Kinect uses, this leaves few pixels per limb. To make matters worse, the useful resolution of the Kinect depth image is actually far below VGA. This is due to the fact that the depth is actually measured by triangulation of a known laser pattern consisting of 34749 dots\footnote{The Kinect pattern consists of a 3861-dot subpattern that is repeated 3x3 times \citep{reichinger2011}. Therefore, there are 34749 dots in total.}. So if every dot could be observed and recognized, there would be a depth measurement for about every 9th pixel in the VGA depth image. The rest of the depth values are interpolated. As a gross simplification, it could be said that the depth measurements of Kinect should fit in a 210x160 depth map---but no research has been made about this.
\item[Spatial inaccuracy.] The depth resolution of Kinect is about 2 centimeters at a distance suitable for full-body scanning. This means significant uncertainty in point coordinates. Considering the noise in the depth values as described in section~\ref{pcexperiments} and the aforementioned scarcity of depth measurements, this is troublesome.
\item[Lack of geometric features.] ICP only uses surface features for alignment, not color data. As the individual body parts tend to be quite smooth, this is a problem. There are differences between body parts---the head has the most geometric features, and its reconstruction works almost satisfactorily if the person keeps their neck still, as seen in figure~\ref{fig:limbwise-head}. Other body parts are more problematic: the ICP algorithm has very few details to match, and rotation cannot be reliably identified. The result is a surface that is difficult to relate either to human body parts or to the depth image (see figure~\ref{fig:limbwise-arm}).
\item[Movement in body parts.] This approach is based on the assumption that each body part is a static, rigid object. Realistically, the body part comprising the forearm and hand, for example, is difficult to keep rigid even when one tries. Using a rigid registration algorithm such as ICP on such a task is prone to failure.
\end{description}
These problems led us to consider this approach unsalvageable for now. The idea is good, but the implementation would significantly benefit from a higher-resolution, more accurate sensor and a more suitable registration algorithm than ICP.
\newtopic
Due to the work already done on Kinfu, it was easy to make minor changes for recording trajectories. This proved useful in other research, which yielded one conference paper \citep*{tykkalavisapp} and one journal article \citep*{tykkalavcir}. Both have been accepted for publication in 2013.
\subsection{Parametric human body model} \label{approach.makehuman}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{makehuman-measurement.png}
\caption{MakeHuman with the Measurement plugin active. Waist circumference is selected. The white line shows where the circumference is measured.}
\label{fig:makehuman-measurement}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{mhtargets.png}
\caption{The reimplementation of the MakeHuman target system, showing the unmodified base mesh.}
\label{fig:mhtargets}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{mhtargets-fat.png}
\caption{The model in \ref{fig:mhtargets} with changed front chest, bust, underbust, waist and hip measurements.}
\label{fig:mhtargets-fat}
\end{figure}
Another approach is to use the information readily available about the body parts being modeled. Instead of allowing arbitrary shapes, the space of possible shapes can be limited to what the body parts tend to look like.
Our ``cylinder man'' approach described in section~\ref{approach.cylinders} is partly based on the same idea: limbs look (slightly) like cylinders, so using cylinders to represent them is a workable approximation. However, the shortcomings of modeling the human body as geometric shapes were found to be too great to ignore. The best parametric model for the purpose would then be one that already assumes the generic human shape while allowing modifications to individual body parts. The SCAPE model \citep{anguelov2005scape} used by \citet{weiss2011home} fits the description, but is not suitable for real-time evaluation. A more appropriate approach is taken by MakeHuman \citep{makehuman}, which uses a base mesh and defines parameters that can be used to reshape different parts of the mesh.
The Measurement plug-in in MakeHuman (see figure~\ref{fig:makehuman-measurement}) was especially promising, as it contains a good number of parameters that mostly affect non-overlapping parts of the base mesh. For example, there are targets for arm length and arm circumference.
MakeHuman is still in development and was undergoing some large changes at the time of writing. It is not designed to be used as a library, either, so interfacing with other software seemed nontrivial. After some time spent tinkering with the MakeHuman implementation, it was decided that the modeling approach (the base mesh and the targets) was the important part, while the software itself could be replaced.
The \term{target} system in MakeHuman (see section \ref{literature.makehuman}) was implemented in Processing. Our version reads the same target definition files as MakeHuman itself. The application is displayed in figure~\ref{fig:mhtargets} showing the plain base mesh and in figure~\ref{fig:mhtargets-fat} with some visible modifications to target weights.
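In essence, a target is a sparse set of per-vertex offsets, and the deformed mesh is the base mesh plus the weighted sum of the active targets. The sketch below shows this morphing step; the data structures are illustrative and do not mirror our Processing code.
\begin{verbatim}
#include <map>
#include <utility>
#include <vector>

struct Vec3 { float x, y, z; };

// A target: sparse per-vertex offsets read from a target definition file.
typedef std::map<int, Vec3> Target;

// Apply weighted targets to a copy of the base mesh vertices.
std::vector<Vec3> applyTargets(const std::vector<Vec3>& base,
                               const std::vector<std::pair<Target, float> >& weighted)
{
    std::vector<Vec3> result = base;
    for (size_t t = 0; t < weighted.size(); ++t) {
        const Target& target = weighted[t].first;
        float w = weighted[t].second;
        for (Target::const_iterator it = target.begin(); it != target.end(); ++it) {
            result[it->first].x += w * it->second.x;
            result[it->first].y += w * it->second.y;
            result[it->first].z += w * it->second.z;
        }
    }
    return result;
}
\end{verbatim}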
The base mesh was also modified to include only the parts that should be drawn, while keeping the mesh indices the same. The original base mesh contains octahedra representing some kind of control points and, for some reason, a dress that mostly covers the body.
\newtopic
Finding the correct target weights was still a problem, and different approaches were considered. The most naïve idea was to project the input point cloud onto each body part and optimize the error between the point cloud and the mesh surface. This would require little knowledge of the effect of the targets, as different weights would simply be tried until good ones were found. By making good choices, such as optimizing length before circumference, this could be doable in real time if the code were optimized.
Still, the idea of black-box optimization with expensive error computation on each iteration left questions about the implementation, and was not easy to begin working on. A better idea would be to make some measurements on the point cloud of a body part, and then make similar measurements on the mesh for each iteration of the optimization. This would significantly lessen the computational cost.
Simple measurements on body part--specific point clouds were implemented. Initially we chose to measure the bone length and the average distance of the points from the bone. These values are saved on each frame and can be exported in bulk.
We took the values into a spreadsheet and tried to find ways to calculate MakeHuman measurements from them. Most relations between the two are straightforward, though with slight differences that should be correctable with a linear function. For example, MakeHuman has a neck length, while the NITE skeleton puts the head ``joint'' inside the head, near the top. As the measurement is defined differently, the correct value might be 22 cm in the NITE skeleton and 12 cm in MakeHuman. More difficult cases are found especially in the torso, where different measurements on the point clouds might be needed.
We devised functions for converting the measurements, after which a satisfactory body model could be obtained by feeding the calculated values to MakeHuman. However, our choice of functions is somewhat arbitrary and should be studied further. We have anecdotal evidence that the current system models two different persons differently and mostly plausibly, although they end up looking more similar to each other than they do in reality.
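Most of these conversions are essentially of a simple affine form; the sketch below shows the idea with placeholder coefficients rather than our actual values.
\begin{verbatim}
// Convert a skeleton-based measurement (in meters) to a MakeHuman
// measurement value using an affine correction. The coefficients are
// illustrative placeholders, not the values chosen from our data.
struct AffineConversion {
    float scale;
    float offset;
    float operator()(float skeletonMeasurement) const
    {
        return scale * skeletonMeasurement + offset;
    }
};

// Example: neck length tends to measure longer from the NITE skeleton
// than in MakeHuman, because the head "joint" lies inside the head.
// AffineConversion neckLength = { 1.0f, -0.10f };  // placeholder values
\end{verbatim}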
At this point our time was limited and no further research was done. The relation between point cloud measurements of humans and suitable MakeHuman parameter values should be surveyed systematically. Once the proper functions are decided, automating the computation and the creation of the MakeHuman model will be easy.
% TODO: should this be said?
% However, considering how the target system works, more deduction can be done beforehand. The magnitude of a target's effect is linearly defined by its weight. Therefore, if a measurement of the mesh is known for two values of a target weight, the weight matching another measurement value can be interpolated. This does require some assumptions: the targets must not overlap, the weights must be in reasonable limits (for example, the mesh must never clip itself) and the measurements must be linear.
\section{Texturing} \label{approach.texturing}
No actual implementation of texturing the mesh was made in this project, but possible approaches were considered. Simple mixing of colors to obtain an overall color for each body part was, of course, tested for the ``cylinder man'' approach and can be seen in figure~\ref{fig:cylinders}. In this section we describe and analyze our ideas for texturing, while acknowledging that none of them were implemented.
A simple way to get some more colorful details in the mesh would be to assign each vertex the color of the nearest point. This would allow visually representing some major details, such as the edge between a t-shirt and skin, or a colorful image on a shirt, or different color of face and hair.
For a better level of detail, an actual texture would be needed. Different approximations can be chosen for projecting the points to the surface (assuming they are not exactly on the surface, which they rarely are). One would be to use the surface normal passing through the point and project the point onto the normal's intersection with the mesh. Similarly, a normal of the bone could be used (where any line starting from an end of the bone at more than a 90 degree angle to the bone is considered suitable as well).
These approaches would work best by first finding the base color, as already implemented, and then adding local colors as appropriate. Each point should make a small ``splat'', and the different colors that coincide should be averaged. The choice of color space for averaging should be made based on practical evaluation. We briefly tested averaging in the HSB (hue, saturation, brightness) and RGB spaces on the ``cylinder man'' and the color mixing results seemed to be more intuitive in RGB space.
A confidence measure could be used, so that points far from the surface have a smaller effect on the final color. In practice, a Gaussian function centered at zero
\begin{align*}
f \colon [0, \infty) & \to \mathbb{R}_+ \\
x & \mapsto a e^{-b x^2}
\quad \mid a, b \in \mathbb{R}_+
\end{align*}
could be used for the weight of a point as a function of its distance to the surface, and also to control the color intensity of the splat (as a function of distance to its center).
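As a sketch, the weight of a splat could then be computed as follows, with $a$ and $b$ as free parameters as in the formula above (illustrative code only, since no texturing was implemented).
\begin{verbatim}
#include <cmath>

// Gaussian weight of a point as a function of its distance to the
// surface: f(x) = a * exp(-b * x^2), with a, b > 0.
float splatWeight(float distance, float a, float b)
{
    return a * std::exp(-b * distance * distance);
}

// Weighted running average of one color channel over splats.
struct ChannelAccumulator {
    float weightedSum = 0.0f;
    float totalWeight = 0.0f;
    void add(float value, float weight)
    {
        weightedSum += value * weight;
        totalWeight += weight;
    }
    float average() const
    {
        return totalWeight > 0.0f ? weightedSum / totalWeight : 0.0f;
    }
};
\end{verbatim}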
Another method of coloring the texture could be to pose the created model to correspond with the silhouettes in the RGB images. The projection would then be made through a camera ray from the image plane to the surface. In practice this would be difficult to implement, as ideally the texturing should be done for previous frames using the current mesh. The texture segments from different frames would then have to be stitched together somehow.
One notable problem with using the RGB data for texturing is the non-uniform lighting conditions. Some pixels are in shadow and too dark, while others are overexposed. This problem is very difficult to avoid, as avoiding it would require knowledge of the lighting and inverse computation of the shading.