I will arrange all of the notes properly once I get a better idea of how it works.
- Deepfake paper
- Pixel shuffling paper: "Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network"
- Useful article on super-resolution
- Umeyama algorithm paper
- Useful article on image transformations (helpful for understanding transformation matrices)
This required three main parts:
- An encoder that encodes an image into a lower-dimensional vector.
- A decoder that reproduces face A from the encoded vector.
- Another decoder that reproduces face B from the encoded vector.
This image from the original paper summarises it better than anything else:
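To make the setup concrete, here is a minimal sketch of the shared-encoder / two-decoder idea. The layer sizes, 64x64 resolution, and activations are my own assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Compresses a face image into a single latent vector."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 64x64 -> 32x32
            nn.LeakyReLU(0.1),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.LeakyReLU(0.1),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs a face image from the latent vector."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 64 * 16 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 16 -> 32
            nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),   # 32 -> 64
            nn.Sigmoid(),
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 64, 16, 16)
        return self.net(x)

encoder = Encoder()
decoder_a = Decoder()  # trained only on faces of identity A
decoder_b = Decoder()  # trained only on faces of identity B

face_a = torch.rand(1, 3, 64, 64)
recon_a = decoder_a(encoder(face_a))  # reconstruction loss pulls this towards face_a
# At swap time, decoder_b(encoder(face_a)) renders identity B with A's pose/expression.
```

The key point is that the encoder is shared between both identities, so it is forced to learn identity-agnostic features (pose, expression, lighting), while each decoder learns to paint one specific face.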
Generally, the super-resolution (SR) operation is performed in high-resolution (HR) space, but this paper proposes doing it in low-resolution (LR) space instead, which requires less computational power.
The SR operation is effectively a one-to-many mapping from LR to HR space which can have multiple solutions. A key assumption underlying many SR techniques is that much of the high-frequency data is redundant and thus can be accurately reconstructed from low-frequency components.
Important abbreviation: PSNR = Peak Signal-to-Noise Ratio. Higher means better quality and lower means worse quality w.r.t. the original image. It is measured in decibels (dB). Here's the wikipedia page
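For reference, a small sketch of how PSNR is typically computed, assuming 8-bit images with a peak value of 255 (the function name and test data below are just for illustration):

```python
import numpy as np

def psnr(original, reconstructed, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(MAX^2 / MSE).
    Higher means the reconstruction is closer to the original image."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

img = np.random.randint(0, 256, (64, 64, 3))
noisy = np.clip(img + np.random.normal(0, 5, img.shape), 0, 255)
print(psnr(img, noisy))  # roughly 34 dB for additive noise with std ~5
```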
- Increasing the resolution of the LR images before the image enhancement step increases the computational complexity.
- Interpolation methods such as bicubic interpolation do not bring additional information to solve the problem.
Contrary to previous works, they increase the resolution from LR to HR only at the very end of the network and super-resolve HR data from LR feature maps. Its advantages are:
- Requires lower computational power.
- Not using an explicit interpolation filter means that the network implicitly learns the processing necessary for SR.
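A rough sketch of this idea, loosely following the ESPCN layout (the exact layer widths and kernel sizes here are assumptions for illustration): every convolution runs on the LR input, and the image is upscaled only in the final pixel-shuffle step.

```python
import torch
import torch.nn as nn

class ESPCN(nn.Module):
    """All feature extraction happens at low resolution; the final
    PixelShuffle rearranges C*r^2 LR feature maps into the HR image."""
    def __init__(self, upscale_factor=3, channels=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=5, padding=2),
            nn.Tanh(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.Tanh(),
            # produce channels * r^2 feature maps, still at LR size ...
            nn.Conv2d(32, channels * upscale_factor ** 2, kernel_size=3, padding=1),
        )
        # ... and only now rearrange them into the HR image: (H, W) -> (rH, rW)
        self.shuffle = nn.PixelShuffle(upscale_factor)

    def forward(self, x):
        return self.shuffle(self.body(x))

lr = torch.rand(1, 1, 32, 32)
hr = ESPCN(upscale_factor=3)(lr)
print(hr.shape)  # torch.Size([1, 1, 96, 96])
```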
In transposed convolutions, upsampling with strides adds zero values to upscale the image, which then have to be filled in later. Maybe even worse, these zero values carry no gradient information that can be backpropagated through.
Sub-pixel convolutional layers, by contrast, use regular convolutional layers followed by a specific type of image reshaping called a phase shift. Instead of putting zeros in between pixels and having to do extra computation, they calculate more convolutions in low resolution and resize the resulting maps into an upscaled image. This way, no meaningless zeros are necessary.
Some parts written above are quoted from this repository.
Phase shift is also called "pixel shuffle"; it rearranges a tensor of size H × W × C·r² into a tensor of size rH × rW × C, as shown below. An implementation of this operation can be found here.
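As a concrete illustration, here is a minimal sketch of the phase shift done by hand with reshape and permute, checked against torch.nn.PixelShuffle (the helper name pixel_shuffle_manual is my own):

```python
import torch

def pixel_shuffle_manual(x, r):
    """Rearrange a (N, C*r^2, H, W) tensor into (N, C, r*H, r*W).
    Pure reshaping: no zeros are ever inserted."""
    n, c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.view(n, c, r, r, h, w)          # split channels into (C, r, r)
    x = x.permute(0, 1, 4, 2, 5, 3)       # -> (N, C, H, r, W, r)
    return x.reshape(n, c, h * r, w * r)  # interleave into the HR grid

r = 2
x = torch.rand(1, 3 * r * r, 4, 4)        # LR feature maps with C*r^2 = 12 channels
manual = pixel_shuffle_manual(x, r)
builtin = torch.nn.PixelShuffle(r)(x)
print(manual.shape)                        # torch.Size([1, 3, 8, 8])
print(torch.allclose(manual, builtin))     # True
```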
If two point patterns are given, what is the set of similarity transformation parameters that gives the least mean squared error between the patterns?
This is exactly what the Umeyama algorithm does: it finds the set of similarity transformation parameters (rotation, translation, scaling) that minimizes the MSE loss between the patterns.
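A small sketch using scikit-image, whose SimilarityTransform estimation is based on Umeyama's method (the square/rotation test data below is made up purely for illustration):

```python
import numpy as np
from skimage.transform import SimilarityTransform

# Source pattern: a unit square.
src = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

# Target pattern: the same square rotated 30 degrees, scaled by 2, and shifted.
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
dst = 2.0 * src @ R.T + np.array([3.0, -1.0])

tform = SimilarityTransform()
tform.estimate(src, dst)        # solve for rotation, scale, translation

aligned = tform(src)            # apply the estimated transform to src
mse_before = np.mean((src - dst) ** 2)
mse_after = np.mean((aligned - dst) ** 2)

print(tform.scale, np.rad2deg(tform.rotation))  # ~2.0 and ~30 degrees
print(mse_before, mse_after)                    # MSE drops to ~0 after alignment
```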
Note
- The transformed pattern has the minimum possible MSE loss w.r.t. the target pattern.
- The transformed pattern is similar to the source pattern.
The two important points to note are:
- The set of similarity transformations applied to the original image is such that the MSE loss between the target image and the distorted image is minimized.
- The target image is similar to the original image (i.e. no distortions).
Now what is a similarity transformation matrix?
It represents a set of operations that can be applied to a matrix A to get another matrix B that is similar to A.
Each value in a transformation matrix in 2D represents the following:
[[size, rotation, location],   ← x-axis
 [rotation, size, location]]   ← y-axis
A default matrix, or one that wouldn't change anything, would look like:
[[1, 0, 0],
 [0, 1, 0]]
Or if we want to alter just the width (half the width along the x axis), it would look like:
[[0.5, 0, 0], #x
[0, 1, 0]] #y
We take only the first two rows of the transformation matrix because the third row, which corresponds to the extra (homogeneous) coordinate, is always [0, 0, 1] for these 2D transforms and carries no additional information.
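A quick sketch of applying such 2x3 matrices with OpenCV's cv2.warpAffine, which expects exactly these first two rows (the random test image is just a placeholder):

```python
import cv2
import numpy as np

img = np.random.randint(0, 256, (100, 200, 3), dtype=np.uint8)  # height 100, width 200

identity = np.float32([[1, 0, 0],
                       [0, 1, 0]])      # leaves the image unchanged

half_width = np.float32([[0.5, 0, 0],
                         [0,   1, 0]])  # squeezes x-coordinates to half

same = cv2.warpAffine(img, identity, (200, 100))       # dsize is (width, height)
squeezed = cv2.warpAffine(img, half_width, (200, 100))

print(same.shape, squeezed.shape)
# 'same' should match the input; 'squeezed' holds the image at half width
# in the left half of the frame, with the right half left empty.
```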
- See what happens by feeding facial landmarks into the encoder-decoder model: implemented without umeyama, need to implement again with umeyama.
- Implement umeyama after resizing to (256, 256): works.
- Figure out why umeyama works: tried my best.
- Integrate albumentations: improved performance.
- Reduce cropping of faces in generate_training_data.py; face_alignment is not able to detect the fake face landmarks.