
deepfakes

Open In Colab

I will arrange all of the notes properly once I get a better idea of how it works.

Important links

  1. Deepfake paper
  2. Pixel shuffling paper
  3. Article on super-resolution
  4. Umeyama algorithm paper
  5. Article on transformations of images
  6. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network, i.e. the pixel shuffling paper
  7. Useful article on super-resolution
  8. Useful article to understand the concept of transformation matrices

Notes from the deepfake paper

The original method

This required three main parts:

  1. An encoder that encodes an image into a lower dimensional vector.
  2. A decoder that reproduces face A from the encoded vector.
  3. Another decoder that reproduces face B from the encoded vector.

This image from the original paper summarises it better than anything else:
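To make the structure concrete, here's a minimal PyTorch sketch of the shared-encoder / two-decoder setup. The layer sizes, depths and activations below are placeholders of my own, not the ones used in the paper or in this repo's code:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Encodes an image into a lower dimensional vector (sizes are illustrative)."""
    def __init__(self, latent_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Rebuilds a face image from the shared encoding."""
    def __init__(self, latent_dim=1024):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 128 * 16 * 16)
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )
    def forward(self, z):
        x = self.fc(z).view(-1, 128, 16, 16)
        return self.net(x)

encoder   = Encoder()
decoder_a = Decoder()   # trained to reconstruct face A from the shared encoding
decoder_b = Decoder()   # trained to reconstruct face B from the shared encoding

face_a = torch.rand(1, 3, 64, 64)
out_a = decoder_a(encoder(face_a))   # reconstruction loss is computed against face A
```

At swap time the trick is to route face A through `decoder_b`, so A's encoding gets decoded as B.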


Notes from the pixel shuffling paper

Intro

Generally, the super-resolution (SR) operation is done in high-resolution (HR) space, but in this paper they propose a new way to do it in low-resolution (LR) space, which in turn requires less computational power.

More about SR operations

The SR operation is effectively a one-to-many mapping from LR to HR space which can have multiple solutions. A key assumption that underlies many SR techniques is that much of the high-frequency data is redundant and thus can be accurately reconstructed from low frequency components.

Important full form: PSNR = Peak Signal-to-Noise Ratio. Higher means better quality and lower means worse quality w.r.t. the original image. It is measured in decibels (dB). Here's the Wikipedia page
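As a quick illustration of the definition, PSNR can be computed directly from the mean squared error between two images. A small sketch, assuming 8-bit images with a peak value of 255:

```python
import numpy as np

def psnr(original, reconstructed, max_value=255.0):
    """Peak Signal-to-Noise Ratio in decibels; higher means closer to the original."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10((max_value ** 2) / mse)

image = np.random.randint(0, 256, (64, 64, 3))
noisy = np.clip(image + np.random.normal(0, 5, image.shape), 0, 255)
print(psnr(image, noisy))   # roughly 34 dB for noise with std ~5
```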

Drawbacks of older approaches

  • Increasing the resolution of the LR images before the image enhancement step increases the computational complexity.
  • Interpolation methods such as bicubic interpolation do not bring additional information to solve the problem.

What's new in their approach?

Contrary to previous works, they increase the resolution from LR to HR only at the very end of the network and super-resolve HR data from LR feature maps. Its advantages are:

  1. Requires lower computational power.
  2. Not using an explicit interpolation filter means that the network implicitly learns the processing necessary for SR.

Transposed convolutions vs. sub-pixel convolutions

In transposed convolutions, upsampling with strides adds zero values to upscale the image, which are to be filled in later on. Maybe even worse, these zero values have no gradient information that can be backpropagated through.

Sub-pixel convolutional layers, on the other hand, use regular convolutional layers followed by a specific type of image reshaping called a phase shift. Instead of putting zeros in between pixels and having to do extra computation, they calculate more convolutions in the lower resolution and resize the resulting map into an upscaled image. This way, no meaningless zeros are necessary.

Some parts written above are quoted from this repository

Phase shift is also called "pixel shuffle"; it reshapes a tensor of size H × W × C · r² into a tensor of size rH × rW × C, as shown below. An implementation of this operation can be found here.
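A small PyTorch sketch of the two upsampling styles (purely illustrative; kernel sizes are arbitrary). The transposed convolution upsamples by striding, while the sub-pixel convolution computes C · r² feature maps in LR space and then rearranges them with nn.PixelShuffle:

```python
import torch
import torch.nn as nn

r = 2                          # upscale factor
x = torch.rand(1, 3, 32, 32)   # a low-resolution image / feature map

# Transposed convolution: upsamples by striding (zeros inserted under the hood).
transpose_up = nn.ConvTranspose2d(3, 3, kernel_size=4, stride=r, padding=1)
print(transpose_up(x).shape)   # torch.Size([1, 3, 64, 64])

# Sub-pixel convolution: convolve in LR space to get C * r**2 channels,
# then "phase shift" / pixel-shuffle them into the HR image.
subpixel_up = nn.Sequential(
    nn.Conv2d(3, 3 * r ** 2, kernel_size=3, padding=1),   # (1, 12, 32, 32)
    nn.PixelShuffle(r),                                    # (1,  3, 64, 64)
)
print(subpixel_up(x).shape)    # torch.Size([1, 3, 64, 64])
```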


Understanding the significance of the Umeyama algorithm

Mr. Shinji Umeyama asked:

If 2 point patterns are given, what is the set of similarity transformation parameters that gives the least mean squared error between the patterns?

And this is exactly what the Umeyama algorithm does: it finds the set of similarity transformation parameters (rotation, translation, scaling) that minimizes the MSE loss between the patterns.

Note

  • The transformed pattern has the minimum possible MSE loss w.r.t the target pattern.
  • The transformed pattern is similar to the source pattern.
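A minimal usage sketch with scikit-image, whose SimilarityTransform.estimate performs this least-squares fit (its documentation cites Umeyama's paper). The landmark coordinates below are made up for illustration:

```python
import numpy as np
from skimage.transform import SimilarityTransform

# Two point patterns, e.g. detected facial landmarks (src) and a reference layout (dst).
src = np.array([[10.0, 10.0], [50.0, 12.0], [30.0, 40.0], [15.0, 55.0]])
dst = np.array([[12.0,  8.0], [52.0, 14.0], [33.0, 42.0], [18.0, 57.0]])

tform = SimilarityTransform()
tform.estimate(src, dst)    # least-squares fit of rotation, scale and translation

print(tform.params)         # 3x3 homogeneous matrix; last row is [0, 0, 1]
print(tform.params[:2])     # the 2x3 part you would hand to cv2.warpAffine
print(tform(src))           # src mapped as close as possible (in MSE) to dst
```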

How does it help here in deepfakes?

We're generating a target image such that its MSE loss w.r.t. the distorted input image is minimized. Thanks to the Umeyama algorithm, we are able to do this without distorting the key visual features.

The two important points to note are:

  • The set of similarity transformations applied to the original image is such that the MSE loss between the target image and the distorted image is minimized.
  • The target image is similar to the original image (i.e. no distortions).

Now what is a similarity transformation matrix?

It represents a set of operations that can be done on a matrix A to get another matrix B that is similar to A. Each value in a 2D transformation matrix represents the following:

[[size, rotation, location],   # x-axis
 [rotation, size, location]]   # y-axis

A default matrix, or one that wouldn't change anything, would look like:

[[1, 0, 0]
 [0, 1, 0]]

Or if we want to alter just the width (half the width along the x axis), it would look like:

[[0.5, 0, 0], #x 
 [0, 1, 0]]   #y

We're taking only the first 2 rows of the 3 × 3 transformation matrix because the third row is the constant homogeneous-coordinate row, which is always [0, 0, 1].
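To make the layout concrete, here's a small illustrative sketch (rotation angle, scale and translation values are arbitrary) that builds the full 3 × 3 similarity matrix and hands only its first two rows to OpenCV's warpAffine:

```python
import numpy as np
import cv2

image = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)   # stand-in for a face crop

theta = np.deg2rad(10)   # small rotation
scale = 0.5              # halve the size
tx, ty = 20.0, 0.0       # shift along x

# Full 3x3 homogeneous similarity matrix; the last row is always [0, 0, 1].
M = np.array([
    [scale * np.cos(theta), -scale * np.sin(theta), tx],
    [scale * np.sin(theta),  scale * np.cos(theta), ty],
    [0.0,                    0.0,                   1.0],
])

# cv2.warpAffine only needs the first two rows (a 2x3 matrix).
warped = cv2.warpAffine(image, M[:2], (256, 256))
print(warped.shape)   # (256, 256, 3)
```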

to-do:

  1. See what happens by feeding facial landmarks into the encoder-decoder model (implemented without umeyama, need to implement again with umeyama)
  2. Implement umeyama after resizing to (256, 256) (works)
  3. Figure out why umeyama works (tried my best)
  4. Integrate albumentations (improved performance)
  5. Reduce cropping of faces in generate_training_data.py, face_alignment is not able to detect the fake face landmarks.
