Conditional GIF/Video Synthesis

Synthesizing GIFs/videos conditioned on text descriptions, action labels, and a start frame by leveraging Discrete Denoising Diffusion Probabilistic Models (D3PM). The implementation combines a Vector Quantized Variational Autoencoder (VQ-VAE) with D3PM and is built on PyTorch Lightning and Hydra.

Architecture
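At a high level, stage 1 compresses each clip into a grid of discrete codebook indices with a VQ-VAE, and stage 2 trains a discrete diffusion model over those index grids, conditioned on the action label / text / start frame, before decoding back to pixels. Below is a shape-level sketch of this flow using toy tensors and hypothetical sizes, not the repository's actual modules:

```python
# Shape-level sketch only: toy tensors with hypothetical sizes, not the
# repository's actual modules or checkpoints.
import torch

B, T, C, H, W = 1, 16, 3, 64, 64             # a batch with one 16-frame RGB clip
video = torch.rand(B, T, C, H, W)

# Stage 1 (VQ-VAE): with downsample [1, 8, 8] the encoder keeps all 16 frames but
# shrinks each frame 8x spatially, so the clip becomes a 16 x 8 x 8 grid of
# discrete indices into a codebook with 4096 entries.
tokens = torch.randint(0, 4096, (B, T, H // 8, W // 8))
print(tokens.shape)                           # torch.Size([1, 16, 8, 8])

# Stage 2 (discrete diffusion): starting from a fully corrupted token grid, the
# model iteratively predicts cleaner grids over `diffusion_step` steps while
# conditioned on e.g. an action label; the final grid is decoded back to pixels
# by the VQ-VAE decoder.
action_label = torch.tensor([7])              # one of the 101 UCF101 action classes
```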

Requirements

Instructions to set up the environment

conda create -n gifsyn python=3.9
conda activate gifsyn
# Clone repository
git clone https://github.com/Developer-Zer0/GIF-synthesis-with-Discrete-Diffusion.git
# Install PyTorch 1.10.0 (**CUDA 11.1**)
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
# Install required packages
pip install -r requirements.txt
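After installation, a quick sanity check (assuming a machine with a working CUDA 11.1 driver) confirms that the CUDA build of PyTorch is being used:

```python
import torch

print(torch.__version__)             # expected: 1.10.0+cu111
print(torch.cuda.is_available())     # should be True on a CUDA-capable machine
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```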

Dataset

All experiments were performed on the UCF101 dataset. The model was also tested on the MSRVTT dataset.

UCF101 Setup

Code to set up this dataset is taken from here. Run the following script to download the compressed data and extract it into train and test splits.

sh create_ucf_dataset.sh YOUR_DATA_PATH
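Once the script finishes, a quick walk over the output directory gives a sanity check that the splits were extracted. The train/test folder names below are assumptions about the script's output layout; adjust them to match what actually lands in YOUR_DATA_PATH:

```python
from pathlib import Path

data_root = Path("YOUR_DATA_PATH")            # same path passed to create_ucf_dataset.sh
for split in ("train", "test"):               # assumed split folder names
    split_dir = data_root / split
    if split_dir.is_dir():
        videos = list(split_dir.rglob("*.avi"))   # UCF101 clips ship as .avi files
        print(f"{split}: {len(videos)} clips")
    else:
        print(f"{split}: directory not found, check the extraction output")
```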

MSRVTT Setup

You can directly download the MSRVTT video files from this Google Drive link.

Pretrained Models

Pretrained models can be downloaded from this Google Drive link. It includes a VQ-VAE checkpoint trained on the UCF101 dataset, an action-label-to-video discrete diffusion model trained on UCF101, and an Inception3D checkpoint for FVD evaluation trained on 224x224, 16-frame videos.
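FVD is the Fréchet distance between Inception3D feature distributions of real and generated videos. A minimal sketch of that final distance computation, assuming the I3D features for both sets have already been extracted into NumPy arrays (the feature-extraction step with the provided Inception3D checkpoint is not shown):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (N, D) feature arrays."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):              # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Example with random stand-in features (real use: I3D features per video)
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(256, 400)), rng.normal(size=(256, 400))))
```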

Training Stage-1 VQ-VAE

The job batch script vqvae.job can be run to start the VQ-VAE training. The default hyperparameters are listed below and can be changed from the script; a minimal sketch of the vector-quantization step these settings control follows the list.

dataset - ucf101.yaml (Can be changed to msrvtt.yaml)
max_epochs - 2000
batch_size - 64
n_codes (codebook size) - 4096
n_res_layers - 3
downsample - [1,8,8]
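To illustrate what n_codes and downsample control, here is a minimal nearest-neighbour vector-quantization step over encoder outputs. The tensor shapes and variable names are illustrative only, not the repository's VQ-VAE implementation:

```python
import torch

n_codes, embed_dim = 4096, 256
codebook = torch.randn(n_codes, embed_dim)    # learned codebook (random stand-in here)

# Encoder output for one clip: downsample [1, 8, 8] keeps all 16 frames and shrinks
# each frame 8x spatially (e.g. 64x64 -> 8x8), giving 16*8*8 latent vectors.
z = torch.randn(16 * 8 * 8, embed_dim)

# Nearest-neighbour lookup: each latent vector is replaced by its closest code.
dists = torch.cdist(z, codebook)              # (1024, 4096) pairwise distances
indices = dists.argmin(dim=1)                 # discrete tokens fed to stage 2
z_q = codebook[indices]                       # quantized latents fed to the decoder
print(indices.shape, z_q.shape)               # torch.Size([1024]) torch.Size([1024, 256])
```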

Training Stage-2 Discrete Diffusion

The job batch script ucf101-ddiff-train.job can be run to start the training. This training requires a pretrained VQ-VAE model; the pretrained VQ-VAE downloaded above can be used, or a separate one can be trained. The path to this VQ-VAE checkpoint has to be entered in configs/model/discrete_diffusion.yaml under autoencoders. The default hyperparameters are listed below and can be changed from the script; remember to use the same VQ-VAE-related hyperparameters as the pretrained checkpoint. Training also runs inference every 10 epochs, synthesizing videos for visualization. These videos and the model checkpoints are stored in logs/train. A sketch of how guidance_scale combines conditional and unconditional predictions follows the list below.

dataset - ucf101.yaml (Can be changed to msrvtt.yaml)
max_epochs - 2000
batch_size - 16
diffusion_step - 100
guidance_scale - 2
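The guidance_scale setting corresponds to classifier-free-style guidance at sampling time: the model's conditional and unconditional predictions over the token vocabulary are combined before sampling each denoising step. A minimal sketch of that combination on logits (the exact formulation in this codebase may differ):

```python
import torch

def guided_logits(cond_logits: torch.Tensor,
                  uncond_logits: torch.Tensor,
                  guidance_scale: float = 2.0) -> torch.Tensor:
    """Push predictions toward the condition: uncond + s * (cond - uncond)."""
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# Toy example: logits over a 4096-token codebook for 1024 token positions.
cond = torch.randn(1024, 4096)        # model output given the action label / text
uncond = torch.randn(1024, 4096)      # model output with the condition dropped
probs = guided_logits(cond, uncond).softmax(dim=-1)
tokens = torch.multinomial(probs, num_samples=1).squeeze(-1)   # one denoising step's sample
print(tokens.shape)                   # torch.Size([1024])
```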

Rendered Videos

Some of the videos generated by our model are given here.
