Synthesizing GIFs/videos from text descriptions, action labels, and a start frame by leveraging Discrete Denoising Diffusion Probabilistic Models (D3PM).
conda create -n gifsyn python=3.9
conda activate gifsyn
# Clone repository
git clone https://github.com/andrewfengusa/TextMotionGenerator.git
# Install PyTorch 1.10.0 (**CUDA 11.1**)
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
# Install required packages
pip install -r requirements.txt
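After installation, a quick optional sanity check confirms that the CUDA 11.1 build of PyTorch is active:
# Verify the install: should print 1.10.0+cu111 and True on a CUDA-capable machine
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"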
All experiments were performed on the UCF101 dataset; the model was also tested on the MSRVTT dataset.
The code to set up this dataset is taken from here. Run the following script to download the compressed data and extract it into train/test splits.
sh create_ucf_dataset.sh YOUR_DATA_PATH
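The exact layout produced by the script may differ, but the extracted data is expected to look roughly like this (illustrative only):
# Expected layout after extraction (illustrative; actual folder names may differ)
# YOUR_DATA_PATH/
#   train/<action_class>/*.avi
#   test/<action_class>/*.avi
ls YOUR_DATA_PATH/train | head   # should list UCF101 action classes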
You can directly download the MSRVTT video files from this Google Drive link.
Pretrained models can be downloaded from this Google Drive link. The download includes a VQ-VAE checkpoint trained on the UCF101 dataset, an action-label-to-video discrete diffusion model trained on UCF101, and an Inception3D checkpoint for FVD evaluation, trained on 224x224, 16-frame videos.
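If you prefer the command line, the checkpoints can also be fetched with gdown; the folder ID below is a placeholder for the Google Drive link above, and checkpoints/ is only a suggested location:
# Download the pretrained checkpoints (FOLDER_ID is a placeholder for the Drive link above)
pip install gdown
mkdir -p checkpoints
gdown --folder https://drive.google.com/drive/folders/FOLDER_ID -O checkpoints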
The job batch script vqvae.job can be run to start the VQ-VAE training. The default hyperparameters are listed below and can be changed in the script; see the launch example after the list.
dataset - ucf101.yaml (Can be changed to msrvtt.yaml)
max_epochs - 2000
batch_size - 64
n_codes (codebook size) - 4096
n_res_layers - 3
downsample - [1,8,8]
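How the job script is launched depends on your environment; a minimal sketch, assuming a SLURM cluster (otherwise run it directly with sh):
# Submit the VQ-VAE training job on a SLURM cluster (assumption)
sbatch vqvae.job
# Or run it directly in the current shell
sh vqvae.job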
The job batch script ucf101-ddiff-train.job can be run to start the training; see the launch sketch after the list below. This training requires a pretrained VQ-VAE model: either the pretrained VQ-VAE downloaded above can be used, or a separate one can be trained. The path to this VQ-VAE checkpoint has to be entered in configs/model/discrete_diffusion.yaml under autoencoders. The default hyperparameters are listed below and can be changed in the script. Remember to use the same VQ-VAE-related hyperparameters as the pretrained checkpoint. The training also runs inference every 10 epochs, synthesizing videos for visualization. These videos and model checkpoints are stored in logs/train.
dataset - ucf101.yaml (Can be changed to msrvtt.yaml)
max_epochs - 2000
batch_size - 16
diffusion_step - 100
guidance_scale - 2
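As with the VQ-VAE stage, a minimal launch sketch; the SLURM submission and the example checkpoint path are assumptions, not taken from the repo:
# 1. Point the autoencoders entry in configs/model/discrete_diffusion.yaml at your VQ-VAE
#    checkpoint, e.g. checkpoints/vqvae_ucf101.ckpt (example filename)
# 2. Submit the discrete diffusion training job (SLURM assumed; otherwise run with sh)
sbatch ucf101-ddiff-train.job
# Synthesized videos and model checkpoints will appear under logs/train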
Some of the videos generated by our model are given here.