This paper combines a parametric model with a generative model to produce synthetic signing videos in which the signer's appearance can be customized in a zero-shot manner from an image or text prompt. The parametric model retargets the signing poses with high fidelity, while a diffusion model controls the appearance of the synthetic signer. This repo provides the implementation of the generative phase. To retarget the signing poses from human signing videos to a 3D avatar, we used a pretrained SMPLify-X model and rendered the 3D mesh into video frames with Blender. Please refer to the paper for further details.
-
Please download the pretrained IP-Adapter model as described on the IP-Adapter project site.
-
Stable Diffusion v1.5 (we obtained better results with v1.5 than with later SD versions when using IP-Adapter).
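For reference, a minimal sketch of how these two pretrained pieces fit together using the HuggingFace diffusers IP-Adapter integration. The Hub IDs, the weight file name, and the adapter scale are assumptions; the scripts in this repo may load the checkpoints differently.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the Stable Diffusion v1.5 base model (Hub ID assumed here).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Attach the downloaded IP-Adapter weights trained for SD v1.5.
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin"
)
pipe.set_ip_adapter_scale(0.6)  # balance between image-prompt and text-prompt influence
```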
-
In a virtual environment (conda or venv), run:
pip install -r requirements.txt
-
The code uses a pretrained IP-Adapter model, so no training from scratch is needed. However, the model can be fine-tuned on a few images with DreamBooth (available in the HuggingFace diffusers package) for personalization.
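If you do fine-tune with DreamBooth, the resulting output directory can be loaded in place of the base SD v1.5 checkpoint. A minimal sketch, assuming a hypothetical output path and the diffusers API:

```python
import torch
from diffusers import StableDiffusionPipeline

# "./dreambooth-signer" is a hypothetical DreamBooth output directory, not part of this repo.
pipe = StableDiffusionPipeline.from_pretrained(
    "./dreambooth-signer", torch_dtype=torch.float16
).to("cuda")

# The pretrained IP-Adapter weights can still be attached on top of the personalized model.
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin"
)
```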
-
In the code, the paths to the source videos and images are hard-coded. Please change them to your own paths before running the scripts.
-
- Generation using an image prompt or a multimodal prompt (image + text); a sketch of the underlying call follows this list:
python gen_diff_signer_image_prompt.py
- Generation using a text prompt only:
python gen_diff_signer_text_prompt.py
- Visual quality (SSIM and FID metrics); a sketch follows this list:
python compute_vis_quality.py
- Directional similarity; a sketch follows this list:
python compute_dir_sim.py
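For reference, a minimal sketch of the multimodal (image + text) prompting that the image-prompt script builds on, using the diffusers IP-Adapter integration. The file names, prompts, and Hub IDs below are placeholders and assumptions, not the repo's actual inputs; the text-prompt-only script presumably drives the same pipeline without the reference image.

```python
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin"
)
pipe.set_ip_adapter_scale(0.6)

# Appearance reference for the target signer (placeholder file name).
signer_ref = load_image("signer_reference.png")

# Multimodal prompt: the image controls the signer's appearance, the text refines it.
frame = pipe(
    prompt="a person signing, studio lighting, plain background",
    negative_prompt="blurry, deformed hands",
    ip_adapter_image=signer_ref,
    num_inference_steps=30,
).images[0]
frame.save("generated_frame.png")
```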
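A minimal sketch of the visual-quality metrics, assuming torchmetrics (with torch-fidelity for the Inception features) is installed; compute_vis_quality.py may use different implementations.

```python
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.fid import FrechetInceptionDistance

# Tiny dummy batches standing in for real and generated frames (N, C, H, W) in [0, 1];
# a meaningful FID needs many frames, this only illustrates the API.
real = torch.rand(16, 3, 256, 256)
fake = torch.rand(16, 3, 256, 256)

# SSIM on float frames.
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
print("SSIM:", ssim(fake, real).item())

# FID expects uint8 images by default.
fid = FrechetInceptionDistance(feature=2048)
fid.update((real * 255).to(torch.uint8), real=True)
fid.update((fake * 255).to(torch.uint8), real=False)
print("FID:", fid.compute().item())
```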
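A minimal sketch of CLIP directional similarity, i.e. the cosine similarity between the change in CLIP image embeddings (source frame vs. generated frame) and the change in CLIP text embeddings (source caption vs. target caption). The CLIP checkpoint, file names, and captions are placeholders; compute_dir_sim.py may define the metric differently.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(path):
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        return F.normalize(model.get_image_features(**inputs), dim=-1)

def embed_text(caption):
    inputs = processor(text=[caption], return_tensors="pt", padding=True)
    with torch.no_grad():
        return F.normalize(model.get_text_features(**inputs), dim=-1)

# Placeholder frames and captions: "source" is the original signer, "target" the prompted appearance.
src_img = embed_image("source_frame.png")
gen_img = embed_image("generated_frame.png")
src_txt = embed_text("a man in a blue shirt signing")
tgt_txt = embed_text("a woman in a red shirt signing")

# Direction of change in image space vs. direction of change in text space.
score = F.cosine_similarity(gen_img - src_img, tgt_txt - src_txt).item()
print("CLIP directional similarity:", score)
```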