[NeurIPS 2023] Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
dataset vision-language audio-language multimodal-foundation-model cross-modality-pretraining vision-audio-subtitle-text
Updated Mar 14, 2024 - Jupyter Notebook