You can find the dataset instructions in DATASET. We provide all the metadata files of our data.
You can find all the models and the scripts in MODEL_ZOO.
We use CLIP pretrained models as the unmasked teachers by default:
- Follow extract.ipynb to extract the visual encoder from CLIP (a minimal sketch is shown below).
- Change `MODEL_PATH` in clip.py.
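The notebook is the reference procedure; the snippet below is only a minimal sketch of the idea, assuming the official `clip` package, with the model name and output path as placeholders.

```python
# Minimal sketch (not the notebook itself): pull the visual tower out of an OpenAI CLIP checkpoint.
# "ViT-B/16" and the output filename are placeholders, adjust them to match your setup.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

model, _ = clip.load("ViT-B/16", device="cpu")       # load the full CLIP model (text + visual)
visual_state = model.visual.state_dict()             # keep only the visual encoder weights
torch.save(visual_state, "clip_visual_encoder.pth")  # this file is what MODEL_PATH should point to
```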
For training, you can simply run the pretraining scripts as follows:
```bash
# masked pretraining
bash ./exp_pt/videomamba_middle_5m/run.sh
# further unmasked pretraining for 1 epoch
bash ./exp_pt/videomamba_middle_5m_unmasked/run.sh
```
Notes:
- Set `data_dir` and `your_data_path` (e.g., `your_webvid_path`) in data.py before running the scripts.
- Set `vision_encoder.pretrained` in the corresponding config files (see the sketch after these notes).
- Set `--rdzv_endpoint` to your `MASTER_NODE:MASTER_PORT` in torchrun.sh.
- `save_latest=True` will automatically save the latest checkpoint during training.
- `auto_resume=True` will automatically load the best or latest checkpoint during training.
- For unmasked pretraining, set `pretrained_path` to load the checkpoint from masked pretraining.
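For orientation only, here is a hypothetical excerpt showing where these settings might live; the key names follow the notes above, but the file layout, dict structure, and all paths are assumptions rather than the repository's actual config.

```python
# Hypothetical config excerpt (e.g. something like exp_pt/videomamba_middle_5m/config.py);
# every path below is a placeholder, and the real config may structure these keys differently.

data_dir = "/path/to/your/datasets"          # root directory set in data.py
your_webvid_path = f"{data_dir}/webvid"      # one of the per-dataset paths referenced in data.py

vision_encoder = dict(
    pretrained="/path/to/clip_visual_encoder.pth",  # extracted CLIP teacher (see extract.ipynb)
)

save_latest = True   # keep the latest checkpoint while training
auto_resume = True   # resume from the best/latest checkpoint automatically
pretrained_path = "/path/to/masked_pretraining_checkpoint.pth"  # only needed for unmasked pretraining

# Multi-node launch is configured in torchrun.sh, e.g. --rdzv_endpoint=MASTER_NODE:MASTER_PORT.
```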
For zero-shot evaluation, you can simply run the evaluation scripts as follows:
```bash
bash ./exp_zs/msrvtt/run.sh
```
Notes:
- Set `pretrained_path` in the running scripts before launching them.
- Set `zero_shot=True` and `evaluate=True` for zero-shot evaluation (a hypothetical excerpt follows below).
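As with pretraining, the exact file where these flags live is not spelled out here; the snippet below is a hypothetical excerpt that only restates the keys named in the notes, with a placeholder checkpoint path.

```python
# Hypothetical evaluation config excerpt (e.g. something like exp_zs/msrvtt/config.py);
# only the keys named in the notes are shown, and the checkpoint path is a placeholder.

pretrained_path = "/path/to/pretrained_checkpoint.pth"  # model to evaluate
zero_shot = True   # zero-shot retrieval, no fine-tuning on the target dataset
evaluate = True    # run evaluation only, skip the training loop
```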