
Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method
[Project Page] [Paper] [Video]

This repository contains the original implementation of the paper Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method, published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022.

We present a novel weakly-supervised methodology based on a reinforcement learning formulation to accelerate instructional videos using text. A novel joint reward function guides our agent to select which frames to remove and reduce the input video to a target length without creating gaps in the final video. We also propose the Extended Visually-guided Document Attention Network (VDAN+), which can generate a highly discriminative embedding space to represent both textual and visual data.

If you find this code useful for your research, please cite the paper:

@ARTICLE{Ramos_2023_TPAMI,
  author={Ramos, Washington and Silva, Michel and Araujo, Edson and Moura, Victor and Oliveira, Keller and Marcolino, Leandro Soriano and Nascimento, Erickson R.},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method}, 
  year={2023},
  volume={45},
  number={2},
  pages={2492-2504},
  doi={10.1109/TPAMI.2022.3157198}}

Usage 💻

Below, we describe different ways to use our code.

PyTorch Hub Model

We provide PyTorch Hub integration.

Loading a pretrained model and fast-forwarding your own video is pretty simple!

import torch

model = torch.hub.load('verlab/TextDrivenVideoAcceleration_TPAMI_2022:main', 'TextDrivenAcceleration', pretrained=True)
model.cuda()
model.eval()

document = ['sentence_1', 'sentence_2', ..., 'sentence_N'] # Document of N sentences that will guide the agent semantically
sf = model.fast_forward_video(video_filename='video_filename.mp4',
                              document=document,
                              desired_speedup=12,
                              output_video_filename='output_filename.avi') # Returns the selected frames

print('Selected Frames: ', sf)
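
The call above already writes the accelerated video to output_filename.avi. If you prefer to post-process the selection yourself, the following is a minimal sketch (our own illustration, using OpenCV, which the snippet above does not require) that writes a new video containing only the frames listed in sf:

import cv2

# Illustrative post-processing: write only the selected frames to a new file.
# Assumes `sf` is the list of selected frame indices returned above.
cap = cv2.VideoCapture('video_filename.mp4')
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter('my_custom_output.avi',  # hypothetical output name
                         cv2.VideoWriter_fourcc(*'XVID'), fps, (width, height))

selected = set(sf)
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx in selected:
        writer.write(frame)
    idx += 1

cap.release()
writer.release()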

Demos

We provide convenient demos in Google Colab:

  • Process a video using our agent [Open In Colab]
  • Train VDAN+ using VaTeX [Open In Colab]
  • Train the agent using YouCook2 [Open In Colab]
  • Extract VDAN+ feats from a video [Open In Colab]

Data & Code Preparation

If you want to download the code and run it yourself in your environment, or reproduce our experiments, please follow the steps below:

  • 1. Make sure you have the requirements

    • Python (>=3.6)
    • PyTorch (==1.10.0) # It may also work with other versions
  • 2. Clone the repo and install the dependencies

    git clone https://github.com/verlab/TextDrivenVideoAcceleration_TPAMI_2022.git
    cd TextDrivenVideoAcceleration_TPAMI_2022
    pip install -r requirements.txt
  • 3. Prepare the data to train VDAN+

    Download & Organize the VaTeX Dataset (Annotations and Videos) + Download the Pretrained GloVe Embeddings

    ## Download VaTeX JSON data
    wget -O semantic_encoding/resources/vatex_training_v1.0.json https://eric-xw.github.io/vatex-website/data/vatex_training_v1.0.json
    wget -O semantic_encoding/resources/vatex_validation_v1.0.json https://eric-xw.github.io/vatex-website/data/vatex_validation_v1.0.json
    
    ## Download the Pretrained GloVe Embeddings
    wget -O semantic_encoding/resources/glove.6B.zip http://nlp.stanford.edu/data/glove.6B.zip
    unzip -j semantic_encoding/resources/glove.6B.zip glove.6B.300d.txt -d semantic_encoding/resources/
    rm semantic_encoding/resources/glove.6B.zip
    
    ## Download VaTeX Videos (We used the kinetics-datasets-downloader tool to download the available videos from YouTube)
    # NOTE: VaTeX is composed of the VALIDATION split of the Kinetics-600 dataset; therefore, you must modify the script to download the validation videos only. 
    # We adapted the function download_test_set in the kinetics-datasets-downloader/downloader/download.py file to do so.
    # 1. Clone repository and copy the modified files
    git clone https://github.com/dancelogue/kinetics-datasets-downloader/ semantic_encoding/resources/VaTeX_downloader_files/kinetics-datasets-downloader/
    cp semantic_encoding/resources/VaTeX_downloader_files/download.py semantic_encoding/resources/VaTeX_downloader_files/kinetics-datasets-downloader/downloader/download.py
    cp semantic_encoding/resources/VaTeX_downloader_files/config.py semantic_encoding/resources/VaTeX_downloader_files/kinetics-datasets-downloader/downloader/lib/config.py
    
    # 2. Get the kinetics dataset annotations
    wget -O semantic_encoding/resources/VaTeX_downloader_files/kinetics600.tar.gz https://storage.googleapis.com/deepmind-media/Datasets/kinetics600.tar.gz
    tar -xf semantic_encoding/resources/VaTeX_downloader_files/kinetics600.tar.gz -C semantic_encoding/resources/VaTeX_downloader_files/
    rm semantic_encoding/resources/VaTeX_downloader_files/kinetics600.tar.gz
    
    # 3. Download the videos (this can take a while, as there are ~28k videos to download; you can stop it at any time and train with the videos downloaded so far)
    python3 semantic_encoding/resources/VaTeX_downloader_files/kinetics-datasets-downloader/downloader/download.py --val
    
    # Troubleshooting: If the download stalls for a long time, try increasing the queue size in the parallel downloader (semantic_encoding/resources/VaTeX_downloader_files/kinetics-datasets-downloader/downloader/lib/parallel_download.py)

    If you just want to train VDAN+, you're now set! (An optional sanity check on the downloaded data is sketched after these steps.)

  • 4. Prepare the data to train the Skip-Aware Fast-Forward Agent (SAFFA)

    Download & Organize the YouCook2 Dataset (Annotations and Videos)

    # Download and extract the annotations
    wget -O rl_fast_forward/resources/YouCook2/youcookii_annotations_trainval.tar.gz http://youcook2.eecs.umich.edu/static/YouCookII/youcookii_annotations_trainval.tar.gz
    tar -xf rl_fast_forward/resources/YouCook2/youcookii_annotations_trainval.tar.gz -C rl_fast_forward/resources/YouCook2/
    rm rl_fast_forward/resources/YouCook2/youcookii_annotations_trainval.tar.gz
    
    # Download the scripts used to collect the videos
    wget -O rl_fast_forward/resources/YouCook2/scripts.tar.gz http://youcook2.eecs.umich.edu/static/YouCookII/scripts.tar.gz
    tar -xf rl_fast_forward/resources/YouCook2/scripts.tar.gz -C rl_fast_forward/resources/YouCook2/
    rm rl_fast_forward/resources/YouCook2/scripts.tar.gz
    
    wget -O rl_fast_forward/resources/YouCook2/splits.tar.gz http://youcook2.eecs.umich.edu/static/YouCookII/splits.tar.gz
    tar -xf rl_fast_forward/resources/YouCook2/splits.tar.gz -C rl_fast_forward/resources/YouCook2/
    rm rl_fast_forward/resources/YouCook2/splits.tar.gz
    
    # Install youtube-dl and download the available videos
    pip install youtube_dl # NOTE: youtube-dl has been slow lately; if your download speed is under 100KiB/s, consider switching to the yt-dlp fork (https://github.com/yt-dlp/yt-dlp)
    cd rl_fast_forward/resources/YouCook2/scripts
    python download_youcookii_videos.py
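
Optionally, before training, you can sanity-check how much of VaTeX you actually downloaded. The snippet below is only an illustration: it assumes each annotation entry carries a videoID field (YouTube ID plus timestamps) and that the clips were saved under datasets/VaTeX/raw_videos/, as configured in semantic_encoding/config.py; adjust the paths and matching to your setup.

import glob
import json
import os

# Illustrative sanity check: how many annotated VaTeX clips are on disk?
with open('semantic_encoding/resources/vatex_training_v1.0.json') as f:
    train_anns = json.load(f)
with open('semantic_encoding/resources/vatex_validation_v1.0.json') as f:
    val_anns = json.load(f)

# Downloaded clip filenames without extension (path assumed from the training config)
downloaded = {os.path.splitext(os.path.basename(p))[0]
              for p in glob.glob('datasets/VaTeX/raw_videos/*')}
# Compare on the 11-character YouTube ID in case filenames and videoIDs differ in suffixes
downloaded_ids = {name[:11] for name in downloaded}

for split, anns in [('training', train_anns), ('validation', val_anns)]:
    ann_ids = {entry['videoID'][:11] for entry in anns}  # 'videoID' assumed as the key name
    found = len(ann_ids & downloaded_ids)
    print(f'{split}: {found}/{len(ann_ids)} annotated clips found on disk')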

Training ⏳

After running the setup above, you're ready to train the networks.

Training VDAN+

To train VDAN+, you first need to set the model and training parameters (the current values are the same as those described in the paper) in the semantic_encoding/config.py file, then run the training script semantic_encoding/train.py. A short sketch of the training loss is given after the steps below.

The training script will save the model in the semantic_encoding/models folder.

  • 1. Setup

    model_params = {
        'num_input_frames': 32,
        'word_embed_size': 300,
        'sent_embed_size': 512,  # h_ij
        'doc_embed_size': 512,  # h_i
        'hidden_feat_size': 512,
        'feat_embed_size': 128,  # d = 128. We also tested with 512 and 1024, but no substantial changes
        'sent_rnn_layers': 1,  # Not used in our paper, but feel free to change
        'word_rnn_layers': 1,  # Not used in our paper, but feel free to change
        'word_att_size': 1024,  # c_p
        'sent_att_size': 1024,  # c_d
    
        'use_sentence_level_attention': True,  # Not used in our paper, but feel free to change
        'use_word_level_attention': True,  # Not used in our paper, but feel free to change
        'use_visual_shortcut': True,  # Uses the R(2+1)D output as the first hidden state (h_0) of the document embedder Bi-GRU.
        'learn_first_hidden_vector': False  # Learns the first hidden state (h_0) of the document embedder Bi-GRU.
    }
    
    ETA_MARGIN = 0.  # η from Equation 1 - (Section 3.1.3 Training)
    
    train_params = {
        # VaTeX
        'captions_train_fname': 'resources/vatex_training_v1.0.json', # Run semantic_encoding/resources/download_resources.sh first to obtain this file
        'captions_val_fname': 'resources/vatex_validation_v1.0.json', # Run semantic_encoding/resources/download_resources.sh first to obtain this file
        'train_data_path': 'datasets/VaTeX/raw_videos/', # Download all Kinetics-600 (10-seconds) validation videos using the semantic_encoding/resources/download_vatex_videos.sh script
        'val_data_path': 'datasets/VaTeX/raw_videos/', # Download all Kinetics-600 (10-seconds) validation videos using the semantic_encoding/resources/download_vatex_videos.sh script
    
        'embeddings_filename': 'resources/glove.6B.300d.txt', # Run semantic_encoding/resources/download_resources.sh first to obtain this file
    
        'max_sents': 20,  # maximum number of sentences per document
        'max_words': 20,  # maximum number of words per sentence
    
        # Training parameters
        'train_batch_size': 64, # We used a batch size of 64 (requires a 24GB GPU)
        'val_batch_size': 64, # We used a batch size of 64 (requires a 24GB GPU)
        'num_epochs': 100, # We trained for 100 epochs
        'learning_rate': 1e-5,
        'model_checkpoint_filename': None,  # Add an already trained model to continue training (Leave it as None to train from scratch)...
    
        # Video transformation parameters
        'resize_size': (128, 171),  # h, w
        'random_crop_size': (112, 112),  # h, w
        'do_random_horizontal_flip': True,  # Randomly flip the whole video horizontally (all frames flipped together)
    
        # Training process
        'optimizer': 'Adam',
        'eta_margin': ETA_MARGIN,
        'criterion': nn.CosineEmbeddingLoss(ETA_MARGIN),
    
        # Machine and user data
        'username': getpass.getuser(),
        'hostname': socket.gethostname(),
    
        # Logging parameters
        'checkpoint_folder': 'models/',
        'log_folder': 'logs/',
    
        # Debugging helpers (speeding things up for debugging)
        'use_random_word_embeddings': False,  # Choose if you want to use random embeddings
        'train_data_proportion': 1.,  # Choose how much data you want to use for training
        'val_data_proportion': 1.,  # Choose how much data you want to use for validation
    }
    
    models_paths = {
        'VDAN': '<PATH/TO/THE/VDAN/MODEL>', # OPTIONAL: Provide the path to the VDAN model (https://github.com/verlab/StraightToThePoint_CVPR_2020/releases/download/v1.0.0/vdan_pretrained_model.pth) from the CVPR paper: https://github.com/verlab/StraightToThePoint_CVPR_2020/
        'VDAN+': '<PATH/TO/THE/VDAN+/MODEL>' # You must fill this path after training the VDAN+ to train the SAFFA agent
    }
    
    deep_feats_base_folder = '<PATH/TO/THE/VDAN+EXTRACTED_FEATS/FOLDER>' # Provide the location you stored/want to store your VDAN+ extracted feature vectors
  • 2. Train

    First, make sure you have the NLTK punkt tokenizer installed...

    import nltk
    nltk.download('punkt')

    Finally, you're ready to go! 😃

    cd semantic_encoding
    python train.py
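
For reference, the criterion configured above is PyTorch's nn.CosineEmbeddingLoss with margin η (ETA_MARGIN): for a matching (video, document) pair (target +1) it minimizes 1 - cosine similarity, while for a non-matching pair (target -1) it penalizes any cosine similarity above η. The minimal sketch below only illustrates how the criterion behaves on random placeholder embeddings of the VDAN+ size (d = 128); it is not the training code itself.

import torch
import torch.nn as nn

ETA_MARGIN = 0.  # η from Equation 1, as in semantic_encoding/config.py
criterion = nn.CosineEmbeddingLoss(margin=ETA_MARGIN)

batch_size, feat_embed_size = 64, 128  # d = 128, as in model_params

# Random placeholders standing in for VDAN+ video and document embeddings
video_embs = torch.randn(batch_size, feat_embed_size)
doc_embs = torch.randn(batch_size, feat_embed_size)

# +1 for aligned (video, document) pairs, -1 for negatives (the split here is arbitrary)
target = torch.ones(batch_size)
target[batch_size // 2:] = -1

loss = criterion(video_embs, doc_embs, target)
print(loss.item())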

Training the Skip-Aware Fast-Forward Agent (SAFFA)

  • To train the agent, you will need the features produced by the VDAN+ model. To download them (video and document features) via the terminal, use:

    # Download YouCook2's VDAN+ video feats
    wget -O rl_fast_forward/resources/YouCook2/VDAN+/youcook2_vdan+_vid_feats.zip https://verlab.dcc.ufmg.br/TextDrivenVideoAcceleration/youcook2_vdan+_vid_feats.zip
    unzip -q rl_fast_forward/resources/YouCook2/VDAN+/youcook2_vdan+_vid_feats.zip -d rl_fast_forward/resources/YouCook2/VDAN+/vid_feats/
    rm rl_fast_forward/resources/YouCook2/VDAN+/youcook2_vdan+_vid_feats.zip
    
    # Download YouCook2's VDAN+ document feats
    wget -O rl_fast_forward/resources/YouCook2/VDAN+/youcook2_vdan+_doc_feats.zip https://verlab.dcc.ufmg.br/TextDrivenVideoAcceleration/youcook2_vdan+_doc_feats.zip
    unzip -q rl_fast_forward/resources/YouCook2/VDAN+/youcook2_vdan+_doc_feats.zip -d rl_fast_forward/resources/YouCook2/VDAN+/doc_feats/
    rm rl_fast_forward/resources/YouCook2/VDAN+/youcook2_vdan+_doc_feats.zip
  • If you want to extract the features yourself, you can obtain a VDAN+ pretrained model by following the instructions in the previous step or by downloading the pretrained one we provide. In the terminal, use:

    # Download the pretrained model
    wget -O semantic_encoding/models/vdan+_model_pretrained.pth https://github.com/verlab/TextDrivenVideoAcceleration_TPAMI_2022/releases/download/pre_release/vdan+_pretrained_model.pth
  • Now, prepare the data for training...

    cd rl_fast_forward
    python resources/create_youcook2_recipe_documents.py
  • You are set! Now, you just need to run it...

    python train.py -s ../semantic_encoding/models/vdan+_model_pretrained.pth -d YouCook2
  • After training, the model will be saved in the rl_fast_forward/models folder.

Inference

  • You can test the agent using a saved model for the YouCook2 dataset as follows:

    python test.py -s ../semantic_encoding/models/vdan+_model_pretrained.pth -m models/saffa_vdan+_model.pth -d YouCook2 -x 12
  • This script will generate a results JSON file with the pattern results/<datetime>_<hostname>_youcookii_selected_frames.json
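
For a quick inspection of that file, here is a minimal sketch; it assumes the JSON maps each video ID to its list of selected frame indices, so adjust the loop if the actual schema differs.

import json

# Replace the placeholder name below with the actual file generated by test.py
with open('results/<datetime>_<hostname>_youcookii_selected_frames.json') as f:
    results = json.load(f)

for video_id, frames in results.items():
    # `frames` is assumed to be the list of selected frame indices for this video
    print(f'{video_id}: {len(frames)} selected frames')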

Evaluating

We provide, in the rl_fast_forward/eval folder, a script to evaluate the selected frames generated by the trained agent.

  • To compute Precision, Recall, F1 Score, and Output Speed for your results using the JSON output (generated when testing the agent), run the following script (a rough sketch of these metrics is given after these steps):

    cd rl_fast_forward/eval
    python eval_results.py -gt youcookii_gts.json -sf /path/to/the/JSON/output/file.json
  • You may need to download the ground-truth file first:

    cd rl_fast_forward/eval
    
    # For the YouCook2 dataset
    wget https://verlab.dcc.ufmg.br/TextDrivenVideoAcceleration/youcookii_gts.json
    
    # For the COIN dataset
    wget https://verlab.dcc.ufmg.br/TextDrivenVideoAcceleration/coin_gts.json
  • It will display the values on your screen and generate JSON and CSV output files formatted as: /path/to/the/JSON/output/file_results.EXT

  • If you want to reproduce our results, we also provide the selected frames for the compared approaches. They can be downloaded by running:

    wget https://verlab.dcc.ufmg.br/TextDrivenVideoAcceleration/results.zip
    unzip -q results.zip
    rm results.zip    
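
For intuition on what eval_results.py reports: Precision, Recall, and F1 Score can be read as overlaps between the selected frames and the ground-truth relevant frames, and the Output Speed is the ratio between the number of input and output frames. The sketch below is an illustrative approximation with made-up numbers, not the official evaluation code.

# Illustrative approximation of the reported metrics (official numbers come from eval_results.py)
def frame_metrics(selected, ground_truth, total_frames):
    selected, ground_truth = set(selected), set(ground_truth)
    tp = len(selected & ground_truth)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    speedup = total_frames / len(selected) if selected else float('inf')
    return precision, recall, f1, speedup

# Hypothetical example: a 1200-frame video, every 12th frame kept, frames 300-599 relevant
print(frame_metrics(range(0, 1200, 12), range(300, 600), 1200))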

Contact

Authors

Washington Ramos, Michel Silva, Edson Araujo, Victor Moura, Keller Oliveira, Leandro Soriano Marcolino, and Erickson R. Nascimento

Institution

Universidade Federal de Minas Gerais (UFMG)
Departamento de Ciência da Computação
Belo Horizonte - Minas Gerais - Brazil

Laboratory

VeRLab: Laboratory of Computer Vision and Robotics
https://www.verlab.dcc.ufmg.br


Acknowledgements

We thank the agencies CAPES, CNPq, FAPEMIG, and Petrobras for funding different parts of this work.

Enjoy it! 😃
