This repository contains an implementation of the Vision-and-Language Transformer (ViLT) model fine-tuned for Visual Question Answering (VQA). The project is organized so that it is straightforward to set up and to experiment with different configurations and datasets.
- Clone the Repository

  ```bash
  git clone https://your-repository-url.git
  cd vilt-vqa
  ```

- Install Dependencies

  ```bash
  pip install -r requirements.txt
  ```
- Download Data

  Ensure that your data files are in the `data/` directory, as specified in `config/settings.py`.
To train the model, run:
```bash
python train.py
```

This script trains the model using the configuration specified in `config/settings.py`.
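The training script itself is not reproduced here, so the following is only a minimal sketch of what a ViLT VQA fine-tuning loop can look like, assuming the model is loaded through Hugging Face `transformers` (`ViltProcessor` and `ViltForQuestionAnswering` are real classes; the dummy dataset, checkpoint name, and hyperparameters are illustrative stand-ins for whatever `train.py` and `config/settings.py` actually define):

```python
# Minimal fine-tuning sketch, not this repo's train.py: it assumes a transformers-based
# ViLT VQA head and uses a single dummy example in place of the real dataset that
# would be built from data/ according to config/settings.py.
import torch
from torch.utils.data import DataLoader
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def collate(batch):
    # Each item: (PIL image, question string, soft answer-score vector of length num_labels).
    images, questions, targets = zip(*batch)
    inputs = processor(list(images), list(questions), return_tensors="pt",
                       padding=True, truncation=True)
    inputs["labels"] = torch.stack(targets)
    return inputs

# Stand-in for a real VQA dataset loaded from data/ (see config/settings.py).
dummy = [(Image.new("RGB", (384, 384)), "what is in the picture?",
          torch.zeros(model.config.num_labels))]
loader = DataLoader(dummy, batch_size=1, collate_fn=collate)

model.train()
for batch in loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss  # BCE-with-logits over the soft answer targets
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```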
To perform inference with a pre-trained model, run:
```bash
python infer.py --image_path 'path/to/image.jpg' --question 'What is in the picture?'
```
This will load the trained model and output the top predictions for the specified image and question.
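If you prefer to call the model directly instead of going through `infer.py`, the snippet below is a rough equivalent of what the script is expected to do, assuming a Hugging Face `transformers` ViLT checkpoint; the checkpoint name and the top-5 cutoff are illustrative, not taken from this repository:

```python
# Standalone inference sketch (assumed transformers-based ViLT checkpoint; swap in
# your own fine-tuned weights or the path configured in config/settings.py).
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model.eval()

image = Image.open("path/to/image.jpg").convert("RGB")
question = "What is in the picture?"

inputs = processor(image, question, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Report the five highest-scoring answers from the VQA answer vocabulary.
probs = logits.softmax(dim=-1)[0]
top = probs.topk(5)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {score:.3f}")
```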
Edit `config/settings.py` to modify paths, model parameters, and other settings such as the device configuration for GPU acceleration.
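The exact contents of `config/settings.py` are specific to this repository, but a configuration module of this kind typically looks something like the hypothetical sketch below; every name and value here is an assumption, shown only to indicate the sort of settings you may need to adjust:

```python
# config/settings.py (hypothetical sketch; the real file in this repo may differ)
import torch

# Paths
DATA_DIR = "data/"                 # where the VQA annotation/image files are expected
CHECKPOINT_DIR = "checkpoints/"    # where trained weights are written and read

# Model
PRETRAINED_MODEL = "dandelin/vilt-b32-finetuned-vqa"  # base ViLT checkpoint

# Training hyperparameters
BATCH_SIZE = 16
LEARNING_RATE = 5e-5
NUM_EPOCHS = 10

# Device configuration: use the GPU when available, otherwise fall back to CPU
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```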