This repo contains the dataset & code for exploring multimodal fusion-type transformer models (from Huggingface 🤗) for the task of visual question answering.
🗂️ Dataset Used: DAQUAR Dataset
Create a virtual environment & install the required packages using `pip install -r requirements.txt`.
In particular, we will require the following packages:
```
datasets==1.17.0
nltk==3.5
pandas==1.3.5
Pillow==9.0.0
scikit-learn==0.23.2
torch==1.8.2+cu111
transformers==4.15.0
dvc==2.9.3  # for automating the training pipeline
```
Note: It is best to have a GPU available to train the multimodal models (Google Colab can be used).
📝 Notebook: `VisualQuestionAnsweringWithTransformers.ipynb`
The `src/` folder contains all the scripts necessary for data processing & model training. All the configs & hyperparameters are specified in the `params.yaml` file.
The following are the important scripts for experimenting with the VQA models:
- `src/process_data.py`: Processes the raw DAQUAR dataset available in the `dataset/` folder & splits it into training & evaluation sets, along with the space of all possible answers (a minimal sketch of this step is shown below)
- `src/main.py`: Trains & evaluates the multimodal VQA model after loading the processed dataset
- `src/inference.py`: Uses a trained multimodal VQA model from a checkpoint to answer a question, given a reference image
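A minimal sketch of what the data processing step might look like (the file names and column layout of the raw data are assumptions for illustration; `src/process_data.py` contains the actual implementation):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical layout: a CSV of DAQUAR QA pairs with 'question', 'answer' & 'image_id' columns
qa_pairs = pd.read_csv("dataset/data.csv")

# The "answer space" is the set of all answers seen in the data;
# VQA is then framed as classification over this space
answer_space = sorted(qa_pairs["answer"].unique())
answer_to_idx = {ans: i for i, ans in enumerate(answer_space)}
qa_pairs["label"] = qa_pairs["answer"].map(answer_to_idx)

# Split into training & evaluation sets
train_df, eval_df = train_test_split(qa_pairs, test_size=0.2, random_state=42)
train_df.to_csv("dataset/data_train.csv", index=False)
eval_df.to_csv("dataset/data_eval.csv", index=False)
```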
After making necessary changes to the `params.yaml` file, the pipeline can be automated by running `dvc repro`. This will run the data processing & model training (& evaluation) stages.
For inference, run `python src/inference.py --config=params.yaml --img_path=<path-to-image> --question=<question>`
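Under the hood, inference roughly amounts to tokenizing the question, extracting image features, and picking the answer with the highest score. A minimal sketch of this flow with 🤗 Transformers (the checkpoint path, answer-space file, and the way the model is restored are assumptions; `src/inference.py` handles the details):

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoFeatureExtractor

# Preprocess the question & image the same way as during training
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

question = "what is on the table?"
image = Image.open("path/to/image.png").convert("RGB")

text_inputs = tokenizer(question, return_tensors="pt", truncation=True, padding=True)
image_inputs = feature_extractor(images=[image], return_tensors="pt")

# Assumption: the trained fusion model was saved as a whole object & the answer space
# was written out by src/process_data.py (paths here are hypothetical)
model = torch.load("checkpoints/model.pt", map_location="cpu")
model.eval()
with torch.no_grad():
    logits = model(input_ids=text_inputs["input_ids"],
                   attention_mask=text_inputs["attention_mask"],
                   pixel_values=image_inputs["pixel_values"])

answer_space = open("dataset/answer_space.txt").read().splitlines()
print(answer_space[logits.argmax(dim=-1).item()])
```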
- Text Transformers (for encoding questions):
  - BERT (Bidirectional Encoder Representations from Transformers): `'bert-base-uncased'`
  - RoBERTa (Robustly Optimized BERT Pretraining Approach): `'roberta-base'`
  - ALBERT (A Lite BERT): `'albert-base-v2'`
- Image Transformers (for encoding images):
  - ViT (Vision Transformer): `'google/vit-base-patch16-224-in21k'`
  - DeiT (Data-Efficient Image Transformer): `'facebook/deit-base-distilled-patch16-224'`
  - BEiT (Bidirectional Encoder representation from Image Transformers): `'microsoft/beit-base-patch16-224-pt22k-ft22k'`
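Conceptually, the fusion model pairs one of the text encoders above with one of the image encoders, fuses their pooled outputs, and classifies over the answer space. A rough sketch of this idea (class and argument names here are illustrative, not the exact ones used in `src/`):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class FusionVQAModel(nn.Module):
    """Illustrative late-fusion VQA model: encode question & image, concatenate, classify."""

    def __init__(self, num_answers,
                 text_model="bert-base-uncased",
                 image_model="google/vit-base-patch16-224-in21k",
                 fusion_dim=512):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained(text_model)
        self.image_encoder = AutoModel.from_pretrained(image_model)
        hidden = self.text_encoder.config.hidden_size + self.image_encoder.config.hidden_size
        self.fusion = nn.Sequential(nn.Linear(hidden, fusion_dim), nn.ReLU(), nn.Dropout(0.1))
        self.classifier = nn.Linear(fusion_dim, num_answers)

    def forward(self, input_ids, attention_mask, pixel_values):
        text_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        image_out = self.image_encoder(pixel_values=pixel_values)
        # Concatenate the pooled ([CLS]-style) representations from both encoders
        fused = torch.cat([text_out.pooler_output, image_out.pooler_output], dim=-1)
        return self.classifier(self.fusion(fused))
```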
To learn more about the Wu-Palmer Similarity Score, check out this video!
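In short, the Wu-Palmer score measures how close two word senses are in the WordNet taxonomy, giving partial credit for semantically related answers instead of requiring an exact match. A quick illustration with NLTK (requires the `wordnet` corpus; the word pair is just an example):

```python
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet")  # one-time download of the WordNet corpus

# Compare a predicted answer against the ground truth using their first synsets
pred, truth = wordnet.synsets("chair")[0], wordnet.synsets("table")[0]
print(pred.wup_similarity(truth))   # related but different answers get partial credit
print(truth.wup_similarity(truth))  # identical answers score 1.0
```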
| Text Transformer | Image Transformer | Wu & Palmer Score | Accuracy | F1 | No. of Trainable Parameters |
|---|---|---|---|---|---|
| BERT | ViT | 0.286 | 0.235 | 0.020 | 197M |
| BERT | DeiT | 0.297 | 0.246 | 0.027 | 197M |
| BERT | BEiT | 0.303 | 0.254 | 0.034 | 196M |
| RoBERTa | ViT | 0.294 | 0.246 | 0.025 | 212M |
| RoBERTa | DeiT | 0.291 | 0.242 | 0.028 | 212M |
| RoBERTa | BEiT | 0.308 | 0.261 | 0.033 | 211M |
| ALBERT | ViT | 0.265 | 0.215 | 0.018 | 99M |
| ALBERT | DeiT | 0.140 | 0.085 | 0.002 | 99M |
| ALBERT | BEiT | 0.220 | 0.162 | 0.017 | 98M |
Created with ❤️ by Tezan Sahu