Visual Question Answering (VQA) is an interdisciplinary research area at the intersection of computer vision (CV) and natural language processing (NLP). It aims to develop AI systems capable of understanding and answering questions about images. For a better understanding, you can look through this page; our project focuses on binary (yes/no) QA.
Note: If you can't download with gdown because the file has exceeded its download quota, please download it manually, upload it to your Google Drive, and then copy it from Drive into Colab:
from google.colab import drive
drive.mount('/content/drive')            # authorize access to your Google Drive
!cp /path/to/dataset/on/your/drive .     # copy the dataset archive into the Colab workspace
You can download the dataset via this Google Drive link, then unzip it using `!unzip -q vqa_coco_dataset.zip`.
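If you prefer to script this step, a small Python equivalent might look like the sketch below (assuming a recent `gdown`; `FILE_ID` is a placeholder for the Drive file's ID, which is not listed here):

```python
# Hypothetical download/extract helper; replace FILE_ID with the actual Drive file ID.
import zipfile

import gdown

gdown.download(id="FILE_ID", output="vqa_coco_dataset.zip", quiet=False)
with zipfile.ZipFile("vqa_coco_dataset.zip") as zf:
    zf.extractall(".")  # equivalent to `!unzip -q vqa_coco_dataset.zip`
```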
A quick look at our dataset:
- Understanding Images: Extracting meaningful information from images is challenging due to variations in lighting, viewpoint, occlusions, etc.
- Understanding Questions: Interpreting natural language questions accurately and comprehensively is complex, especially considering the diverse ways humans can express the same concept.
- Fusion of Vision and Language: Integrating information from both visual and textual modalities effectively is crucial for accurate answers.
- Feature Extraction: Extracting features from images using pre-trained convolutional neural networks (CNNs) like ResNet, VGG, or transformers like Vision Transformers (ViTs).
- Language Understanding: Utilizing pre-trained language models (LMs) like BERT, GPT, or specific models for question understanding.
- Fusion Techniques: Combining visual and textual features through methods like concatenation, attention mechanisms, or multi-modal embeddings.
- Answer Prediction: Predicting answers using classification techniques, generating answers with sequence models, or using attention mechanisms (see the sketch after this list).
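To make these components concrete, here is a minimal, self-contained sketch (assuming PyTorch and torchvision; the class and parameter names are illustrative and not taken from this repository) of a CNN x LSTM baseline with concatenation fusion and a classification head:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleVQA(nn.Module):
    """Illustrative baseline: ResNet image features + LSTM question features,
    fused by concatenation and classified over a fixed answer set."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, num_answers=2):
        super().__init__()
        # Image branch: pre-trained ResNet-18 with its final FC layer removed -> 512-d features
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])
        # Question branch: word embeddings + LSTM -> hidden_dim features
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Fusion by concatenation, then an MLP head over the answer vocabulary
        self.classifier = nn.Sequential(
            nn.Linear(512 + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, images, questions):
        img_feat = self.cnn(images).flatten(1)           # (B, 512)
        _, (h_n, _) = self.lstm(self.embed(questions))   # h_n: (1, B, hidden_dim)
        fused = torch.cat([img_feat, h_n.squeeze(0)], dim=1)
        return self.classifier(fused)                    # logits over the answer set
```

Swapping the ResNet for a ViT, or the LSTM for a Transformer text encoder, only changes the two branch definitions; the fusion and prediction head stay the same.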
- Install the required packages via `requirements.txt`.
- Preprocess the original dataset via `dataset/preprocess.py`.
- Our approach uses a Transformer-based text model and a pretrained visual model; they are defined in `model.py`, and the training process is in `train.py` (a rough sketch follows this list). We also tested several methods to explore how different architectures perform on this task.
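As a rough illustration of the CLIP-based variant (the exact architecture lives in `model.py`; the checkpoint name, frozen encoders, and layer sizes below are assumptions, not the repository's settings), the pretrained image and text encoders can be kept frozen and their embeddings fused for a binary classifier:

```python
import torch
import torch.nn as nn
from transformers import CLIPModel

class CLIPBinaryVQA(nn.Module):
    """Illustrative CLIP-based head: frozen CLIP image/text encoders,
    concatenated embeddings, and a small MLP for yes/no answers."""

    def __init__(self, clip_name="openai/clip-vit-base-patch32", num_answers=2):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)
        for p in self.clip.parameters():       # keep the pretrained encoders frozen
            p.requires_grad = False
        dim = self.clip.config.projection_dim  # 512 for the base checkpoint
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.clip.get_image_features(pixel_values=pixel_values)
        txt = self.clip.get_text_features(input_ids=input_ids,
                                          attention_mask=attention_mask)
        return self.classifier(torch.cat([img, txt], dim=-1))
```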
| Approach | Accuracy |
|---|---|
| CNN x LSTM | 69.57% |
| ViT x RoBERTa | 66.7% |
| Pretrained CLIP | 75.57% |
Below is a brief overview of our approach.