tnt305/Visual-Question-Answering

ABOUT VISUAL QA

Visual Question Answering (VQA) is an interdisciplinary research area at the intersection of computer vision (CV) and natural language processing (NLP). It aims to develop AI systems that can understand an image and answer natural-language questions about it. This project focuses on binary (yes/no) VQA.

Download dataset

Note: If you can't download the dataset with gdown because the file has exceeded its download quota, download it manually, upload it to your Google Drive, and then copy it from Drive into Colab:

```python
from google.colab import drive

drive.mount('/content/drive')
!cp /path/to/dataset/on/your/drive .
```

You can download the dataset via this Google Drive link, then unzip it with `!unzip -q vqa_coco_dataset.zip`.
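
If the quota allows, a Colab cell along the following lines can fetch and extract the archive with gdown. This is only a sketch: `FILE_ID` is a placeholder for the ID in the Google Drive link above, not the real value.

```python
# Sketch: download the archive with gdown and extract it.
# FILE_ID is a placeholder -- take the ID from the Google Drive link above.
import gdown
import zipfile

FILE_ID = "YOUR_FILE_ID"
gdown.download(id=FILE_ID, output="vqa_coco_dataset.zip", quiet=False)

# Equivalent to running `!unzip -q vqa_coco_dataset.zip` in a Colab cell.
with zipfile.ZipFile("vqa_coco_dataset.zip") as zf:
    zf.extractall(".")
```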

A quick look at our dataset
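
Since the layout of the extracted archive isn't spelled out here, a format-agnostic way to take that look is to walk the extracted folder. The directory name below is only a guess based on the zip file name; adjust it to whatever the archive actually unpacks to.

```python
# Quick, format-agnostic peek at the extracted dataset.
# "vqa_coco_dataset" is assumed from the zip name -- change it if the
# archive unpacks to a different folder.
import os

root_dir = "vqa_coco_dataset"
for root, dirs, files in os.walk(root_dir):
    # Print each directory, how many files it holds, and a few examples.
    print(root, "->", len(files), "files", files[:3])
```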

Additional

Challenges:

  • Understanding Images: Extracting meaningful information from images is challenging due to variations in lighting, viewpoint, occlusions, etc.
  • Understanding Questions: Interpreting natural language questions accurately and comprehensively is complex, especially considering the diverse ways humans can express the same concept.
  • Fusion of Vision and Language: Integrating information from both visual and textual modalities effectively is crucial for accurate answers.

Approaches:

  • Feature Extraction: Extracting features from images using pre-trained convolutional neural networks (CNNs) like ResNet, VGG, or transformers like Vision Transformers (ViTs).
  • Language Understanding: Utilizing pre-trained language models (LMs) like BERT, GPT, or specific models for question understanding.
  • Fusion Techniques: Combining visual and textual features through methods like concatenation, attention mechanisms, or multi-modal embeddings (a minimal concatenation-based sketch follows this list).
  • Answer Prediction: Predicting answers using classification techniques, generating answers with sequence models, or using attention mechanisms.
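
To make the fusion and answer-prediction steps concrete, here is a toy sketch of concatenation-based fusion for binary (yes/no) VQA. It is illustrative only: the encoder outputs are stubbed with random tensors, and the class name and feature dimensions are assumptions, not the code in model.py.

```python
# Toy sketch: concatenation-based fusion for binary VQA (yes/no).
# Encoder outputs are faked with random tensors so the fusion and
# classification steps stand alone.
import torch
import torch.nn as nn

class ConcatFusionVQA(nn.Module):
    def __init__(self, img_dim=768, txt_dim=768, hidden=512, n_classes=2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),  # fuse by concatenation
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, n_classes),          # yes / no logits
        )

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([img_feat, txt_feat], dim=-1)
        return self.classifier(fused)

# Stand-ins for pooled ViT / RoBERTa features (batch of 4, 768-dim each).
img_feat = torch.randn(4, 768)
txt_feat = torch.randn(4, 768)
logits = ConcatFusionVQA()(img_feat, txt_feat)
print(logits.shape)  # torch.Size([4, 2])
```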

Getting Started

  • Install the required packages from requirements.txt.
  • Preprocess the original dataset with dataset/preprocess.py.
  • Our approach uses a Transformer-based text encoder together with a pretrained visual encoder; the models are defined in model.py and the training procedure in train.py. We also tested several methods to compare how different architectures perform on this task (a rough sketch of the CLIP variant follows the table):
| Approach | Accuracy |
| --- | --- |
| CNN x LSTM | 69.57% |
| ViT x RoBERTa | 66.7% |
| Pretrained CLIP | 75.57% |
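
As a rough, hypothetical sketch of the best-scoring setting above (pretrained CLIP), the following shows how such a yes/no classifier could be assembled with Hugging Face transformers. The checkpoint name ("openai/clip-vit-base-patch32"), the classification head, and its dimensions are assumptions for illustration; the actual implementation lives in model.py.

```python
# Hedged sketch of a CLIP-based binary VQA classifier.
# Checkpoint, head, and dimensions are assumptions, not the repo's model.py.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

class ClipBinaryVQA(nn.Module):
    def __init__(self, ckpt="openai/clip-vit-base-patch32", hidden=512):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(ckpt)
        embed_dim = self.clip.config.projection_dim  # 512 for this checkpoint
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # yes / no logits
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.clip.get_image_features(pixel_values=pixel_values)
        txt = self.clip.get_text_features(input_ids=input_ids,
                                          attention_mask=attention_mask)
        return self.head(torch.cat([img, txt], dim=-1))

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = ClipBinaryVQA()

# Dummy image standing in for a COCO photo.
image = Image.new("RGB", (224, 224))
inputs = processor(text=["Is there a dog in the picture?"],
                   images=image, return_tensors="pt", padding=True)
logits = model(inputs["pixel_values"], inputs["input_ids"], inputs["attention_mask"])
print(logits.shape)  # torch.Size([1, 2])
```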

Below is a brief overview of our approach.
