Visual Question Answering (VQA) is an interdisciplinary research area at the intersection of computer vision (CV) and natural language processing (NLP). It aims to develop AI systems capable of understanding and answering questions about images. For a better understanding, you can look through this page; our project focuses on binary (yes/no) QA.
Note: If you can't download with gdown because the file has exceeded its download quota, please download it manually, upload it to your Google Drive, and then copy it from Drive into Colab:
from google.colab import drive
drive.mount('/content/drive')            # authorize access to your Google Drive
!cp /path/to/dataset/on/your/drive .     # copy the dataset archive into the Colab workspace
You can download the dataset via this Google Drive link, then unzip it using `!unzip -q vqa_coco_dataset.zip`.
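If you prefer to script this step, a small Python equivalent might look like the sketch below (assuming a recent `gdown`; `FILE_ID` is a placeholder for the Drive file's ID, which is not listed here):

```python
# Hypothetical download/extract helper; replace FILE_ID with the actual Drive file ID.
import zipfile

import gdown

gdown.download(id="FILE_ID", output="vqa_coco_dataset.zip", quiet=False)
with zipfile.ZipFile("vqa_coco_dataset.zip") as zf:
    zf.extractall(".")  # equivalent to `!unzip -q vqa_coco_dataset.zip`
```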
A quick look at our dataset:
- Understanding Images: Extracting meaningful information from images is challenging due to variations in lighting, viewpoint, occlusions, etc.
- Understanding Questions: Interpreting natural language questions accurately and comprehensively is complex, especially considering the diverse ways humans can express the same concept.
- Fusion of Vision and Language: Integrating information from both visual and textual modalities effectively is crucial for accurate answers.
- Feature Extraction: Extracting features from images using pre-trained convolutional neural networks (CNNs) like ResNet, VGG, or transformers like Vision Transformers (ViTs).
- Language Understanding: Utilizing pre-trained language models (LMs) like BERT, GPT, or specific models for question understanding.
- Fusion Techniques: Combining visual and textual features through methods like concatenation, attention mechanisms, or multi-modal embeddings.
- Answer Prediction: Predicting answers using classification techniques, generating answers with sequence models, or using attention mechanisms (see the sketch after this list).
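To make these components concrete, here is a minimal, self-contained sketch (assuming PyTorch and torchvision; the class and parameter names are illustrative and not taken from this repository) of a CNN x LSTM baseline with concatenation fusion and a classification head:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleVQA(nn.Module):
    """Illustrative baseline: ResNet image features + LSTM question features,
    fused by concatenation and classified over a fixed answer set."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, num_answers=2):
        super().__init__()
        # Image branch: pre-trained ResNet-18 with its final FC layer removed -> 512-d features
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])
        # Question branch: word embeddings + LSTM -> hidden_dim features
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Fusion by concatenation, then an MLP head over the answer vocabulary
        self.classifier = nn.Sequential(
            nn.Linear(512 + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, images, questions):
        img_feat = self.cnn(images).flatten(1)           # (B, 512)
        _, (h_n, _) = self.lstm(self.embed(questions))   # h_n: (1, B, hidden_dim)
        fused = torch.cat([img_feat, h_n.squeeze(0)], dim=1)
        return self.classifier(fused)                    # logits over the answer set
```

Swapping the ResNet for a ViT, or the LSTM for a Transformer text encoder, only changes the two branch definitions; the fusion and prediction head stay the same.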
- Install the required packages via `requirements.txt`.
- Preprocess the original dataset via `dataset/preprocess.py`.
- Our approach uses a Transformer-based text model and a pretrained visual model; they are defined in `model.py`, and the training process is in `train.py` (a rough sketch follows this list). We also tested several methods to explore how different architectures perform on this task.
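As a rough illustration of the CLIP-based variant (the exact architecture lives in `model.py`; the checkpoint name, frozen encoders, and layer sizes below are assumptions, not the repository's settings), the pretrained image and text encoders can be kept frozen and their embeddings fused for a binary classifier:

```python
import torch
import torch.nn as nn
from transformers import CLIPModel

class CLIPBinaryVQA(nn.Module):
    """Illustrative CLIP-based head: frozen CLIP image/text encoders,
    concatenated embeddings, and a small MLP for yes/no answers."""

    def __init__(self, clip_name="openai/clip-vit-base-patch32", num_answers=2):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)
        for p in self.clip.parameters():       # keep the pretrained encoders frozen
            p.requires_grad = False
        dim = self.clip.config.projection_dim  # 512 for the base checkpoint
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.clip.get_image_features(pixel_values=pixel_values)
        txt = self.clip.get_text_features(input_ids=input_ids,
                                          attention_mask=attention_mask)
        return self.classifier(torch.cat([img, txt], dim=-1))
```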
| Approach | Accuracy |
|---|---|
| CNN x LSTM | 69.57% |
| ViT x RoBERTa | 66.7% |
| Pretrained CLIP | 75.57% |
Below is a brief overview of our approach.