KALAPA CHALLENGE _ Vietnamese Medical Question Answering

Public Score

In the initial stage of the Challenge, a custom scoring metric was employed for evaluation purposes.

My score was 74.04, against 85.79 for the highest-scoring team, which placed me at rank 17. By accuracy alone, my result of 0.67 (against 0.7879 for the most accurate team) would rank around 9th or 10th. The table below reports my score and accuracy on a 0 to 1 scale.

Rank	Team name	Score	Accuracy
17	Thiên Đặng_AIO (me)	0.7404	0.6667

Description

Currently, with the development of modern language models, many chatbots and language assistants have been built to solve various problems. However, building a Vietnamese language model still faces many limitations.
In this challenge, participating teams will build a language model capable of answering multiple-choice questions (with one or more correct answers) in the medical field, based on the provided dataset.

Problem Statement

The challenge organizers provide data on common diseases. Each disease has one or two articles covering its causes, symptoms, prevention methods, and so on.

Input

Vietnamese multiple-choice questions, each with 2 to 6 options, with at least one correct option.

Output

Each participating system must return its answers as binary strings. For a question with n options, return a binary string of length n whose i-th character is 1 if the i-th option is correct and 0 if it is incorrect. For example, for a question with 5 options A, B, C, D, E, where the correct answers are B and E, the output should be 01001.
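As a minimal sketch of this output format, the mapping from correct option letters to the binary string could look like the following. The function name `encode_answers` is illustrative and is not taken from the repository.

```python
from string import ascii_uppercase


def encode_answers(num_options: int, correct: set[str]) -> str:
    """Return a binary string with '1' at the position of each correct option.

    Hypothetical helper: options are assumed to be lettered A, B, C, ...
    """
    if not 2 <= num_options <= 6:
        raise ValueError("a question has between 2 and 6 options")
    letters = ascii_uppercase[:num_options]
    chosen = {c.upper() for c in correct}
    if not chosen:
        raise ValueError("at least one option must be correct")
    if not chosen <= set(letters):
        raise ValueError(f"unknown option letters: {sorted(chosen - set(letters))}")
    return "".join("1" if letter in chosen else "0" for letter in letters)


# The example from the challenge description: options A-E, correct answers B and E.
print(encode_answers(5, {"B", "E"}))  # 01001
```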

Instructions

Ensure your system is capable of processing Vietnamese text effectively.
Develop a model that can accurately understand and respond to medical questions in Vietnamese.
Generate binary string outputs based on the correctness of options for each question.
Aim for high accuracy in answering questions based on the provided dataset.

Dataset

The organizers provide data on common diseases, each accompanied by informative articles covering various aspects of the disease, including causes, symptoms, preventive measures, etc.

Getting started

git clone https://github.com/tnt305/kalapa_challenge_2023.git

Installing

pip install -r requirements.txt

Download the embedding model mE5 from Hugging Face and convert it to ONNX

You can also test it with other multilingual models.

git clone https://huggingface.co/intfloat/multilingual-e5-small
python convert_onnx.py

Embed Medical Corpus into Vector Storage

python embed_corpus.py
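To illustrate what embedding a corpus into a vector store makes possible, here is a self-contained sketch of similarity search over stored passages. It is not the code in `embed_corpus.py`: `toy_embed` is a hypothetical hashed bag-of-words stand-in for the mE5 encoder, and `VectorStore` is an illustrative in-memory class, both assumed only for this example.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Hypothetical stand-in for a real encoder: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class VectorStore:
    """Illustrative in-memory store of (passage, embedding) pairs."""

    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, passage: str) -> None:
        self.items.append((passage, toy_embed(passage)))

    def search(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        """Return the k passages with the highest cosine similarity to the query."""
        q = toy_embed(query)
        # Vectors are unit length, so the dot product equals the cosine similarity.
        scored = [(sum(a * b for a, b in zip(q, emb)), passage)
                  for passage, emb in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = VectorStore()
store.add("article about influenza symptoms")
store.add("article about diabetes causes")
for score, passage in store.search("influenza symptoms", k=1):
    print(f"{score:.3f}  {passage}")
```

In a real pipeline the toy encoder would be replaced by the ONNX-exported embedding model, and the list would typically be replaced by a dedicated vector index.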

Run model

You can also try different model strategies by varying the prompting, the quantization, and the generation configuration.

python main.py --model model_in_the_selection --model_path huggingface_related_model
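One prompting strategy for this task can be sketched as follows: retrieved passages, the question, and lettered options are assembled into a single prompt, and the model's reply is reduced to a list of option letters. The helpers `build_prompt` and `parse_letters` and the prompt wording are assumptions made for illustration; they are not the prompts used in `main.py`.

```python
import re
from string import ascii_uppercase


def build_prompt(context: list[str], question: str, options: list[str]) -> str:
    """Assemble a retrieval-augmented multiple-choice prompt (hypothetical layout)."""
    lines = ["Context:"]
    lines += [f"- {passage}" for passage in context]
    lines += ["", f"Question: {question}", "Options:"]
    lines += [f"{letter}. {text}" for letter, text in zip(ascii_uppercase, options)]
    lines += ["", "Answer with the letters of all correct options, separated by commas."]
    return "\n".join(lines)


def parse_letters(reply: str, num_options: int) -> list[str]:
    """Keep only single-letter tokens that name a valid option, e.g. 'b, C' -> ['B', 'C']."""
    valid = set(ascii_uppercase[:num_options])
    tokens = re.split(r"[\s,;]+", reply.strip())
    return sorted({t.upper() for t in tokens if len(t) == 1 and t.upper() in valid})


prompt = build_prompt(
    context=["passage retrieved from the medical corpus"],
    question="example question text",
    options=["first option", "second option", "third option"],
)
print(prompt)
print(parse_letters("A, C", num_options=3))  # ['A', 'C']
```

This parser assumes the model replies with bare letters. A reply written as a full sentence would need stricter handling, since a single-letter word could be mistaken for an option.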

Anything that you should or want to understand about LLMs

If you're new to quantization, start by reading this.
If you find it hard, consider this.
More about text generation configuration.
Convert to ONNX.
Performance of different models that you might consider trying.

Sources:

The original idea is based on Viet and this paper.
