This project is a simple RAG (retrieval-augmented generation) tool for asking questions about a set of vnexpress articles.
It demonstrates how easily RAG can be implemented without much-hyped frameworks such as LangChain or LlamaIndex. Therefore, people can integrate RAG into their own systems and projects.
This project uses:
- Scrapy for getting the plain text of articles from the vnexpress giao duc / tin tuc (education news) section
- Mistral Platform for both the embedding and language models: mistral-embed and open-mistral-nemo
- Upstash Vector for vector database
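To illustrate why no framework is needed, here is a minimal sketch of the retrieve-then-prompt step at the heart of RAG. It is not the project's code: the toy vectors stand in for mistral-embed output, the in-memory list stands in for Upstash Vector, and the function names (cosine_similarity, retrieve, build_prompt) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vector, index, top_k=2):
    """Return the top_k texts whose vectors are most similar to the query.

    `index` is a list of (vector, text) pairs; a real setup would query
    a vector database instead of scanning a list.
    """
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vector, item[0]),
        reverse=True,
    )
    return [text for _, text in ranked[:top_k]]


def build_prompt(question, passages):
    """Put the retrieved passages in front of the question for the language model."""
    context = "\n\n".join(passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Toy data: 2-dimensional vectors in place of real embeddings.
toy_index = [
    ([1.0, 0.0], "Passage about exams"),
    ([0.0, 1.0], "Passage about sports"),
    ([0.9, 0.1], "Passage about schools"),
]
passages = retrieve([1.0, 0.0], toy_index, top_k=2)
prompt = build_prompt("What changed in this year's exams?", passages)
```

The resulting prompt string is what would be sent to the language model; everything else is plain Python.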
Python 3.10
Install the required packages. See requirements.txt for extra information.
pip install -r requirements.txt
The Mistral API key and the Upstash API key are stored in .env:
MISTRAL_API_KEY=<key_here>
UPSTASH_VECTOR_REST_URL=<key_here>
UPSTASH_VECTOR_REST_TOKEN=<key_here>
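The scripts need these values at runtime. As a rough sketch of how a .env file like the one above can be read with only the standard library, assuming simple KEY=VALUE lines (load_env is a hypothetical helper; the project itself may rely on a package such as python-dotenv instead):

```python
import os
from pathlib import Path


def load_env(path=".env"):
    """Parse simple KEY=VALUE lines and copy them into os.environ.

    Returns the parsed values as a dict. A missing file yields an empty dict.
    """
    values = {}
    env_file = Path(path)
    if not env_file.exists():
        return values
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything without an '=' sign.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    # Variables already set in the environment take precedence over the file.
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

After calling load_env(), the keys can be read with os.environ["MISTRAL_API_KEY"] and so on.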
It is strongly advised to read each .py file before running any command. Doing so will help you understand the project better.
At the project root:
scrapy runspider --set FEED_EXPORT_ENCODING=utf-8 src/vnexpress_spider.py -o data/articles.jsonl
With the -o option, Scrapy won't overwrite data/articles.jsonl if it already exists; it appends to the file instead. If you want new data, you have to delete the file first.
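If you prefer to handle this from Python, the following is a minimal sketch that deletes a stale export and, after a crawl, loads the file to check what was written. Both helper names are hypothetical, and no assumption is made about which fields the spider emits.

```python
import json
from pathlib import Path


def reset_output(path="data/articles.jsonl"):
    """Delete a previous export so the next crawl starts from an empty file.

    Returns True if a file was there before the call.
    """
    output = Path(path)
    existed = output.exists()
    output.unlink(missing_ok=True)  # missing_ok requires Python 3.8+
    return existed


def read_jsonl(path="data/articles.jsonl"):
    """Load one JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Call reset_output() before running the spider, then len(read_jsonl()) afterwards to see how many articles were scraped.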
At the project root:
python src/setup_db.py
If a vector database already exists and data/articles.jsonl has changed, you should delete the database before running this command again.
You should definitely edit the query variable in src/rag.py.
At the project root:
python src/rag.py