Vision is All You Need: V-RAG (Vision RAG) Demo

This is a demo of the Vision RAG (V-RAG) architecture.

The V-RAG architecture utilizes a vision language model (VLM) to embed pages of PDF files (or any other document) as vectors directly, without the tedious chunking process.

Demo video: Vision.is.All.You.Need.-.Demo.Video.mp4


Check out the background blog post: https://softlandia.fi/en/blog/building-a-rag-tired-of-chunking-maybe-vision-is-all-you-need

How does V-RAG work?

  1. The pages of a PDF file are converted to images.
    • In theory, these images can be anything, but the current demo uses PDF files, since the underlying model has been trained on PDF documents
    • pypdfium2 is used to convert the PDF pages to images
  2. The images are passed through a VLM to get the embeddings.
    • ColPali is used as the VLM in this demo
  3. The embeddings are stored in a database
    • Qdrant is used as the vector database in this demo
  4. The user passes a query to the V-RAG system
  5. The query is passed through the VLM to get the query embedding
  6. The query embedding is used to search the vector database for similar embeddings
  7. The user query and the images of the best matches from the search are then passed to a model that can understand images
    • We use GPT-4o or GPT-4o-mini in this demo
  8. The model generates a response based on the query and the images (a minimal end-to-end sketch follows this list)
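
To make the flow concrete, here is a minimal end-to-end sketch of steps 1-8 using the libraries named above (pypdfium2, ColPali via colpali_engine, Qdrant, and the OpenAI API). The checkpoint, collection name, render scale, and example query are illustrative assumptions, not necessarily the settings this demo uses internally.

```python
# End-to-end sketch of the V-RAG flow above. The library calls follow the
# public pypdfium2, colpali_engine, qdrant-client and openai APIs; the
# checkpoint, collection name, render scale and query are assumptions.
import base64
import io

import pypdfium2 as pdfium
import torch
from colpali_engine.models import ColPali, ColPaliProcessor
from openai import OpenAI
from qdrant_client import QdrantClient, models

# 1. Render the PDF pages to images.
pdf = pdfium.PdfDocument("example.pdf")
images = [pdf[i].render(scale=2.0).to_pil() for i in range(len(pdf))]

# 2. Embed every page with ColPali (one multi-vector per page).
checkpoint = "vidore/colpali-v1.2"  # assumed checkpoint
model = ColPali.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="cuda")
processor = ColPaliProcessor.from_pretrained(checkpoint)
with torch.no_grad():
    page_embeddings = model(**processor.process_images(images).to(model.device))

# 3. Store the multi-vectors in an in-memory Qdrant collection with MaxSim scoring.
qdrant = QdrantClient(":memory:")
qdrant.create_collection(
    collection_name="pages",
    vectors_config=models.VectorParams(
        size=page_embeddings.shape[-1],
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
)
qdrant.upsert(
    collection_name="pages",
    points=[
        models.PointStruct(id=i, vector=emb.float().cpu().numpy().tolist())
        for i, emb in enumerate(page_embeddings)
    ],
)

# 4.-6. Embed the user query and retrieve the best-matching pages.
query = "What does the architecture diagram show?"
with torch.no_grad():
    query_embedding = model(**processor.process_queries([query]).to(model.device))[0]
hits = qdrant.query_points(
    collection_name="pages",
    query=query_embedding.float().cpu().numpy().tolist(),
    limit=3,
)

# 7.-8. Send the query plus the matching page images to GPT-4o-mini.
def to_data_url(image) -> str:
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()

content = [{"type": "text", "text": query}] + [
    {"type": "image_url", "image_url": {"url": to_data_url(images[hit.id])}}
    for hit in hits.points
]
response = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```

The MaxSim comparator is what lets Qdrant score ColPali's multi-vector page embeddings against the multi-vector query embedding, mirroring ColPali's late-interaction retrieval.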

How to run the demo?

Make sure you have a Hugging Face account and that you are logged in to Hugging Face using transformers-cli login.

For the OpenAI API, you need an API key. You can get one here: https://platform.openai.com/account/api-keys

You can place the keys in the dotenv file:

OPENAI_API_KEY=
HF_TOKEN=
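
To check locally that the keys are picked up, a short python-dotenv snippet is enough; this assumes the file is named .env and lives in the project root.

```python
# Quick check that the dotenv file is readable; assumes python-dotenv is
# installed and the file is named .env in the current working directory.
import os

from dotenv import load_dotenv

load_dotenv()  # loads OPENAI_API_KEY and HF_TOKEN into the environment
print("OpenAI key set:", bool(os.getenv("OPENAI_API_KEY")))
print("HF token set:", bool(os.getenv("HF_TOKEN")))
```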

Then, you can run the demo by following these steps:

  1. Install Python 3.11 or higher
  2. pip install modal
  3. modal setup
  4. modal serve main.py
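
For orientation, a stripped-down Modal app with the same overall shape (a GPU function for embedding plus an ASGI web endpoint) might look like the sketch below. This is not the demo's actual main.py; the app name, image definition, and function names are hypothetical.

```python
# Hypothetical layout of a Modal app like this demo's -- not the real main.py.
import modal

app = modal.App("v-rag-sketch")
image = modal.Image.debian_slim().pip_install(
    "colpali-engine", "qdrant-client", "pypdfium2", "fastapi", "openai"
)

@app.function(image=image, gpu="A10G")
def embed_pages(page_pngs: list[bytes]) -> list[list[list[float]]]:
    """Run ColPali on rendered pages and return one multi-vector per page."""
    ...  # embedding code omitted

@app.function(image=image)
@modal.asgi_app()
def api():
    from fastapi import FastAPI

    web = FastAPI()

    @web.post("/collections")
    async def create_collection():
        ...  # render the uploaded PDF, call embed_pages.remote(), upsert to Qdrant

    @web.post("/search")
    async def search():
        ...  # embed the query, search Qdrant, call the OpenAI API

    return web
```

modal serve runs the app ephemerally with hot reload and prints the URL used in the next section; modal deploy (see below) publishes it permanently.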

How to use the demo from the provided API?

  1. Open your browser, go to the URL provided by Modal, and append /docs to it
  2. Click on the POST /collections endpoint
  3. Click on the Try it out button
  4. Upload a PDF file
  5. Click on the Execute button

This will index the PDF file into an in-memory vector database. Indexing takes some time, depending on the size of the PDF file and the GPU you are using in Modal. The current demo uses an A10G GPU.

You can now search for similar pages using the POST /search endpoint.

The endpoint sends the page images and the query to the OpenAI API and returns the response.
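
If you prefer scripting the calls over the Swagger UI, something along the following lines should work. The base URL comes from Modal, and the request field names (file, query) are assumptions; check the generated /docs page for the exact schemas.

```python
# Calling the Modal-hosted API from a script. BASE_URL and the field names are
# assumptions -- verify them against the /docs page of your deployment.
import requests

BASE_URL = "https://your-modal-app.modal.run"  # URL printed by `modal serve`

# Index a PDF (POST /collections).
with open("example.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/collections",
        files={"file": ("example.pdf", f, "application/pdf")},
    )
resp.raise_for_status()

# Ask a question about the indexed pages (POST /search).
resp = requests.post(f"{BASE_URL}/search", json={"query": "What is V-RAG?"})
resp.raise_for_status()
print(resp.json())
```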

Frontend

You can also use the frontend to interact with the API. To setup the frontend for local development, follow these steps:

  1. Install Node.js
  2. cd frontend
    • modify your .env.development file and add your VITE_BACKEND_URL
  3. npm install
  4. npm run dev

This will start the frontend at http://localhost:5173.

How to deploy the demo?

You can deploy the demo to Modal using the following steps:

  1. Modify your .env.production file in the frontend directory and add your VITE_BACKEND_URL for the production environment
  2. Build the frontend with npm run build; this will create a dist folder with the frontend bundle
  3. modal deploy main.py