For an in-depth look at how I put this project together, check out the accompanying blog post I wrote.
Retrieval-Augmented Generation (RAG) is a simple yet powerful approach that is going to completely revolutionize search. This project provides a peek into the capabilities of RAG: even in this simple example, it can significantly outperform keyword-based search techniques.
The idea behind this project was to create an advanced search engine for discovering Huberman Lab Podcast episodes on topics you are interested in. Unlike simple keyword-based search, this RAG-based approach takes into account the context and meaning of what you are trying to find rather than just matching keywords. The general approach was as follows:
- Scrape mp3 files from the RSS feed on the Huberman Lab website
- Generate transcriptions of all the mp3 files
- Break down individual episode transcriptions into smaller sections
- Generate word embeddings for each of these smaller sections
- Store all the embeddings in a vector database
- Create a Gradio frontend so users can use the search functionality
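
The chunk, embed, store, and search steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code used in this project: `chunk_words`, `toy_embed`, `build_index`, and `search` are hypothetical names, the bag-of-words vectors stand in for a learned embedding model, and the in-memory list stands in for a vector database. Because the toy vectors only count words, this sketch still matches on shared words; the semantic matching described above comes from swapping in real embeddings.

```python
import math
from collections import Counter


def chunk_words(text, size=50, overlap=10):
    """Split a transcript into overlapping windows of `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    # Stop once the remaining words are already covered by the previous window.
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def toy_embed(text):
    """Stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(episodes, size=50, overlap=10):
    """Stand-in for a vector database: a list of (title, chunk, vector)."""
    index = []
    for title, transcript in episodes.items():
        for chunk in chunk_words(transcript, size, overlap):
            index.append((title, chunk, toy_embed(chunk)))
    return index


def search(query, index, top_k=3):
    """Return the top_k (score, title, chunk) results for a query."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, vec), title, chunk) for title, chunk, vec in index]
    return sorted(scored, key=lambda result: result[0], reverse=True)[:top_k]
```

Calling `build_index` with a dictionary that maps episode titles to transcripts, then `search("some question", index)`, returns the best-matching sections along with the episode each one came from. Overlapping chunks are one common way to avoid splitting a thought across a chunk boundary; the window and overlap sizes here are arbitrary.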
Jason Liu, for the inspiration to do a RAG project
Huberman Lab Podcast, for all the valuable free information