gtsearch

GTSearch is a search engine for questions about Georgia Tech. Its data comes from a domain-specific web crawler built on the Scrapy framework. Relevance is handled by vector similarity search: Pinecone serves as the vector database and returns the top k documents most similar to a query, and those documents are passed as context to the OpenAI API, which generates the answer.
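To make the retrieval step concrete, here is a minimal sketch of top-k selection by cosine similarity. It is not the project's code: in GTSearch this step is delegated to Pinecone, and the function names (`cosine`, `top_k`), the document ids, and the toy three-dimensional vectors below are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, docs, k=3):
    """docs is a list of (doc_id, vector) pairs; returns the k best (doc_id, score) pairs."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy index; real embeddings have hundreds of dimensions.
docs = [
    ("admissions", [1.0, 0.0, 0.0]),
    ("dining", [0.0, 1.0, 0.0]),
    ("housing", [0.7, 0.7, 0.0]),
]

results = top_k([1.0, 0.1, 0.0], docs, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

A vector database performs the same ranking over a much larger index, typically with approximate nearest-neighbour search rather than the exhaustive scan shown here.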

system design

System design for the crawling module

System design for the RAG module

Running Instructions

Install Scrapy using pip:

pip install scrapy

To run a crawl and insert the relevant documents into Pinecone:

scrapy crawl tsearch -o search.json

File organisation

Spiders: The 'spiders' folder contains the primary GTSearch spider along with the other middleware required to operate the web crawler. The spider includes custom logic that compares crawled text with a base text using vector similarity search, powered by FastEmbed. Documents judged relevant are then pushed into Pinecone, which serves as the vector database.
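The relevance check described above can be sketched as a similarity threshold applied to each crawled page. This is only an illustration under stated assumptions: the word-count `embed` function is a toy stand-in for a real embedding model such as FastEmbed, and `BASE_TEXT`, the example URLs, and the `0.2` threshold are hypothetical values, not taken from the project.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a sparse vector of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def is_relevant(page_text, base_text, threshold=0.2):
    """Keep a page only if it is similar enough to the base text (threshold is an assumption)."""
    return similarity(embed(page_text), embed(base_text)) >= threshold


# Hypothetical base text and crawled pages.
BASE_TEXT = "Georgia Tech campus courses admissions research"
pages = {
    "https://example.edu/a": "Admissions deadlines for Georgia Tech courses",
    "https://example.edu/b": "Ten easy pasta recipes for busy weeknights",
}

kept = [url for url, text in pages.items() if is_relevant(text, BASE_TEXT)]
print(kept)
```

In a crawler, a filter like this would run before the upload step so that only pages passing the threshold are embedded and stored.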

Server: The 'server' folder houses the Flask web server that hosts the search engine on the web. To run the server:

python app.py

The endpoint '/tsearch/search' is a POST endpoint that takes a user query, retrieves the top k documents relevant to that query from Pinecone, and passes them as context to the OpenAI API to generate the answer.
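The request flow behind such an endpoint can be sketched as a framework-independent handler. Everything here is an assumption for illustration: the `query` field name, the response shape, the prompt wording, and the `retrieve` and `generate` callables, which stand in for the Pinecone lookup and the OpenAI call. In a Flask view, the parsed JSON body would be passed in as `payload`.

```python
def handle_search(payload, retrieve, generate, k=3):
    """Return (response_body, http_status) for a search request.

    retrieve(query, k) -> list of document strings (stand-in for the vector database)
    generate(prompt)   -> answer string            (stand-in for the language model call)
    """
    query = (payload or {}).get("query", "")
    if not isinstance(query, str) or not query.strip():
        return {"error": "missing 'query'"}, 400

    documents = retrieve(query, k)
    context = "\n\n".join(documents)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return {"answer": generate(prompt), "sources": len(documents)}, 200


# Hypothetical stubs so the sketch runs without any external service.
def fake_retrieve(query, k):
    corpus = ["Doc one about parking.", "Doc two about dining.", "Doc three about housing."]
    return corpus[:k]


def fake_generate(prompt):
    return "stub answer"


body, status = handle_search({"query": "Where can I park?"}, fake_retrieve, fake_generate, k=2)
print(status, body)
```

Passing the retrieval and generation steps in as arguments keeps the handler testable without network access; the real services can be substituted at the call site.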

