Update Nov 2024

Fast vector search with FAISS, improve retrieval time by 10x
Local Llama replacing OpenAI's API, ensure absolute privacy - data never leaves computer

Installation

First install llama.cpp then host a prediction sever locally.

brew install llama.cpp
llama-server --hf-repo hugging-quants/Llama-3.2-1B-Instruct-Q8_0-GGUF --hf-file llama-3.2-1b-instruct-q8_0.gguf -c 2048

Then install and run the search engine normally

pip install -r requirements.txt
python main.py

Cetasearch should be read at 127.0.0.1:5000, ask away ~

Update March 4, 2023

Incorporate new ChatGPT API (gpt-3.5-turbo). Reinforcement Learning with Human Feedback helps this model achieve better performance than text-davinci-003 with 10% of the inference cost.

On February 7, 2023, Microsoft launched a new version of Bing that included ChatGPT to enhance search results. Inspired by this advancement and the movie "Avatar: The Way of Water," I created Cetasearch, a search engine that focuses on knowledge about the ocean and cetacean species such as whales, orcas, dolphins, and more. Cetasearch offers conversational answers with detailed links to the sources, which I hope could play a small role in the conservation of the ocean and marine wildlife.

The search engine has three main functionalities: a semantic search engine, generative text completion for conversational answers, and annotations that match answers with sources. Everything was built using publicly available NLP pre-trained models and APIs, such as Sentence transformer, OpenAI's GPT3.5 API, and multiple pre-trained LLMs during development.

Data crawling, text processing and indexing is pre-calculated in batch to save inference time

Due to my limited computational power of my computer, multiple measures were taken to narrow the scope of this project while maintaining the core functionalities:

Focus on the ocean and cetacean species. The data sources were limited to a small subset of Wikipedia articles centered around these topics.
Preprocess and index the source documents to save processing time.
Leverage OpenAI's GPT3.5 API for text completion. During the experimentation phase, I tried a few open-source LLM models as well, but OpenAI's API provided the best answers by a large margin.

Cetasearch provide a better answer to a generic answer than Google's current feature snippet!

Other open-ended questions got good answers as well.

Features

Currently available features

The following features are currently available:

A semantic search engine based on thousands of Wikipedia articles related to the ocean. Due to the tedious and messy nature of data processing, only the end results (processed text and indexes) has been included in the codebase.
The ability to generate conversational answers using OpenAI's API (text-davinci-003). Although GPT3.5 has been found to produce the best results, other text summarization and completion LLMs can be used in its place.
Annotation of generated answers at the sentence level, linking pieces of the answer to their source materials.

Future developments

Fact-checking of generated answers, which is a technically challenging problem that even large companies like Bing and Google are still working on. In the meantime, users will have to fact-check the answers themselves by referring to the annotated sources.
Annotation at the word/phrase level (possibly using non-maximum suppression, inspired by the computer vision bounding-box drawing task).
Scaling the search engine beyond limited pre-processed data focusing only on the ocean and whales. A possible next step is to develop a general-purpose search engine.
Improving response speed by finding ways to reduce the time it takes to rely on OpenAI's API.

Limitations

As previously noted, Cetasearch currently faces several limitations, including speed, the amount of indexed web documents, and the lack of a fact-checking module. These limitations can impact the accuracy and usefulness of search results, and efforts are underway to address these issues and enhance the functionality of the tool.

Installation

Make sure you include your OpenAI API key as an environment variable under OPENAI_API_KEY name. You can sign up for a 3-month free trial here and access your API key here
Install requirements pip install -r requirements.txt
Run python main.py to initiate the web server
Access the search UI using your web browser at 127.0.0.1:5000

Credits & Referrences

Wiki2txt - for text processing of wikipedia archive
Sentence transformer - main component of semantic search
OpenAI API - main component for text generation
Avatar - The way of water and the Whale conservation efforts by multiple organizations are the inspiration of this project.
Tech stack: Python (Numpy, Pandas, Flask,etc.), Transformer-based Large Language Models

Note: As of Feb 19, 2023, Microsoft's New Bing is still in beta testing and not yet release to the public.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
etc		etc
static		static
templates		templates
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
app.yaml		app.yaml
main.py		main.py
requirements.txt		requirements.txt
semantic_search_engine.py		semantic_search_engine.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Update Nov 2024

Installation

Update March 4, 2023

Features

Currently available features

Future developments

Limitations

Installation

Credits & Referrences

About

Releases

Packages

Languages

License

trantrikien239/cetasearch

Folders and files

Latest commit

History

Repository files navigation

Update Nov 2024

Installation

Update March 4, 2023

Features

Currently available features

Future developments

Limitations

Installation

Credits & Referrences

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages