This repository combines Deephaven with Weaviate in a simple example that could power a real-time semantic search engine. It uploads 80 books, each with a description and language, from the Skelebor/book_titles_and_descriptions dataset. The dataset itself contains nearly 2 million books, but this tool uses Weaviate's free tier, which is rate-limited at 100 items/hour, so 80 was chosen to leave room for searches on the data. The books are added to a Book class (which is created if it doesn't exist). From there, the Deephaven instance creates an input table called book_search. Add data to this table by double-clicking on cells and typing search terms. Click commit to trigger an update of the book_recommendations table with the top result (both book title and description) returned from Weaviate's vector search engine.
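As a rough illustration of the kind of request a search term turns into, the sketch below builds a Weaviate GraphQL nearText query against the Book class. It is not taken from this repository's scripts: the helper name build_book_query and the property names title and description are assumptions made for the example.

```python
import json


def build_book_query(search_term: str, limit: int = 1) -> str:
    """Build a GraphQL nearText query for the Book class.

    The property names (title, description) are assumed for illustration;
    the actual schema in this repository may differ.
    """
    # json.dumps quotes the term and escapes embedded quotes and backslashes.
    concept = json.dumps(search_term)
    return (
        "{ Get { Book("
        f"nearText: {{concepts: [{concept}]}}, limit: {limit}"
        ") { title description } } }"
    )


print(build_book_query("space pirates"))
# { Get { Book(nearText: {concepts: ["space pirates"]}, limit: 1) { title description } } }
```

With limit set to 1, only the closest match comes back, which mirrors the single top result shown in book_recommendations.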
Deephaven is a real-time, column-oriented database query engine that excels with real-time and big data. Weaviate is an open-source vector database.
Combining the two provides a powerful infrastructure that could be the backend for semantic search engines, recommendation engines, and much more. The framework provided by this application can be built upon in many ways - with a frontend UI, more vectors to search, etc.
This uses Deephaven run from Docker. For installation instructions and prerequisites, see here.
For Weaviate's installation instructions, see here.
This application uses Deephaven's application mode to perform configuration on startup. It requires the following environment variables to be set:
DH_PSK: The Deephaven pre-shared key. If this isn't set, the pre-shared key will be either an empty string or a randomly generated sequence of characters. If the latter, see How to configure and use pre-shared key authentication.
WEAVIATE_ENDPOINT: Your Weaviate cluster endpoint. If this isn't set properly, you will not be able to upload data to Weaviate or perform semantic search.
WEAVIATE_TOKEN: Your Weaviate authentication token. If this isn't set properly, you cannot connect to your Weaviate cluster.
HUGGINGFACE_TOKEN: Your Hugging Face token. If this isn't set properly, you will not be able to use its vectorization engines to vectorize text data, resulting in the inability to perform semantic search.
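Before starting the containers, it can help to confirm that these variables are present in your shell. The following is a minimal sketch and not part of this repository; the helper name missing_env_vars is hypothetical, and it only checks that each variable is non-empty, not that the values are valid.

```python
import os

# Variable names come from the list above.
REQUIRED_VARS = (
    "DH_PSK",
    "WEAVIATE_ENDPOINT",
    "WEAVIATE_TOKEN",
    "HUGGINGFACE_TOKEN",
)


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not str(env.get(name, "")).strip()]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Note that DH_PSK has a fallback when unset, as described above, so a missing value there is less severe than a missing Weaviate or Hugging Face setting.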
To see the scripts that run upon startup, see here.
To start the application, run:
docker compose up --build
The first time you start the application, it can take a few minutes, since it has to download the entire dataset. Wait for the Docker logs to display Server started on port 10000, then connect to http://localhost:10000/ide in your preferred web browser. The Deephaven IDE should appear with a default layout showing the tables book_search and book_recommendations on the bottom half. If it doesn't, you can bring them up via the Panels dropdown menu at the top right of the screen, or by importing the layout file found in data/storage/layouts via the same dropdown.
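If you would rather not watch the logs, one option is to poll the port until it accepts connections. The sketch below is an assumption-laden helper (wait_for_port is a hypothetical name, not something this repository provides), and an open port only suggests the server is listening; the Server started log line remains the authoritative signal.

```python
import socket
import time


def wait_for_port(host="localhost", port=10000, timeout=300.0, interval=2.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable); wait and retry.
            time.sleep(interval)
    return False


# Example usage (blocks for up to five minutes by default):
# if wait_for_port():
#     print("Open http://localhost:10000/ide in your browser")
```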
If you start this application more than once per hour, you will likely notice the following message appear numerous times on subsequent launches:
{'error': [{'message': 'update vector: failed with status: 429 error: Rate limit reached. You reached free usage limit (reset hourly). Please subscribe to a plan at https://huggingface.co/pricing to use the API at this rate'}]}
This, by itself, is fine, but subsequent entries to the book_search table will cause the application to crash. Exceeding the rate limit by attempting to perform a search causes an error, which ends the Python process, and subsequently the JVM and Deephaven.
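One way to soften this failure mode would be to detect the rate-limit response and skip the search instead of letting the error propagate. The sketch below is not how this application currently behaves, and it has not been verified against the actual startup scripts: is_rate_limited and safe_search are hypothetical helpers, and search_fn stands in for whatever function performs the Weaviate query. The response shape is taken from the error message shown above.

```python
def is_rate_limited(response):
    """Return True if a response dict carries a 429 / rate-limit error."""
    errors = response.get("error") or []
    for err in errors:
        message = str(err.get("message", "")) if isinstance(err, dict) else str(err)
        if "429" in message or "rate limit" in message.lower():
            return True
    return False


def safe_search(search_fn, term):
    """Call search_fn(term); return None instead of raising or returning an error."""
    try:
        result = search_fn(term)
    except Exception as exc:  # broad on purpose: keep the process alive
        print(f"Search for {term!r} failed: {exc}")
        return None
    if isinstance(result, dict) and is_rate_limited(result):
        print("Rate limit reached; try again after the hourly reset.")
        return None
    return result


sample = {"error": [{"message": "update vector: failed with status: 429 error: Rate limit reached."}]}
print(is_rate_limited(sample))  # True
```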
See the license, Deephaven's code of conduct, and Weaviate's code of conduct for usage guidelines.
This application was last updated in August of 2023, when Deephaven Community Core version 0.27.0 was latest. It is only guaranteed to work on Deephaven versions 0.27.0 and 0.27.1.