Support for CONCAT operator #1193
Comments
Another constraint to think about: with LLMs we have a context limit, so the number of rows does not matter.
Can we just use the underlying database system -- Postgres?
Thanks for the suggestions! To handle the context limit of LLM models, we could consider implementing a …
I like the Postgres syntax better. Providing window functionality in EvaDB is non-trivial and requires high effort. Given that the following query works now, can we rely on the underlying database instead of introducing the window function to EvaDB for now? If there is a use case that we cannot elegantly push down to the underlying database, please share. Thanks!
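(The concrete query referenced in this comment was not preserved here. As a minimal sketch of the kind of push-down being discussed -- assuming paragraphs are ordered by `id` and that standard Postgres window functions are used to pull in neighboring rows -- it might look like this:)

```sql
-- Illustrative Postgres query (not the one from the original comment):
-- attach the previous and next paragraph to each row via window functions.
SELECT
    id,
    COALESCE(LAG(paragraph)  OVER (ORDER BY id), '') || ' ' ||
    paragraph || ' ' ||
    COALESCE(LEAD(paragraph) OVER (ORDER BY id), '') AS paragraph_with_context
FROM story_table;
```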
Sounds good. For now, let's go with Postgres queries for window function capabilities.
Search before asking
Description
I'm planning to add a new CONCAT operator to EvaDB that combines context from multiple textual rows, inspired by the WINDOW operator introduced here: https://www.vldb.org/pvldb/vol8/p1058-leis.pdf
Motivation
The current workflow for LLM-based text apps is as follows:
```sql
CREATE TABLE story_table (id INTEGER, paragraph TEXT(1000));
```
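For concreteness, a minimal sketch of the rest of the current per-paragraph workflow -- the `SentenceFeatureExtractor` function name, the FAISS index clause, and the exact DDL are illustrative assumptions, not taken from the original issue:

```sql
-- Hypothetical sketch of the current per-paragraph workflow.
-- Function name and index syntax are assumptions for illustration.

-- Compute one embedding per paragraph (i.e., per row).
CREATE TABLE feature_table AS
    SELECT id, SentenceFeatureExtractor(paragraph) AS features
    FROM story_table;

-- Build a similarity index over the per-paragraph embeddings.
CREATE INDEX story_index ON feature_table (features) USING FAISS;
```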
Each row in the story_table contains a single paragraph of the input document. The resulting feature_table contains the embedding for a single paragraph in each row.

Any subsequent processing, such as similarity search or indexing, only works at the paragraph level. This often results in poor LLM accuracy. For example, below are the results of a sample question using the story_qa app:
If we could instead combine paragraphs while creating the embeddings and the index, we would have better control over the context given to the LLM model. The output for the same question when concatenating the input with 1 preceding and 1 succeeding paragraph:
The existing GROUP BY operator combines multiple paragraphs into a new group. However, it can only group multiple paragraphs into a single Pythonic list and cannot add context to each individual data point.

Proposed workflow
The proposed CONCAT operator concatenates multiple rows to add additional context to each data point. The operator can only be used with a SELECT query to augment the textual column; it cannot filter or alter the existing tables. An example workflow with the CONCAT operator is as follows:
```sql
CREATE TABLE story_table (id INTEGER, paragraph TEXT(1000));
```
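Since the original example query from the issue was not preserved, here is a hypothetical sketch of how the proposed operator might be used; the `1 PRECEDING` / `1 FOLLOWING` arguments, the `SentenceFeatureExtractor` function, and the aliases are illustrative assumptions rather than a confirmed syntax proposal:

```sql
-- Hypothetical sketch of the proposed CONCAT usage (not from the issue):
-- augment each paragraph with 1 preceding and 1 succeeding row before
-- computing embeddings, so each data point carries neighboring context.
CREATE TABLE feature_table AS
    SELECT id,
           SentenceFeatureExtractor(
               CONCAT(paragraph, 1 PRECEDING, 1 FOLLOWING)
           ) AS features
    FROM story_table;
```

Under this sketch, each row of feature_table would embed three consecutive paragraphs while keeping one row per original paragraph, unlike GROUP BY, which collapses the group into a single row.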
Here are a few things to discuss:

- The CONCAT operation (and subsequent processing) would be expensive for large documents due to the repetition of tokens. We must find ways to avoid or offset this cost.
- The WINDOW operator is useful for structured data (as proposed in the paper); it might have use cases in forecasting and statistical analysis. If possible, we could find a consistent syntax for structured + unstructured data.
- What would the CONCAT operator mean for audio and videos? GROUP BY might suffice for them.

@jarulraj @gaurav274 @xzdandy @jiashenC Any early feedback and discussion is highly appreciated, thanks!
Use case
No response
Are you willing to submit a PR?