Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for CONCAT operator #1193

Closed
2 tasks done
pchunduri6 opened this issue Sep 22, 2023 · 5 comments
Closed
2 tasks done

Support for CONCAT operator #1193

pchunduri6 opened this issue Sep 22, 2023 · 5 comments
Assignees
Labels
Milestone

Comments

@pchunduri6
Copy link
Contributor

pchunduri6 commented Sep 22, 2023

Search before asking

  • I have searched the EvaDB issues and found no similar feature requests.

Description

I'm planning to add a new CONCAT operator to EvaDB that combines context from multiple textual rows, inspired by the WINDOW operator introduced here:
https://www.vldb.org/pvldb/vol8/p1058-leis.pdf

Motivation

The current workflow for LLM-based text apps is as follows:

story_table (id INTEGER, paragraph TEXT(1000));

CREATE TABLE feature_table AS
SELECT SentenceFeatureExtractor(paragraph), paragraph FROM story_table;

Each row in the story_table contains a single paragraph in the input document. The resulting feature_table contains the embedding for a single paragraph in each row.

Any subsequent processing, such as similarity search or indexing, only works at the paragraph level. This often results in poor LLM accuracy. For example, below are the results of a sample question using the story_qa app:

Enter your question (type 'exit' to stop): How is Natasha described?

Natasha is not described in this particular context. The context mentions a question asked to Natasha about someone else and then moves on to various conversations and descriptions of other people present in the scene.

If we could instead combine paragraphs while creating the embeddings and index, we have better control over the context given to the LLM model. The output for the same question when concatenating the input with 1 preceding and 1 succeeding paragraph:

Natasha is described as a "black-eyed, wide-mouthed girl not pretty but full of life," who is at the "charming age when a girl is no longer a child though the child is not yet a young woman." She is also prone to bouts of laughter and seems to have a close relationship with her mother.

The existing GROUP BY operator combines multiple paragraphs into a new group. However, it can only group multiple paragraphs into a single Pythonic list and cannot add context to each individual data point.

Proposed workflow

The proposed CONCAT operator concatenates multiple rows to add additional context to each data point. The operator can only be used with a SELECT query to augment the textual column. It cannot filter or alter the existing tables.

An example workflow with the CONCAT operator is as follows:

story_table (id INTEGER, paragraph TEXT(1000));

SELECT id, paragraph FROM story_table
CONCAT BY paragraph
ROW BETWEEN 2 PRECEDING AND 2 FOLLOWING;

Here are a few things to discuss:

  1. This operation could be done on the app side as well, where users load text of different lengths into EvaDB. However, the optimal length is difficult to predict, and its impact on the dollar cost is non-trivial to calculate. Using an operator might allow us to optimize the dollar cost to the end user (see point 2).
  2. The token usage depends on the window size. Depending on the use case, the user could choose the window size to save on dollar cost - e.g., small windows if the document contains simple information like FAQ.
  3. The CONCAT operation (and subsequent processing) would be expensive for large documents due to the repeating of tokens. We must find ways to avoid or offset this cost.
  4. The related WINDOW operator is useful for structured data (as proposed in the paper) - it might have use cases in forecasting and statistical analysis. If possible, we could find a consistent syntax for structured + unstructured data.
  5. What does the CONCAT operator mean for audio and videos? GROUP BY might suffice for them.

@jarulraj @gaurav274 @xzdandy @jiashenC Any early feedback and discussion is highly appreciated, thanks!

Use case

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@pchunduri6 pchunduri6 self-assigned this Sep 22, 2023
@pchunduri6 pchunduri6 added the Feature Request ✨ New feature or request label Sep 22, 2023
@jarulraj
Copy link
Member

Another constraint to think about is with LLMs, we have a context limit -- so number of rows does not matter

SELECT id, paragraph FROM story_table
CONCAT BY paragraph
TOKEN BETWEEN 100 PRECEDING AND 100 FOLLOWING;

@jarulraj
Copy link
Member

Can we just use the underlying database system -- Postgres?

@xzdandy xzdandy added this to the v0.3.7 milestone Sep 22, 2023
@pchunduri6
Copy link
Contributor Author

pchunduri6 commented Sep 22, 2023

Thanks for the suggestions!
Postgres supports the CONCAT/WINDOW functionality. I verified that we can get the desired outcome using the following query:

cursor.query("""
  USE pg_db {
    SELECT id, string_agg(paragraph, '.')
    OVER (ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)
    FROM story_table; }
""")

To handle the context limit of LLM models, we could consider implementing a MAX_TOKEN constraint on the output of the above query. The models (e.g., SentenceTransformer) already truncate the input sentences based on the model limits (e.g., model.max_seq_length). The MAX_TOKEN constraint might still help us control the model cost more explicitly.

@xzdandy
Copy link
Collaborator

xzdandy commented Sep 22, 2023

I like the Postgres syntax better. Providing a window functionality and string_agg, which can also be used for creating a moving windows on videos, if replacing string_agg with some function like image_agg.

Given the following query works now, can we rely on the underlying database instead of introducing the window function to EvaDB for now, which is non trivial and requires high efforts?

cursor.query("""
  USE pg_db {
    SELECT id, string_agg(paragraph, '.')
    OVER (ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)
    FROM story_table; }
""")

If there is a usecase that we can not elegantly push down to the underlying database, please share. Thanks!

@pchunduri6
Copy link
Contributor Author

Sounds good. For now, let's go with Postgres queries for window function capabilities.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
Development

No branches or pull requests

3 participants