Support for `CONCAT` operator #1193

pchunduri6 · 2023-09-22T04:09:16Z

Search before asking

I have searched the EvaDB issues and found no similar feature requests.

Description

I'm planning to add a new CONCAT operator to EvaDB that combines context from multiple textual rows, inspired by the WINDOW operator introduced here:
https://www.vldb.org/pvldb/vol8/p1058-leis.pdf

Motivation

The current workflow for LLM-based text apps is as follows:

story_table (id INTEGER, paragraph TEXT(1000));

CREATE TABLE feature_table AS
SELECT SentenceFeatureExtractor(paragraph), paragraph FROM story_table;

Each row in the story_table contains a single paragraph in the input document. The resulting feature_table contains the embedding for a single paragraph in each row.

Any subsequent processing, such as similarity search or indexing, only works at the paragraph level. This often results in poor LLM accuracy. For example, below are the results of a sample question using the story_qa app:

Enter your question (type 'exit' to stop): How is Natasha described?

Natasha is not described in this particular context. The context mentions a question asked to Natasha about someone else and then moves on to various conversations and descriptions of other people present in the scene.

If we could instead combine paragraphs while creating the embeddings and index, we have better control over the context given to the LLM model. The output for the same question when concatenating the input with 1 preceding and 1 succeeding paragraph:

Natasha is described as a "black-eyed, wide-mouthed girl not pretty but full of life," who is at the "charming age when a girl is no longer a child though the child is not yet a young woman." She is also prone to bouts of laughter and seems to have a close relationship with her mother.

The existing GROUP BY operator combines multiple paragraphs into a new group. However, it can only group multiple paragraphs into a single Pythonic list and cannot add context to each individual data point.

Proposed workflow

The proposed CONCAT operator concatenates multiple rows to add additional context to each data point. The operator can only be used with a SELECT query to augment the textual column. It cannot filter or alter the existing tables.

An example workflow with the CONCAT operator is as follows:

story_table (id INTEGER, paragraph TEXT(1000));

SELECT id, paragraph FROM story_table
CONCAT BY paragraph
ROW BETWEEN 2 PRECEDING AND 2 FOLLOWING;

Here are a few things to discuss:

This operation could be done on the app side as well, where users load text of different lengths into EvaDB. However, the optimal length is difficult to predict, and its impact on the dollar cost is non-trivial to calculate. Using an operator might allow us to optimize the dollar cost to the end user (see point 2).
The token usage depends on the window size. Depending on the use case, the user could choose the window size to save on dollar cost - e.g., small windows if the document contains simple information like FAQ.
The CONCAT operation (and subsequent processing) would be expensive for large documents due to the repeating of tokens. We must find ways to avoid or offset this cost.
The related WINDOW operator is useful for structured data (as proposed in the paper) - it might have use cases in forecasting and statistical analysis. If possible, we could find a consistent syntax for structured + unstructured data.
What does the CONCAT operator mean for audio and videos? GROUP BY might suffice for them.

@jarulraj @gaurav274 @xzdandy @jiashenC Any early feedback and discussion is highly appreciated, thanks!

Use case

No response

Are you willing to submit a PR?

Yes I'd like to help by submitting a PR!

The text was updated successfully, but these errors were encountered:

jarulraj · 2023-09-22T04:41:53Z

Another constraint to think about is with LLMs, we have a context limit -- so number of rows does not matter

SELECT id, paragraph FROM story_table
CONCAT BY paragraph
TOKEN BETWEEN 100 PRECEDING AND 100 FOLLOWING;

jarulraj · 2023-09-22T05:41:50Z

Can we just use the underlying database system -- Postgres?

pchunduri6 · 2023-09-22T06:20:42Z

Thanks for the suggestions!
Postgres supports the CONCAT/WINDOW functionality. I verified that we can get the desired outcome using the following query:

cursor.query("""
  USE pg_db {
    SELECT id, string_agg(paragraph, '.')
    OVER (ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)
    FROM story_table; }
""")

To handle the context limit of LLM models, we could consider implementing a MAX_TOKEN constraint on the output of the above query. The models (e.g., SentenceTransformer) already truncate the input sentences based on the model limits (e.g., model.max_seq_length). The MAX_TOKEN constraint might still help us control the model cost more explicitly.

xzdandy · 2023-09-22T07:24:09Z

I like the Postgres syntax better. Providing a window functionality and string_agg, which can also be used for creating a moving windows on videos, if replacing string_agg with some function like image_agg.

Given the following query works now, can we rely on the underlying database instead of introducing the window function to EvaDB for now, which is non trivial and requires high efforts?

cursor.query("""
  USE pg_db {
    SELECT id, string_agg(paragraph, '.')
    OVER (ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)
    FROM story_table; }
""")

If there is a usecase that we can not elegantly push down to the underlying database, please share. Thanks!

pchunduri6 · 2023-09-26T00:41:29Z

Sounds good. For now, let's go with Postgres queries for window function capabilities.

pchunduri6 self-assigned this Sep 22, 2023

pchunduri6 added the Feature Request ✨ New feature or request label Sep 22, 2023

xzdandy mentioned this issue Sep 22, 2023

String CONCAT Function #1074

Closed

2 tasks

xzdandy added the User Experience label Sep 22, 2023

xzdandy added this to EVA Public Roadmap ⚡🚀 Sep 22, 2023

xzdandy moved this to Ideation in EVA Public Roadmap ⚡🚀 Sep 22, 2023

xzdandy added this to the v0.3.7 milestone Sep 22, 2023

pchunduri6 closed this as completed Sep 26, 2023

github-project-automation bot moved this from Ideation to Done in EVA Public Roadmap ⚡🚀 Sep 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for `CONCAT` operator #1193

Support for `CONCAT` operator #1193

pchunduri6 commented Sep 22, 2023 •

edited

Loading

jarulraj commented Sep 22, 2023

jarulraj commented Sep 22, 2023

pchunduri6 commented Sep 22, 2023 •

edited

Loading

xzdandy commented Sep 22, 2023

pchunduri6 commented Sep 26, 2023

Support for CONCAT operator #1193

Support for CONCAT operator #1193

Comments

pchunduri6 commented Sep 22, 2023 • edited Loading

Search before asking

Description

Motivation

Proposed workflow

Use case

Are you willing to submit a PR?

jarulraj commented Sep 22, 2023

jarulraj commented Sep 22, 2023

pchunduri6 commented Sep 22, 2023 • edited Loading

xzdandy commented Sep 22, 2023

pchunduri6 commented Sep 26, 2023

Support for `CONCAT` operator #1193

Support for `CONCAT` operator #1193

pchunduri6 commented Sep 22, 2023 •

edited

Loading

pchunduri6 commented Sep 22, 2023 •

edited

Loading