Introduce `EXTRACT_COLUMNS` to extract structured tables from unstructured text #1235

xzdandy · 2023-09-29T03:35:29Z

Search before asking

I have searched the EvaDB issues and found no similar feature requests.

Description

EXTRACT_COLUMNS will be similar to EXTRACT_OBJECT for videos: it is not a standard user-defined function. In the optimizer, it will be translated into a valid EvaDB query plan tree with multiple functions and operators.

Example Usage

EXTRACT_COLUMNS(
    "gpt-3.5-turbo", 
    "faiss",
    [
        ["name", "name of the user profile", "logicx"], 
        ["country", "country the user comes from", "United States"],
        ["age", "age of the user", 30],
    ], 
    input_source
)

The first argument specifies the LLM to use.
The second argument specifies the vector database to use. If the second argument is "", RAG will not be used. The first release of EXTRACT_COLUMNS will not support RAG.
The third argument specifies the columns we want to extract. For every column, we specify:
- the name of the column
- a natural-language description of how to extract that column
- an example value; the column type is inferred from the example value.
The fourth argument is the input relation (input_source in the example above).
The output is a batched pandas DataFrame that contains the extracted columns, with a one-to-one mapping to the rows of the input relation.
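
To make the "type is inferred from the example value" rule concrete, here is a minimal Python sketch. The helper `infer_schema` is hypothetical and not part of EvaDB; it only assumes that each column is given as a `[name, description, example]` triple, as in the usage example above, and that the Python type of the example stands in for the column type.

```python
from typing import Any

# Hypothetical helper (not an EvaDB API): turn each
# [name, description, example] triple into a schema entry whose
# type is taken from the example value.
def infer_schema(columns: list[list[Any]]) -> list[tuple[str, str, type]]:
    schema = []
    for name, description, example in columns:
        schema.append((name, description, type(example)))
    return schema

columns = [
    ["name", "name of the user profile", "logicx"],
    ["country", "country the user comes from", "United States"],
    ["age", "age of the user", 30],
]
schema = infer_schema(columns)
for name, description, column_type in schema:
    print(f"{name}: {column_type.__name__} ({description})")
```

How these Python types would map onto EvaDB column types is left open here; that mapping is a design decision for the actual implementation.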

If we want to provide more fine-grained control, for example tuning hyperparameters, we can also introduce a CREATE FUNCTION, which allows a key-value based configuration.

@gaurav274 @jiashenC Please provide feedback. Thanks.

Use case

No response

Are you willing to submit a PR?

Yes I'd like to help by submitting a PR!


pchunduri6 · 2023-09-29T18:20:35Z

[
    ["name", "name of the user profile", "logicx"], 
    ["country", "country the user comes from", "United States"]
],

How would this translate to the LLM prompt in the background? For example, one prompt for each column, or a single prompt combining all columns?
LLM extraction is brittle, so careful prompt engineering is required. Is it safe to use this structure without providing the option to engineer the prompt?
With RAG queries, there is information loss, so accurate extraction will get trickier. Tracking the output accuracy could be challenging.

xzdandy · 2023-09-29T19:03:15Z

Hi @pchunduri6, very good feedback.

In the backend, this will be translated into something similar to what we have in stargazers (shown below). Optimizations like batching will be applied accordingly. There are also new opportunities such as merging, for example when some columns are extracted in the predicate while others are extracted in the projection.

GPT35Azure("You are given a block of disorganized text extracted from the GitHub user profile of a user using an automated web scraper. The goal is to get structured results from this data.
                Extract the following fields from the text: name, country, city, email, occupation, programming_languages, topics_of_interest, social_media.
                If some field is not found, just output fieldname: N/A. Always return all the 8 field names. DO NOT add any additional text to your output.
                The topic_of_interest field must list a broad range of technical topics that are mentioned in any portion of the text.  This field is the most important, so add as much information as you can. Do not add non-technical interests.
                The programming_languages field can contain one or more programming languages out of only the following 4 programming languages - Python, C++, JavaScript, Java. Do not include any other language outside these 4 languages in the output. If the user is not interested in any of these 4 programming languages, output N/A.
                If the country is not available, use the city field to fill the country. For example, if the city is New York, fill the country as United States.
                If there are social media links, including personal websites, add them to the social media section. Do NOT add social media links that are not present.
                Here is an example (use it only for the output format, not for the content):

                name: logicx
                country: United States
                city: Atlanta
                email: [email protected]
                occupation: PhD student at Georgia Tech
                programming_languages: Python, Java
                topics_of_interest: Google Colab, fake data generation, Postgres
                social_media: https://www.logicx.io, https://www.twitter.com/logicx, https://www.linkedin.com/in/logicx
                ", stargazerscrapeddetails.extracted_text
                )

I have very similar thoughts. We can let the user provide a full prompt, but non-advanced users may not know how to write a proper prompt for this purpose. The proposed interface is simpler and more user-friendly. I agree it can lose some accuracy, but power users can always write the fully customized query above in EvaDB. On this aspect, I am eager to see more feedback on the design.
The feedback on RAG is helpful. Is RAG useful for extracting column information, and if so, when? The current stargazers query does not use it, and it is also easier to implement EXTRACT_COLUMNS without RAG. We need to evaluate the effort and the gains.
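
As one possible answer to the "single prompt vs. one prompt per column" question, the sketch below builds a single combined prompt from the column triples and parses a `field: value` response back into a typed row. It reuses the wording of the stargazers prompt above, but `build_prompt` and `parse_response` are hypothetical names for illustration, not EvaDB functions, and no LLM is called; the response string is hand-written.

```python
# Hypothetical sketch (not EvaDB code): one combined prompt for all columns.
def build_prompt(columns):
    names = ", ".join(name for name, _, _ in columns)
    lines = [
        f"Extract the following fields from the text: {names}.",
        "If some field is not found, just output fieldname: N/A. "
        f"Always return all the {len(columns)} field names. "
        "DO NOT add any additional text to your output.",
    ]
    for name, description, _ in columns:
        lines.append(f"The {name} field is the {description}.")
    lines.append(
        "Here is an example (use it only for the output format, not for the content):"
    )
    lines.append("")
    for name, _, example in columns:
        lines.append(f"{name}: {example}")
    return "\n".join(lines)


# Parse "field: value" lines into one row; missing or N/A fields become None.
# Values are cast with the type of the example value. Note that bool examples
# would need special handling, since bool("False") is True.
def parse_response(response, columns):
    row = {name: None for name, _, _ in columns}
    types = {name: type(example) for name, _, example in columns}
    for line in response.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or key not in row or value == "N/A":
            continue
        try:
            row[key] = types[key](value)
        except ValueError:
            row[key] = None  # the model returned a value of the wrong type
    return row


columns = [
    ["name", "name of the user profile", "logicx"],
    ["country", "country the user comes from", "United States"],
    ["age", "age of the user", 30],
]
prompt = build_prompt(columns)
row = parse_response("name: alice\ncountry: N/A\nage: 41", columns)
```

A generated prompt like this cannot carry field-specific rules such as "use the city to fill the country", which is the accuracy trade-off discussed above.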

hershd23 · 2023-10-18T15:22:28Z

Hey, @gaurav274 introduced me to this issue.

Seems interesting. Can I take it up?

xzdandy · 2023-10-18T15:32:30Z

Hi @hershd23, thanks for your interest! Yes!

hershd23 · 2023-10-27T06:48:09Z

https://github.com/hershd23/eva-structure-gpt

I have something up as a quick-and-dirty POC. It is mostly for testing the prompt, which I built incrementally. I think this is good enough to start work on the function itself.

xzdandy added the AI Engines Features, Bugs, related to AI Engines label Sep 29, 2023

xzdandy added this to the v0.3.7 milestone Sep 29, 2023

xzdandy added this to EVA Public Roadmap ⚡🚀 Sep 29, 2023

xzdandy moved this to Ideation in EVA Public Roadmap ⚡🚀 Sep 29, 2023

xzdandy removed this from the v0.3.7 milestone Sep 30, 2023

xzdandy assigned hershd23 Oct 19, 2023

hershd23 linked a pull request Nov 3, 2023 that will close this issue

Unstructured data to structured data conversion via EXTRACT_COLUMN #1338

Open

xzdandy linked a pull request Nov 3, 2023 that will close this issue

Unstructured data to structured data conversion via EXTRACT_COLUMN #1338

Open
