Introduce EXTRACT_COLUMNS to extract structured tables from unstructured text #1235

Open
1 of 2 tasks
xzdandy opened this issue Sep 29, 2023 · 5 comments · May be fixed by #1338
Labels
AI Engines Features, Bugs, related to AI Engines

Comments

@xzdandy
Collaborator

xzdandy commented Sep 29, 2023

Search before asking

  • I have searched the EvaDB issues and found no similar feature requests.

Description

EXTRACT_COLUMNS will be similar to EXTRACT_OBJECT for videos: it is not a standard user-defined function. In the optimizer, it will be translated into a valid EvaDB query plan tree with multiple functions and operators.

Example Usage

EXTRACT_COLUMNS(
    "gpt-3.5-turbo", 
    "faiss",
    [
        ["name", "name of the user profile", "logicx"], 
        ["country", "country the user comes from", "United States"],
        ["age", "age of the user", 30],
    ], 
    input_source
)
  • The first argument specifies the LLM to use.
  • The second argument specifies the vector database to use. If the second argument is "", RAG will not be used. In the first release of EXTRACT_COLUMNS, we will not support RAG.
  • The third argument specifies the columns we want to extract. For every column, we specify:
    • the name of the column
    • a natural language description of how to extract that column
    • an example value; the column type is inferred from the example value.
  • The fourth argument is the input relation (input_source in the example above).
  • The output is a batched pandas DataFrame that contains the extracted columns. It is a one-to-one mapping to the rows of the input relation.
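To make the type-inference rule concrete, here is a minimal Python sketch of how the third argument could be validated and typed. `ColumnSpec` and `parse_column_specs` are hypothetical names used only for illustration; they are not part of EvaDB.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass
class ColumnSpec:
    """Hypothetical container for one [name, description, example] triple."""
    name: str
    description: str
    example: Any
    dtype: type


def parse_column_specs(raw_specs: Sequence[Sequence[Any]]) -> List[ColumnSpec]:
    """Validate the column list and infer each column type from its example."""
    specs = []
    for entry in raw_specs:
        if len(entry) != 3:
            raise ValueError(f"expected [name, description, example], got {entry!r}")
        name, description, example = entry
        # type() is exact: an example of 30 gives int, "logicx" gives str.
        specs.append(ColumnSpec(name, description, example, type(example)))
    return specs


specs = parse_column_specs([
    ["name", "name of the user profile", "logicx"],
    ["country", "country the user comes from", "United States"],
    ["age", "age of the user", 30],
])
for spec in specs:
    print(spec.name, spec.dtype.__name__)
```

A real implementation would still need to map these Python types onto EvaDB column types and decide what to do when the LLM returns a value that does not cast to the inferred type.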

If we want to provide finer-grained control, for example tuning hyperparameters, we can also introduce a CREATE FUNCTION variant, which allows a key-value based configuration.

@gaurav274 @jiashenC Please provide feedback. Thanks.

Use case

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@xzdandy xzdandy added the AI Engines Features, Bugs, related to AI Engines label Sep 29, 2023
@xzdandy xzdandy added this to the v0.3.7 milestone Sep 29, 2023
@xzdandy xzdandy moved this to Ideation in EVA Public Roadmap ⚡🚀 Sep 29, 2023
@pchunduri6
Contributor

[
    ["name", "name of the user profile", "logicx"], 
    ["country", "country the user comes from", "United States"]
], 
  1. How would this translate to the LLM prompt in the background? For example, one prompt for each column, or a single prompt combining all columns?
  2. The LLM extraction is brittle, so careful prompt engineering is required. Is it safe to use this structure without providing the option to engineer the prompt?
  3. With RAG queries, there is information loss, so accurate extraction will get trickier. Tracking the output accuracy could be challenging.
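The two prompting strategies raised in the first question can be sketched in Python as follows. This is only an illustration under assumptions: the function names and the prompt wording are hypothetical, not what EvaDB generates.

```python
from typing import List, Sequence, Tuple

Column = Tuple[str, str, object]  # (name, description, example value)


def build_combined_prompt(columns: Sequence[Column]) -> str:
    """Build one prompt that asks for every column at once."""
    names = ", ".join(name for name, _, _ in columns)
    lines = [
        f"Extract the following fields from the text: {names}.",
        "If some field is not found, just output fieldname: N/A.",
        "Field descriptions:",
    ]
    lines += [f"- {name}: {desc}" for name, desc, _ in columns]
    lines.append("Here is an example (use it only for the output format, not for the content):")
    lines += [f"{name}: {example}" for name, _, example in columns]
    return "\n".join(lines)


def build_per_column_prompts(columns: Sequence[Column]) -> List[str]:
    """Build one prompt per column, which means one LLM call per column per row."""
    return [build_combined_prompt([col]) for col in columns]


columns = [
    ("name", "name of the user profile", "logicx"),
    ("country", "country the user comes from", "United States"),
    ("age", "age of the user", 30),
]
print(build_combined_prompt(columns))
print(len(build_per_column_prompts(columns)), "per-column prompts")
```

The combined prompt needs a single LLM call per input row, while per-column prompts multiply the number of calls but keep each prompt narrowly focused.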

@xzdandy
Collaborator Author

xzdandy commented Sep 29, 2023

Hi @pchunduri6, very good feedback.

  1. In the backend, this will be translated into something similar to what we have in stargazers. Optimizations such as batching will be applied accordingly. There are also new opportunities such as merging, for example when some columns are extracted in a predicate while others are extracted in the projection.
GPT35Azure("You are given a block of disorganized text extracted from the GitHub user profile of a user using an automated web scraper. The goal is to get structured results from this data.
                Extract the following fields from the text: name, country, city, email, occupation, programming_languages, topics_of_interest, social_media.
                If some field is not found, just output fieldname: N/A. Always return all the 8 field names. DO NOT add any additional text to your output.
                The topics_of_interest field must list a broad range of technical topics that are mentioned in any portion of the text.  This field is the most important, so add as much information as you can. Do not add non-technical interests.
                The programming_languages field can contain one or more programming languages out of only the following 4 programming languages - Python, C++, JavaScript, Java. Do not include any other language outside these 4 languages in the output. If the user is not interested in any of these 4 programming languages, output N/A.
                If the country is not available, use the city field to fill the country. For example, if the city is New York, fill the country as United States.
                If there are social media links, including personal websites, add them to the social media section. Do NOT add social media links that are not present.
                Here is an example (use it only for the output format, not for the content):

                name: logicx
                country: United States
                city: Atlanta
                email: [email protected]
                occupation: PhD student at Georgia Tech
                programming_languages: Python, Java
                topics_of_interest: Google Colab, fake data generation, Postgres
                social_media: https://www.logicx.io, https://www.twitter.com/logicx, https://www.linkedin.com/in/logicx
                ", stargazerscrapeddetails.extracted_text
                )
  2. I have exactly the same thoughts. We can give users the option to engineer the full prompt, but non-advanced users may not know how to write a proper prompt for this purpose. The proposed interface is simpler and more user-friendly. I agree it can lose some accuracy, but power users can always write the fully customized query above in EvaDB. On this aspect, I am eager to see more feedback on the design.
  3. The feedback on RAG is helpful. Is RAG useful for extracting column information, and if so, when? The current stargazers does not use it. It is also easier to implement EXTRACT_COLUMNS without RAG. We need to weigh the effort against the gains.
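Whatever prompt is used, the "fieldname: value" responses still have to be turned back into rows. A minimal Python sketch of that step, assuming one LLM response per input row and treating N/A as a missing value (`parse_response` and `parse_batch` are hypothetical helpers, not EvaDB functions):

```python
from typing import Dict, List, Optional, Sequence


def parse_response(response: str, fields: Sequence[str]) -> Dict[str, Optional[str]]:
    """Turn 'field: value' lines into one row; missing or N/A fields become None."""
    row: Dict[str, Optional[str]] = {field: None for field in fields}
    for line in response.splitlines():
        # Split on the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if sep and key in row and value and value != "N/A":
            row[key] = value
    return row


def parse_batch(responses: Sequence[str], fields: Sequence[str]) -> List[Dict[str, Optional[str]]]:
    """Produce one output row per input response, preserving order."""
    return [parse_response(r, fields) for r in responses]


fields = ["name", "country", "social_media"]
rows = parse_batch(
    [
        "name: logicx\ncountry: United States\nsocial_media: https://www.logicx.io",
        "name: N/A\ncountry: Canada",
    ],
    fields,
)
print(rows)
```

The resulting list of dictionaries keeps the one-to-one mapping with the input rows and could be loaded into a pandas DataFrame; unexpected lines in the response are ignored rather than raising an error.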

@xzdandy xzdandy removed this from the v0.3.7 milestone Sep 30, 2023
@hershd23
Contributor

Hey, @gaurav274 introduced me to this issue.

Seems interesting. Can I take it up?

@xzdandy
Collaborator Author

xzdandy commented Oct 18, 2023

Hi @hershd23, thanks for your interest! Yes!

@hershd23
Contributor

https://github.com/hershd23/eva-structure-gpt

I have something up as a quick and dirty POC. It is mostly for testing the prompt, which I built incrementally. I think this is good enough to start work on the function itself.

3 participants