Unstructured data to structured data conversion via `EXTRACT_COLUMN` #1338

hershd23 · 2023-11-03T08:26:30Z

Added custom function for extracting columns from unstructured data
new file: ../evadb/functions/extract_columns.py


hershd23 · 2023-11-03T08:27:35Z

@xzdandy I created a Python notebook as well, but it gets gitignored while the rest of the tutorial notebooks don't. Any idea why?

new file: 20-structured-data.ipynb

review-notebook-app · 2023-11-03T08:35:37Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

hershd23 · 2023-11-03T08:36:34Z

Solves #1235

evadb/functions/extract_columns.py

hershd23 · 2023-11-28T02:42:53Z

@xzdandy @pchunduri6 moved to a "one-column-at-a-time" implementation as you recommended.

The notebook has the implementation

modified: .gitignore
Added file to extract one column at a time
new file: evadb/functions/extract_column.py
Removed the previous implementation
deleted: evadb/functions/extract_columns.py
Updated the notebook
modified: tutorials/20-structured-data.ipynb

hershd23 · 2023-11-29T04:59:29Z

For the one-column-at-a-time implementation, I think this PR is ready for review @xzdandy @pchunduri6.

For the other changes discussed with either of you, I think it makes sense to take those up in a separate PR; otherwise this one will bloat. Let me know what you think.


evadb/functions/extract_column.py

tutorials/20-structured-data.ipynb

xzdandy · 2023-11-29T09:02:43Z

Can we also add a long integration test for the function under https://github.com/georgia-tech-db/evadb/tree/staging/test/integration_tests/long/functions? We can skip the test in CircleCI because it needs an OpenAI key, but I think it is good to have one.

It can either be end-to-end (i.e., SQL queries) or directly test the function class.
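
A minimal sketch of what a key-gated test could look like, assuming the key is supplied through an `OPENAI_API_KEY` environment variable. `run_extract_column` is a hypothetical stand-in for the real call into the function class (or a SQL query); it is not part of EvaDB.

```python
import os
import unittest


def openai_key_available() -> bool:
    """Return True when a non-empty OPENAI_API_KEY is set in the environment."""
    return bool(os.environ.get("OPENAI_API_KEY", "").strip())


def run_extract_column(text: str, field_name: str) -> str:
    """Hypothetical stand-in: replace with a call into the function under test."""
    raise NotImplementedError("wire this up to the real function class")


class ExtractColumnTest(unittest.TestCase):
    # The test is skipped (not failed) on machines without a key, such as CI.
    @unittest.skipUnless(openai_key_available(), "requires OPENAI_API_KEY")
    def test_extracts_a_single_field(self):
        text = "My laptop will not boot after the latest update."
        value = run_extract_column(text, field_name="issue_component")
        # LLM output is not deterministic, so only check the shape of the result.
        self.assertIsInstance(value, str)
        self.assertTrue(value)
```

Asserting only on the type and non-emptiness of the result keeps the test stable against variation in model output.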

hershd23 · 2023-11-29T21:01:56Z

Yes @xzdandy on it

modified: evadb/functions/extract_column.py
new file: test/integration_tests/long/functions/test_extract_column.py
modified: tutorials/20-structured-data.ipynb

hershd23 · 2023-11-30T07:26:34Z

Also, this is failing the linter check for a Colab notebook. Can you point me towards information on how to add that?

xzdandy · 2023-12-01T07:26:35Z

tutorials/20-structured-data.ipynb

We need to skip the notebook test at

evadb/script/test/test.sh

Line 97 in c2457b2

PYTHONPATH=./ python -m pytest --durations=5 --nbmake --overwrite "./tutorials" --capture=sys --tb=short -v --log-level=WARNING --nbmake-timeout=3000 --ignore="tutorials/08-chatgpt.ipynb" --ignore="tutorials/14-food-review-tone-analysis-and-response.ipynb" --ignore="tutorials/15-AI-powered-join.ipynb" --ignore="tutorials/16-homesale-forecasting.ipynb" --ignore="tutorials/17-home-rental-prediction.ipynb" --ignore="tutorials/18-stable-diffusion.ipynb" --ignore="tutorials/19-employee-classification-prediction.ipynb"

because it requires an OpenAI key.

xzdandy · 2023-12-01T07:35:33Z

> Also, this is failing the linter check for a Colab notebook. Can you point me towards information on how to add that?

Remove the last empty cell.
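
A minimal sketch of doing this programmatically, assuming the notebook is a standard `.ipynb` JSON file with a top-level `cells` list. The helper names are illustrative and are not part of the repository's formatting script.

```python
import json
from pathlib import Path


def cell_is_empty(cell: dict) -> bool:
    """A cell counts as empty when its source has no non-whitespace text."""
    source = cell.get("source", "")
    if isinstance(source, list):  # source may be a string or a list of strings
        source = "".join(source)
    return not source.strip()


def strip_trailing_empty_cells(notebook: dict) -> int:
    """Remove empty cells from the end of the notebook; return how many were removed."""
    cells = notebook.get("cells", [])
    removed = 0
    while cells and cell_is_empty(cells[-1]):
        cells.pop()
        removed += 1
    return removed


def clean_notebook_file(path: str) -> int:
    """Rewrite the notebook at `path` only if trailing empty cells were found."""
    nb_path = Path(path)
    notebook = json.loads(nb_path.read_text(encoding="utf-8"))
    removed = strip_trailing_empty_cells(notebook)
    if removed:
        text = json.dumps(notebook, indent=1, ensure_ascii=False) + "\n"
        nb_path.write_text(text, encoding="utf-8")
    return removed
```

For example, `clean_notebook_file("tutorials/20-structured-data.ipynb")` would drop any empty cells at the end of the tutorial and leave the rest untouched.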

hershd23 · 2023-12-01T22:31:43Z

12-01-2023 17:31:12 [check_notebook_format:295] ERROR: ERROR: Notebook /Users/hershdhillon23/projects/evadb/script/formatting/../../tutorials/20-structured-data.ipynb does not contain correct Colab link -- update the link.

I do not have a Colab link right now.

xzdandy · 2023-12-02T06:24:11Z

> 12-01-2023 17:31:12 [check_notebook_format:295] ERROR: ERROR: Notebook /Users/hershdhillon23/projects/evadb/script/formatting/../../tutorials/20-structured-data.ipynb does not contain correct Colab link -- update the link.
>
> I do not have a Colab link right now.

The current notebook does not actually work on Colab. I was trying to make it work yesterday, and I think it needs several modifications. One fix that would help: can you add `EXTRACT_COLUMN` to the bootstrap functions in https://github.com/georgia-tech-db/evadb/blob/staging/evadb/functions/function_bootstrap_queries.py?
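
A rough sketch of what such a bootstrap entry might look like, assuming the bootstrap module registers built-in functions through `CREATE FUNCTION IF NOT EXISTS ... IMPL '...'` query strings. The installation directory value, the helper, and the variable name below are placeholders for illustration; the actual module may organize this differently.

```python
# Hypothetical placeholder; the real module would resolve the installation directory itself.
EVADB_INSTALLATION_DIR = "/path/to/evadb"


def build_bootstrap_query(function_name: str, impl_relative_path: str, install_dir: str) -> str:
    """Build a query that registers a function from its implementation file."""
    return (
        f"CREATE FUNCTION IF NOT EXISTS {function_name} "
        f"IMPL '{install_dir}/{impl_relative_path}';"
    )


# Assumed name and path, taken from the PR title and the file added in this PR.
extract_column_function_query = build_bootstrap_query(
    "EXTRACT_COLUMN", "functions/extract_column.py", EVADB_INSTALLATION_DIR
)
```

With the function registered at startup, a notebook would not need its own `CREATE FUNCTION` cell pointing at a local file path, which is one of the things that tends to break on Colab.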

pchunduri6 · 2023-12-04T15:07:44Z

Should we perform this operation using ChatGPT directly or use something like pandasAI to write a function using LLM and then extract the column we need? Writing a function is much cheaper token cost-wise, but less robust.
@hershd23 @xzdandy Any thoughts?

xzdandy · 2023-12-04T20:08:54Z

> Should we perform this operation using ChatGPT directly or use something like pandasAI to write a function using LLM and then extract the column we need? Writing a function is much cheaper token cost-wise, but less robust.
> @hershd23 @xzdandy Any thoughts?

Hi @pchunduri6, I think it depends on the task. If the column extraction is pattern-based, I think we can generate a regex to save cost and improve efficiency. On the other hand, if the task is semantic, we need to rely on the LLM to extract the information.
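
The trade-off above can be sketched as a two-tier extractor: try a cheap regex first, and fall back to an LLM only when no pattern applies or matches. This is an illustration under stated assumptions, not the PR's implementation; the `PATTERNS` table, the field names, and the `llm` callable are all hypothetical.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern table: field name -> regex with one capturing group.
PATTERNS = {
    "order_id": re.compile(r"\border\s*#?\s*(\d{4,})\b", re.IGNORECASE),
    "email": re.compile(r"([\w.+-]+@[\w-]+(?:\.[\w-]+)+)"),
}


def extract_field(
    text: str,
    field: str,
    llm: Optional[Callable[[str, str], str]] = None,
) -> Optional[str]:
    """Extract `field` from `text`, preferring a regex over an LLM call."""
    pattern = PATTERNS.get(field)
    if pattern is not None:
        match = pattern.search(text)
        if match:
            return match.group(1)  # pattern-based: no tokens spent
    if llm is not None:
        return llm(text, field)  # semantic: delegate to the model
    return None
```

Here `llm` is any callable taking `(text, field)` and returning a string, so a real model client or a test double can be passed in without changing the extractor.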

Added custom function for extracting columns from unstructured data

26dc5bb

new file: ../evadb/functions/extract_columns.py

hershd23 requested review from xzdandy and gaurav274 November 3, 2023 08:26

hershd23 marked this pull request as draft November 3, 2023 08:26

Adding notebook for structured data conversion

38f52e3

new file: 20-structured-data.ipynb

hershd23 changed the title ~~[WIP] Unstructured data to structured data conversion~~ [WIP] Unstructured data to structured data conversion via EXTRACT_COLUMNS Nov 3, 2023

hershd23 requested a review from pchunduri6 November 3, 2023 08:36

xzdandy assigned hershd23 Nov 3, 2023

xzdandy added the High Effort 🏋 (Difficult solution or problem to solve) and AI Engines (Features, Bugs, related to AI Engines) labels Nov 3, 2023

xzdandy linked an issue Nov 3, 2023 that may be closed by this pull request

Introduce EXTRACT_COLUMNS to extract structured tables from unstructured text #1235

Open

2 tasks

xzdandy reviewed Nov 11, 2023

evadb/functions/extract_columns.py (outdated)

xzdandy reviewed Nov 11, 2023

evadb/functions/extract_columns.py (outdated)

hershd23 commented Nov 26, 2023

evadb/functions/extract_columns.py (outdated)

hershd23 force-pushed the extract_columns branch from cadedd1 to f38f866 November 28, 2023 02:49

hershd23 force-pushed the extract_columns branch from d84423f to a77dd26 November 28, 2023 03:06

Merge branch 'staging' into extract_columns

6aa79a4

hershd23 marked this pull request as ready for review November 29, 2023 04:58

Hersh Dhillon added 2 commits November 29, 2023 00:03

Linter checks

4d318f8

Merge branch 'extract_columns' of https://github.com/hershd23/evadb i…

4145896

…nto extract_columns

hershd23 changed the title ~~[WIP] Unstructured data to structured data conversion via EXTRACT_COLUMNS~~ Unstructured data to structured data conversion via EXTRACT_COLUMNS Nov 29, 2023

hershd23 changed the title ~~Unstructured data to structured data conversion via EXTRACT_COLUMNS~~ Unstructured data to structured data conversion via EXTRACT_COLUMN Nov 29, 2023

xzdandy reviewed Nov 29, 2023

evadb/functions/extract_column.py

xzdandy reviewed Nov 29, 2023

tutorials/20-structured-data.ipynb

tutorials/20-structured-data.ipynb

tutorials/20-structured-data.ipynb

Hersh Dhillon added 2 commits November 30, 2023 02:15

Adding test for extract_column

4010d49

modified: evadb/functions/extract_column.py
new file: test/integration_tests/long/functions/test_extract_column.py
modified: tutorials/20-structured-data.ipynb

Resolved review comments

6f9a1c2

xzdandy requested changes Dec 1, 2023


Hersh Dhillon added 2 commits December 1, 2023 17:08

Solving for linter changes due to the notebook test

0fc6771

Solving for linter changes

7b19174
