Unstructured data to structured data conversion via EXTRACT_COLUMN
#1338
base: staging
new file: ../evadb/functions/extract_columns.py
@xzdandy I created a Python notebook as well, but it gets gitignored while the rest of the tutorial notebooks don't. Any idea why? |
new file: 20-structured-data.ipynb
EXTRACT_COLUMNS
Solves #1235 |
@xzdandy @pchunduri6 I moved to a "one-column-at-a-time" implementation, as you recommended. The notebook has the implementation. |
cadedd1 to f38f866
modified: .gitignore
Added file to extract one column at a time
new file: evadb/functions/extract_column.py
Removed the previous implementation
deleted: evadb/functions/extract_columns.py
Updated the notebook
modified: tutorials/20-structured-data.ipynb
d84423f to a77dd26
For one column at a time, I think this PR is ready for review @xzdandy @pchunduri6. For the other changes discussed with either of you, I think it makes sense to take those up in a separate PR; otherwise this one will bloat. Let me know what you think. |
…nto extract_columns
EXTRACT_COLUMNS
EXTRACT_COLUMN
Can we also add a long integration test for the function under https://github.com/georgia-tech-db/evadb/tree/staging/test/integration_tests/long/functions? We can skip the test in CircleCI because it needs an OpenAI key, but I think it is good to have one. It can either be end-to-end (i.e., SQL queries) or test the function class directly. |
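One way to make such a test skippable when no key is configured is sketched below. It is only an illustration: the class name, the `call_extract_column` hook, and the use of the `OPENAI_API_KEY` environment variable are assumptions, not code from this PR, and the hook must be replaced with a real query or function-class call.

```python
import os
import unittest


def call_extract_column(text: str, field_name: str) -> str:
    """Hypothetical hook: replace with the real SQL query or function-class call."""
    # Until the real call is wired in, mark the test as skipped rather than failed.
    raise unittest.SkipTest("EXTRACT_COLUMN call not wired up in this sketch")


class ExtractColumnLongTest(unittest.TestCase):
    def setUp(self):
        # Assumption: the key is provided through the OPENAI_API_KEY variable.
        if not os.environ.get("OPENAI_API_KEY"):
            self.skipTest("requires OPENAI_API_KEY")

    def test_extracts_requested_field(self):
        text = "Order 1042 was delayed because of a payment failure."
        value = call_extract_column(text, "order_id")
        self.assertEqual(value.strip(), "1042")
```

Checking the variable in `setUp` rather than at import time means the skip decision reflects the environment at the moment the test runs.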
Yes @xzdandy on it |
modified: evadb/functions/extract_column.py
new file: test/integration_tests/long/functions/test_extract_column.py
modified: tutorials/20-structured-data.ipynb
Also, this is failing the linter check for a Colab notebook. Can you point me towards information on how to add that? |
We need to skip the notebook test at
Line 97 in c2457b2
PYTHONPATH=./ python -m pytest --durations=5 --nbmake --overwrite "./tutorials" --capture=sys --tb=short -v --log-level=WARNING --nbmake-timeout=3000 --ignore="tutorials/08-chatgpt.ipynb" --ignore="tutorials/14-food-review-tone-analysis-and-response.ipynb" --ignore="tutorials/15-AI-powered-join.ipynb" --ignore="tutorials/16-homesale-forecasting.ipynb" --ignore="tutorials/17-home-rental-prediction.ipynb" --ignore="tutorials/18-stable-diffusion.ipynb" --ignore="tutorials/19-employee-classification-prediction.ipynb" |
Remove the last empty cell. |
Do not have a Colab link right now |
The current notebook actually does not work on Colab. I was trying to make it work yesterday, and I think it needs several modifications. One fix that can help: can you add the |
Hi @pchunduri6, I think it depends on the task. If the column extraction is pattern-based, I think we can generate a regex to save cost and improve efficiency. On the other hand, if the task is semantic, we need to rely on the LLM to extract the information. |
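To illustrate the trade-off, here is a minimal sketch of a regex-first path with an optional LLM fallback. All names here (`EMAIL_RE`, `extract_by_pattern`, `extract_column`, `llm_fallback`) are hypothetical and not part of this PR; the fallback is passed in as a plain callable so the sketch makes no assumption about how the LLM is actually invoked.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern for a pattern-based column (an email address).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_by_pattern(text: str, pattern: re.Pattern) -> Optional[str]:
    """Return the first match of `pattern` in `text`, or None if there is none."""
    match = pattern.search(text)
    return match.group(0) if match else None


def extract_column(
    text: str,
    pattern: Optional[re.Pattern] = None,
    llm_fallback: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Try the cheap regex first; call the LLM only when the regex finds nothing."""
    if pattern is not None:
        value = extract_by_pattern(text, pattern)
        if value is not None:
            return value
    if llm_fallback is not None:
        # Semantic extraction: delegate to whatever LLM call the caller supplies.
        return llm_fallback(text)
    return None
```

For example, `extract_column("Contact jane.doe@example.com for access", pattern=EMAIL_RE)` resolves entirely through the regex, so no LLM call is made for that row.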
Added custom function for extracting columns from unstructured data
new file: ../evadb/functions/extract_columns.py