Vector Store & Data Source updates #2

Merged
merged 69 commits into from
Oct 30, 2023

@a0x8o a0x8o commented Oct 24, 2023

No description provided.

github-actions bot and others added 30 commits September 30, 2023 02:36
Bump Version to v0.3.8+dev

---------

Co-authored-by: Jiashen Cao <[email protected]>
Adding support for `neuralforecast`. Fixes #1112.

```sql
DROP TABLE IF EXISTS AirData;

CREATE TABLE AirData (
    unique_id TEXT(30),
    ds TEXT(30),
    y INTEGER);

LOAD CSV 'data/forecasting/air-passengers.csv' INTO AirData;

DROP FUNCTION IF EXISTS Forecast;

CREATE FUNCTION Forecast FROM
(SELECT unique_id, ds, y FROM AirData)
TYPE Forecasting
PREDICT 'y'
HORIZON 12
LIBRARY 'neuralforecast';

SELECT Forecast(12);
```
One quick issue here is that `neuralforecast` needs `horizon` as a
parameter while training, unlike `statsforecast`. Thus, a better way to
call the UDF would be simply `SELECT Forecast();`, which is currently
unsupported. @xzdandy Please let me know your thoughts.

List of stuff yet to be done:

- [x] Incorporate `neuralforecast`
- [x] Fix `HORIZON` redundancy (UPDATE: Being fixed in #1121)
- [x] Reuse model with a lower horizon number
- [x] Add support for ~multivariate forecasting~ exogenous variables
- [x] Add tests
- [x] Add docs
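
The "reuse model with lower horizon" item can be illustrated with a small sketch. It assumes that a model trained for a fixed horizon can serve any shorter request by truncating its output. `TrainedModel` and `ForecastCache` are hypothetical names for illustration and are not part of EvaDB or `neuralforecast`.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TrainedModel:
    """Hypothetical stand-in for a model fitted with a fixed horizon."""

    horizon: int

    def predict(self) -> List[float]:
        # A real model would return forecasts; zeros keep the sketch runnable.
        return [0.0] * self.horizon


class ForecastCache:
    """Reuse a trained model whenever the requested horizon fits inside it."""

    def __init__(self) -> None:
        self._models: Dict[str, TrainedModel] = {}
        self.train_calls = 0  # counts how often training was triggered

    def _train(self, horizon: int) -> TrainedModel:
        self.train_calls += 1
        return TrainedModel(horizon=horizon)

    def forecast(self, name: str, horizon: int) -> List[float]:
        model = self._models.get(name)
        if model is None or model.horizon < horizon:
            # No model yet, or the cached one is too short: train again.
            model = self._train(horizon)
            self._models[name] = model
        # A longer-horizon model covers the request; keep the first steps only.
        return model.predict()[:horizon]
```

With this sketch, requesting 12 steps and then 6 steps trains once, while a later request for 24 steps triggers retraining.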

---------

Co-authored-by: xzdandy <[email protected]>
- [x] GitHub Data Source Integration
- [x] Batching support for the native storage engine. We cannot do batching
in the storage engine because it does not work with `LIMIT`, so the change was reverted.
- [x] Full NamedUser table support
- [x] Enable CircleCI local PR cache for testmondata
- [x] Native storage engine `read` refactor
- [x] Testcases
- [x] GitHub data source documentation
The first step toward automatic index updates on insertions.

This replaces the old index-creation path, which read data directly
from the storage engine.

Index creation now reads data from its child plans: SeqScan and Storage.
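
The idea of building an index from child plans rather than from storage can be sketched as a chain of operators. This is a toy illustration, not EvaDB's executor code: the class names, the `exec()` generator protocol, and the dictionary used as an "index" are all assumptions made for the example.

```python
from typing import Dict, Iterator, List

Row = Dict[str, object]


class StorageScan:
    """Hypothetical leaf operator: yields batches of rows from a table."""

    def __init__(self, rows: List[Row], batch_size: int = 2) -> None:
        self.rows = rows
        self.batch_size = batch_size

    def exec(self) -> Iterator[List[Row]]:
        for start in range(0, len(self.rows), self.batch_size):
            yield self.rows[start:start + self.batch_size]


class SeqScan:
    """Hypothetical scan operator: projects columns from its child's batches."""

    def __init__(self, child: StorageScan, columns: List[str]) -> None:
        self.child = child
        self.columns = columns

    def exec(self) -> Iterator[List[Row]]:
        for batch in self.child.exec():
            yield [{c: row[c] for c in self.columns} for row in batch]


class CreateIndex:
    """Builds an index from whatever its child yields, never from storage."""

    def __init__(self, child: SeqScan, key: str, value: str) -> None:
        self.child = child
        self.key = key
        self.value = value
        self.index: Dict[object, object] = {}

    def exec(self) -> Dict[object, object]:
        for batch in self.child.exec():
            for row in batch:
                # A real vector index would add the embedding here.
                self.index[row[self.key]] = row[self.value]
        return self.index
```

Because `CreateIndex` only consumes batches from its child, the same ingestion path could in principle be fed by other plans later, which fits the stated goal of updating indexes on insertion.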
Added documentation for vector stores including usage examples,
dependencies and other requirements.
Break the feature into multiple PRs. 

We can merge this PR after
#1244.
- [x] Remove empty evadb.db file
- [x] Move `test_github_datasource.py` to long integration tests. Fix
#1251
- [x] Fix the failing
`test/integration_tests/long/test_create_table_executor.py::CreateTableTest::test_should_create_table_from_select`.
- [x] Update documentation with links
Removing table names from the `dataframe` during `df()` call. The users
can then easily load CSV files generated using `EvaDB` with the
`to_csv()` call at a later time (for long-running or expensive queries).

Example:

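```python
# Assumes `cursor` is an EvaDB cursor and `repo_name` is already defined.
select_query = cursor.query(
    f"SELECT * FROM {repo_name}_StargazerList;"
).df()

# Save the result so the expensive query does not have to be rerun.
select_query.to_csv("stargazers_list.csv", index=False)

# Later
cursor.query(
    f"""
    CREATE TABLE IF NOT EXISTS {repo_name}_StargazerList(
    github_username TEXT(1000));
    """
).df()

# Use an f-string so that {repo_name} is interpolated into the table name.
cursor.query(
    f"LOAD CSV 'stargazers_list.csv' INTO {repo_name}_StargazerList;"
).df()
```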

Do we need the table names for any use cases? For example, for duplicate
column names from two different functions - `object_detector_1.labels`
and `object_detector_2.labels`?
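
One possible answer to the duplicate-column question is to strip the table name only when the short name stays unique. The sketch below is a hypothetical helper operating on plain column-name strings. It is not what this change implements, and `strip_table_names` is an illustrative name.

```python
from collections import Counter
from typing import List


def strip_table_names(columns: List[str]) -> List[str]:
    """Drop the `table.` prefix unless that would create duplicate names."""
    short = [name.split(".")[-1] for name in columns]
    counts = Counter(short)
    # Keep the qualified name wherever the short name would collide.
    return [s if counts[s] == 1 else full for s, full in zip(short, columns)]
```

Under this approach, `object_detector_1.labels` and `object_detector_2.labels` would keep their prefixes, while a unique column such as `stargazerlist.github_username` would become `github_username`.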

---------

Co-authored-by: Andy Xu <[email protected]>
Co-authored-by: Andy Xu <[email protected]>
Users can now create a table with just `FLOAT` without providing the
dimensions.

Earlier:
```sql
CREATE TABLE ETTM1 (
        date TEXT(30),
        hufl FLOAT(5,7),
        hull FLOAT(5,7),
        mufl FLOAT(5,7),
        mull FLOAT(5,7),
        lufl FLOAT(5,7),
        lull FLOAT(5,7),
        ot FLOAT(5,7));
```

Now:
```sql
CREATE TABLE ETTM1 (
        date TEXT,
        hufl FLOAT,
        hull FLOAT,
        mufl FLOAT,
        mull FLOAT,
        lufl FLOAT,
        lull FLOAT,
        ot FLOAT);
```

Fixes #1260.

---------

Co-authored-by: Andy Xu <[email protected]>
…the query. (#1267)

- [x] Add basic functionality

Below is the example error message:

```
evadb.binder.binder_utils.BinderError: Cannnot find column name2. Did you mean name? The available columns are ['avatar_url', 'bio', 'blog', 'collaborators', 'company', 'contributions', 'disk_usage', 'email', 'events_url', 'followers', 'followers_url', 'following', 'following_url', 'gists_url', 'gravatar_id', 'hireable', 'html_url', 'id', 'invitation_teams_url', 'location', 'login', 'name', 'node_id', 'organizations_url', 'owned_private_repos', 'private_gists', 'public_gists', 'public_repos', 'received_events_url', 'repos_url', 'role', 'site_admin', 'starred_url', 'subscriptions_url', 'team_count', 'total_private_repos', 'twitter_username', 'type', 'url'].
```

**Limitation**: To keep the output clean, we only fuzzy-match on
column names and skip aliases.

- [x] Add testcases.
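
A "did you mean" suggestion of this kind can be sketched with Python's standard `difflib` module. The commit message does not say which matcher the binder uses, so `difflib.get_close_matches`, the `0.6` cutoff, and the function name below are assumptions made for illustration.

```python
import difflib
from typing import List


def column_not_found_message(name: str, available: List[str]) -> str:
    """Build an error message that suggests the closest column name, if any."""
    # Ask for at most one candidate; the cutoff filters out weak matches.
    matches = difflib.get_close_matches(name, available, n=1, cutoff=0.6)
    hint = f" Did you mean {matches[0]}?" if matches else ""
    return (
        f"Cannot find column {name}.{hint} "
        f"The available columns are {available}."
    )
```

When no candidate clears the cutoff, the sketch omits the suggestion and only lists the available columns.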
jarulraj and others added 29 commits October 8, 2023 19:25
Fix #1271, Fix #1265, Fix #1266

~~Not able to fix 11-similarity-search-for-motif-mining.ipynb due to
#1275~~
Updated the steps to create a new AI function with EvaDB.

---------

Co-authored-by: Andy Xu <[email protected]>
Reopen #1111.

---------

Co-authored-by: sudoboi <[email protected]>
Co-authored-by: Abhijith S Raj <[email protected]>
`text_summarization` uses `DROP UDF` instead of `DROP FUNCTION`.
Test default values of `chunk_size` and `chunk_overlap`
@a0x8o a0x8o merged commit c3b45b6 into alexxx-db:master Oct 30, 2023
1 check passed